Abstract
Primary immunodeficiency (PID) is characterised by recurrent and often life-threatening infections, autoimmunity and cancer, and it presents major diagnostic and therapeutic challenges. Although the most severe forms present in early childhood, the majority of patients present in adulthood, typically with no apparent family history and a variable clinical phenotype of widespread immune dysregulation: about 25% of patients have autoimmune disease, allergy is prevalent, and up to 10% develop lymphoid malignancies. Consequently, in sporadic PID genetic diagnosis is difficult and the role of genetics is not well defined. We addressed these challenges by performing whole genome sequencing (WGS) of a large PID cohort of 1,318 subjects. Analysis of coding regions of 886 index cases found disease-causing mutations in known monogenic PID genes in 8.2%, while a Bayesian approach (BeviMed1) identified multiple potential new disease-associated genes. Exploration of the non-coding space revealed deletions in regulatory regions which contribute to disease causation. Finally, a genome-wide association study (GWAS) identified novel PID-associated loci and uncovered evidence for co-localisation of, and interplay between, novel high penetrance monogenic variants and common variants (at the PTPN2 and SOCS1 loci). This begins to explain the contribution of common variants to variable penetrance and phenotypic complexity in PID. Thus, a cohort-based WGS approach to PID diagnosis can increase diagnostic yield while deepening our understanding of the key pathways determining variation in human immune responsiveness.
The phenotypic heterogeneity of PID leads to diagnostic difficulty, and almost certainly to an underestimation of its true incidence. Our cohort reflects this heterogeneity, though it is dominated by adult-onset, sporadic antibody deficiency-associated PID (AD-PID: comprising Common Variable Immunodeficiency (CVID), Combined Immunodeficiency (CID) and isolated antibody deficiency). Identifying a specific genetic cause of PID can facilitate definitive treatment including haematopoietic stem cell transplantation, genetic counselling, and the possibility of gene-specific therapy2–4 while contributing to our understanding of the human immune system5. Unfortunately, only 29% of patients with PID receive a genetic diagnosis6. The lowest diagnosis rate is in patients who present as adults, have no apparent family history, and in whom matching the clinical phenotype to a known genetic cause is difficult, as the phenotype can be surprisingly variable even in patients with the same genetic defect (in the UK PID cohort 78% of cases are adult and 76% sporadic6). Moreover, while over 300 monogenic causes of PID have been described7, the genotype-phenotype correlation in PID is complex. In CVID, for example, pathogenic variants in TACI (TNFRSF13B) occur in 10% of patients but typically have low disease penetrance, appearing to act as disease modifiers8. Furthermore, a common variant analysis of CVID identified two disease-associated loci, raising the possibility that common variants may impact upon clinical presentation9. We therefore investigated whether applying WGS across a “real world” PID cohort might illuminate the complex genetics of the range of conditions collectively termed PID.
Patient cohort
974 sporadic and familial PID patients, and 344 unaffected relatives, were recruited by collaborators as part of the United Kingdom NIHR BioResource - Rare Diseases program (NBR-RD; Supplementary Note). Of these, 886 were index cases who fell into one of the diagnostic categories of the European Society for Immunodeficiencies (ESID) registry diagnostic criteria (Fig. 1a; Supplementary Table 1). This cohort represents a third of CVID and half of CID patients registered in the UK10. Paediatric and familial cases were less frequent, in part reflecting prior genetic testing of more severe cases (Supplementary Fig. 1). Clinical phenotypes were dominated by adult-onset sporadic AD-PID: all had recurrent infections, 28% had autoimmunity, and 8% had malignancy (Fig. 1a-b, Supplementary Table 2), mirroring the UK national PID registry6.
Identification of Pathogenic Variants in Known Genes
We analysed coding regions of genes previously causally associated with PID11 (Methods). We identified 85 potentially causal variants in 73 index cases (8.2%) across 39 genes implicated in monogenic disease (Fig. 1c; Supplementary Table 3). 60 patients (6.8%) had a previously reported pathogenic variant in the disease modifier TACI (TNFRSF13B), increasing the diagnostic yield to 15.0% (133 patients). Interestingly, 5 patients with a monogenic diagnosis (in BTK, LRBA, MAGT1, RAG2, SMARCAL1) also had a pathogenic TACI variant. The diagnostic yield rose to 17.0% (151 patients) once novel causal variants in NFKB1 and ARPC1B, associated with PID only after our initial analysis, were included. Of the 85 monogenic variants we reported, 51 (60%) had not been previously described (Supplementary Table 3), and 4 were structural variants, including a single exon deletion, unlikely to have been detected by whole exome sequencing12.
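The yields quoted above follow directly from the patient counts over the 886 index cases. As a quick arithmetic check, a minimal Python sketch (the function and variable names are illustrative and not part of the study's pipeline):

```python
# Diagnostic yield as a percentage of the 886 index cases.
# Counts are taken from the text; names are illustrative only.
INDEX_CASES = 886


def yield_percent(n_patients: int, n_index: int = INDEX_CASES) -> float:
    """Return the diagnostic yield in percent, rounded to one decimal place."""
    return round(100.0 * n_patients / n_index, 1)


monogenic = yield_percent(73)        # known monogenic PID genes
taci_only = yield_percent(60)        # TACI disease-modifier variants
with_taci = yield_percent(133)       # monogenic plus TACI
with_new_genes = yield_percent(151)  # after adding NFKB1 and ARPC1B

print(monogenic, taci_only, with_taci, with_new_genes)
```

Because 73 + 60 = 133, the 60 TACI carriers appear to be counted in addition to the 73 monogenic cases, with the 5 patients carrying both counted once.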
We observed divergence from an expected clinical phenotype for causal variants in 14 genes: for instance, only 4 of the 8 STAT1 patients had the pathognomonic chronic mucocutaneous candidiasis13,14. A more remarkable example of phenotypic complexity was the case of a 40 year-old patient presenting with specific antibody deficiency and a premature stop variant at Arg328 in X-linked IL2RG, a defect expected to cause absent T and NK cells and death in infancy. We found that the mild phenotype could be ascribed to several independent somatic changes that reversed the premature stop codon, restoring both T and NK cell lineages (Fig. 1d and Supplementary Fig. 2).
Since many PID-associated genes were initially discovered in a small number of typically familial cases, it is perhaps not surprising that the phenotypes described do not reflect true clinical diversity. Thus, a cohort-based WGS approach to PID can provide a significant diagnostic yield even in a predominantly pre-screened and sporadic cohort, allows diagnoses which are not constrained by pre-existing assumptions about genotype-phenotype relationships, and suggests caution in the use of clinical phenotype in targeted gene screening and when interpreting PID genetic data.
An approach to identifying new PID-associated genes in a WGS cohort
We next sought to determine whether the cohort-based WGS approach could identify new genetic associations with PID. We developed a Bayesian inference procedure, named BeviMed1, to determine posterior probabilities of association (PPA) between each gene and case/control status of the 886 index cases and 9,283 unrelated controls (Methods). For each gene, we analysed variants with gnomAD minor allele frequency (MAF) <0.001 and Combined Annotation Dependent Depletion (CADD) score >=10. Genes with PPA>=0.18 are shown in Fig. 1e. There was a strong enrichment for known PID genes (Wilcoxon P<1×10⁻²⁰⁰), supporting this statistical approach. Two novel BeviMed-identified genes were subsequently causally associated with PID. NFKB1 had the strongest probability of disease association (PPA=1.0), driven by truncating heterozygous variants in 13 patients. Subsequent assessment of co-segregation, protein expression, and B cell phenotype in pedigrees established these as disease-causing variants, and consequently loss of function variants in NFKB1 as the most common monogenic cause of CVID15. Evidence of association of ARPC1B with PID (PPA=0.18) was driven by 2 functionally validated cases, one homozygous for a complex InDel16 and the other described below.
The discovery of both known and subsequently validated new PID genes using BeviMed underlines its effectiveness in cohorts of unrelated patients with sporadic disease. Many candidate genes identified by BeviMed remain to be functionally validated and, as the PID cohort grows, even very rare causes of PID (e.g. affecting 0.2% of cases) will be detectable with a high positive predictive value (Supplementary Fig. 3).
Identification of regulatory elements contributing to PID
Sequence variation within non-coding regions of the genome can have profound effects on spatial and temporal gene expression17 and would be expected to contribute to PID susceptibility. We combined rare variant and deletion events with a tissue-relevant catalogue of cis-regulatory elements (CREs)18 generated using promoter capture Hi-C (pcHi-C)19 in matching tissues to prioritise putative causal PID genes (Fig. 2a). Being underpowered to detect single nucleotide variants affecting CREs20, we limited our initial analysis to rare structural variants (SV) overlapping exon, promoter or ‘super-enhancer’ CREs of known PID genes. No homozygous deletion events affecting CREs were identified, so we sought CRE SV deletions that might cause disease through a candidate compound heterozygote (cHET) mechanism with either a heterozygous rare coding variant or another SV in a pcHi-C linked gene (Fig. 2a). Out of 22,296 candidate cHET deletion events, after filtering by MAF, functional score and known PID gene status, we obtained 10 events; the functional follow-up of three is described (Fig. 2b).
The LRBA and DOCK8 cHET variants (Supplementary Fig. 4) were functionally validated; the former was demonstrated to result in impaired surface CTLA-4 expression on Treg cells (Supplementary Fig. 5) whilst the latter led to DOCK8 deficiency as confirmed by flow cytometry (data not shown). Although in these two cases SV deletions encompassed both non-coding CREs and coding exons, the use of WGS PID cohorts to detect a contribution of CREs confined to the non-coding space would represent a major advance in PID pathogenesis and diagnosis. ARPC1B fulfilled this criterion, with its BeviMed association partially driven by a patient cHET for a novel p.Leu247Glyfs*25 variant resulting in a premature stop, and a 9Kb deletion spanning the promoter region including an untranslated first exon (Fig. 2c) that has no coverage in the ExAC database (http://exac.broadinstitute.org). Two first-degree relatives were heterozygous for the frameshift variant, and two for the promoter deletion (Fig. 2d). Western blotting demonstrated complete absence of ARPC1B (Fig. 2e) and, consistent with previous reports21, raised ARPC1A in platelets. ARPC1B mRNA was almost absent from mononuclear cells in the cHET patient and reduced in a clinically unaffected sister carrying the frameshift mutation (Fig. 2f). An allele specific expression assay demonstrated that the promoter deletion essentially abolished mRNA expression (Fig. 2g,h).
These examples show the utility of WGS for detecting compound heterozygosity for a coding variant and a non-coding CRE deletion, and demonstrate a further advantage of a WGS approach to PID diagnosis. Improvements in analysis methodology, cohort size and better annotation of regulatory regions will be required to explore the non-coding space more fully and discover new disease-causing genetic variants.
WGS identifies PID-associated telomere shortening
A striking example of WGS data providing more than just the linear genomic sequence is telomere length estimation from mapped and unmapped reads22. We validated this method by showing correlation with gender (Fig. 3a) and a particularly strong correlation with age (Supplementary Fig. 6) in 3,313 NBR-RD subjects (Methods). We demonstrated the effectiveness of this, the first large-scale application of WGS-based telomere length estimation, by replicating an association with the telomerase RNA component gene (TERC: Supplementary Table 4)23 and identifying several PID cases with short telomeres (Fig. 3b). Given that disruption of telomerase genes can cause PID24, we looked for potentially damaging coding variants in known telomere deficiency genes25 in these PID cases, identifying 3 subjects with novel variants potentially causative for telomerase deficiency (Fig. 3b). One had a homozygous defect in telomerase reverse transcriptase (TERT), a subunit of the telomerase complex. Two male siblings were found to have a hemizygous variant in dyskerin (DKC1), known to be associated with PID and X-linked dyskeratosis congenita26 (Fig. 3c). Therefore, WGS telomere length estimation can be used as an effective approach to identify PID patients with novel variants causing telomere shortening.
GWAS of the WGS cohort reveals novel PID-associated loci
The diverse clinical phenotype and variable within-family disease penetrance of PID may be in part due to stochastic events (e.g. unpredictable pathogen transmission) but may also have a genetic basis. We therefore performed a GWAS of common SNPs (MAF>0.05), restricted to 733 AD-PID cases (Fig. 1a) to reduce phenotypic heterogeneity, and 9,225 unrelated NBR-RD controls. We confirmed the known MHC association and identified additional loci with suggestive association (Fig. 4a, Supplementary Fig. 7). A GWAS of SNPs of intermediate frequency (0.005<MAF<0.05) identified a single locus incorporating TNFRSF13B (Fig. 4a, Supplementary Table 5, Extended Data Fig. 1), for which the lead p.Cys104Arg variant has been previously reported27.
To increase power, we conducted a fixed effect meta-analysis of the AD-PID GWAS with summary statistics data from an ImmunoChip study of 778 CVID cases and 10,999 controls9 (Fig. 4a, Supplementary Table 5). This amplified the MHC and 16p13.13 associations9, found an additional locus at 3p24.1 within the promoter region of EOMES (Extended Data Fig. 2), and a suggestive association at 18p11.21 proximal to PTPN2 (Extended Data Fig. 3). Conditional analysis of the MHC locus revealed independent signals at the Class I and Class II regions (Supplementary Fig. 8), driven by classical alleles HLA-B*08:01 and HLA-DRB1*15:01 (Methods) with amino-acid changes known to impact upon peptide binding (Fig. 4b).
We next sought to examine, genome-wide, the enrichment of non-MHC AD-PID associations in 9 other diseases (Extended Data Table 1). We found significant enrichment for allergic (e.g. asthma) and immune-mediated diseases (e.g. Crohn’s disease), which was not evident in Type 2 diabetes or coronary artery disease (Fig. 4c). This suggests that the common variant association between PID and other immune-mediated diseases extends beyond the 4 genome-wide loci to multiple sub-genome-wide associations, and that dysregulation of common pathways contributes to susceptibility to both. Understanding the impact of these interrelationships will be a complex process. For example, while variants in the HLA-DRB1 and 16p13.13 loci increase the risk of both PID and autoimmunity, those at the EOMES locus predispose to PID but protect from rheumatoid arthritis28 (Extended Data Fig. 2).
Given this observed enrichment, we sought to investigate whether candidate genes identified through large cohort association analysis of immune-mediated disease might have utility in prioritising novel candidate genes harbouring rare coding variation causal for PID. We used the data-driven capture-HiC omnibus gene score (COGS) approach19 to prioritise putative causal genes across the 4 non-MHC AD-PID loci identified by our meta-analysis, and assessed across 11 immune-mediated diseases (Supplementary Tables 5 and 6). Hypothesising that causal PID genes would be intolerant to protein-truncating variation, we computed an overall prioritisation score by taking the product of pLI (a measure of tolerance to loss of gene function) and COGS gene scores for each disease. Six protein coding genes had an above average prioritisation score in one or more diseases (Fig. 4d) which we examined for rare, potentially causative variants within our cohort. We identified a single protein truncating variant in ETS1, SOCS1 and PTPN2 genes, all occurring exclusively in PID patients in the NBR-RD cohort. None of the genes are recognised causes of PID despite their involvement in immune processes (Supplementary Discussion). The two cases with SOCS1 and PTPN2 variants were analysed further.
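The prioritisation step described above, taking the product of pLI and the COGS gene score per disease, can be sketched in Python as follows. This is a minimal illustration only: the gene names and scores are made up, and the reading of "above average" as "above the mean score within a disease" is an assumption, since the text does not specify how the average was taken.

```python
from statistics import mean
from typing import Dict, Set

# Hypothetical inputs: pLI per gene and COGS score per disease and gene.
# All values below are invented for illustration.
pli: Dict[str, float] = {"GENE_A": 0.9, "GENE_B": 0.1, "GENE_C": 0.5}
cogs: Dict[str, Dict[str, float]] = {
    "disease_1": {"GENE_A": 0.8, "GENE_B": 0.9, "GENE_C": 0.2},
}


def prioritisation_scores(
    pli: Dict[str, float], cogs: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Multiply each gene's COGS score by its pLI, separately per disease."""
    return {
        disease: {gene: pli[gene] * score for gene, score in genes.items()}
        for disease, genes in cogs.items()
    }


def above_average_genes(scores: Dict[str, Dict[str, float]]) -> Set[str]:
    """Genes whose score exceeds the per-disease mean in at least one disease."""
    flagged: Set[str] = set()
    for genes in scores.values():
        cutoff = mean(genes.values())  # assumed definition of "average"
        flagged.update(g for g, s in genes.items() if s > cutoff)
    return flagged


scores = prioritisation_scores(pli, cogs)
print(above_average_genes(scores))
```

A gene with a high COGS score but low pLI (GENE_B here) is down-weighted, which reflects the stated hypothesis that causal PID genes are intolerant to protein-truncating variation.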
The patient with a heterozygous protein-truncating SOCS1 variant (p.Met161Alafs*46) presented with CVID complicated by lung and liver inflammation and B cell lymphopenia (Supplementary Discussion, Supplementary Fig. 9). SOCS1 limits phosphorylation of targets including STAT1, and is a key regulator of IFN-γ signalling. SOCS1 haploinsufficiency in mice leads to B lymphopenia29,30, immune-mediated liver inflammation31 and colitis32. In patient T cell blasts SOCS1 was deficient and IFN-γ induced STAT1 phosphorylation was abnormal (Fig. 4e), consistent with SOCS1 haploinsufficiency causing PID. The patient also carries the SOCS1 pcHiC-linked 16p13.13 risk-allele identified in the AD-PID GWAS (Extended Data Fig. 4). Long read sequencing using Oxford Nanopore technology showed this to be in trans with the novel SOCS1-truncating variant (Methods); such compound heterozygosity raises the possibility that common and rare variants may combine to cause disease.
A more detailed example of an interplay between rare and common variants is provided by a family containing a novel PTPN2 premature stop-gain at p.Glu291 and a common autoimmunity-associated variant (Fig. 4f). PTPN2 encodes the non-receptor T-cell protein tyrosine phosphatase (TC-PTP), which negatively regulates immune responses by dephosphorylation of the proteins mediating cytokine signalling. PTPN2 deficient mice are B cell lymphopenic33,34, while inducible haematopoietic deletion of PTPN2 leads to B and T cell proliferation and autoimmunity35. The novel truncating variant was identified in a “sporadic” index case presenting with CVID at age 20; he had B lymphopenia (Supplementary Fig. 9), low IgG, symmetrical rheumatoid-like polyarthropathy, severe recurrent bacterial infections, splenomegaly and inflammatory lung disease. His mother, also heterozygous for the PTPN2 truncating variant, had systemic lupus erythematosus (SLE), insulin-dependent diabetes mellitus diagnosed at 42, hypothyroidism and autoimmune neutropenia (Supplementary Discussion). Gain-of-function variants in STAT1 can present as CVID (Supplementary Table 3) and TC-PTP, like SOCS1, reduces phosphorylated-STAT1 (Fig. 4g). Both mother and son demonstrated reduced TC-PTP expression and STAT1 hyperphosphorylation in T cell blasts, similar to the SOCS1 haploinsufficient patient above and to known STAT1 GOF patients; abnormalities that were more pronounced in the PTPN2 index case (Fig. 4h).
The index case, but not his mother, carried the G allele of variant rs2847297 at the PTPN2 locus, an expression quantitative trait locus (eQTL)36 previously associated with rheumatoid arthritis37. His brother, generally healthy apart from severe allergic nasal polyposis, was heterozygous at rs2847297 and did not inherit the rare variant (Fig. 4f). Allele-specific expression analysis demonstrated reduced PTPN2 transcription from the rs2847297-G allele, explaining the lower expression of TC-PTP and greater persistence of pSTAT1 in the index case compared to his mother (Fig. 4i). This in turn could explain the variable disease penetrance in this family, with PTPN2 haploinsufficiency alone driving autoimmunity in the mother, but with the additional impact of the common variant on the index case causing immunodeficiency (and perhaps reducing the autoimmune phenotype). The family illustrates the power of a cohort-wide WGS approach to PID diagnosis, by revealing both a new monogenic cause of disease, and how the interplay between common and rare genetic variants may contribute to the variable clinical phenotypes of PID.
In summary, we show that cohort-based WGS in PID is a powerful approach to provide immediate diagnosis of known genetic defects, and to discover new coding and non-coding variants associated with disease. Intriguingly, even with a limited sample size, we could explore the interface between common and rare variant genetics, explaining why PID encompasses such a complex range of clinical syndromes of variable penetrance. Increasing cohort size will be crucial for powering the analyses needed to identify both causal and disease-modifying variants, thus unlocking the potential of WGS for PID diagnosis. Improved analysis methodology and better integration of parallel datasets, such as GWAS and cell surface or metabolic immunophenotyping, will allow further exploration of the non-coding space and enhance diagnostic yield. Such an approach promises to transform our understanding of genotype-phenotype relationships in PID and related immune-mediated conditions, and could redefine the clinical boundaries of immunodeficiency, add to our understanding of human immunology, and ultimately improve patient outcomes.
Author Contributions
JEDT, ES, JS, ZZ, WR, NSG, PT, AJC carried out experiments. HLA, OSB, JEDT, JHRF, DG, IS, CP, SVVD, ASJ, JM, JS, PAL, AGL, KM, EE, DE, SFJ, THK, ET performed computational analysis of the data. HLA, IS, CP, MB, CS, RL, PJRM, JS, KES conducted sample and data processing. JEDT, ES, WR, MJT, RBS, PG, HEB, AW, SH, RL, MSB, KCG, DSK, SS, SOB, TWK, WHO, AJT recruited patients, provided clinical phenotype data and confirmed genetic diagnosis. All authors contributed to the analysis of the presented results. KGCS, JEDT, HLA and OSB wrote the paper with input from all other authors. KGCS, WHO, AJT and TWK conceived and oversaw the research programme.
The authors declare no competing financial interests.
Correspondence and requests for materials should be addressed to J.E.D.T. (jedt2{at}cam.ac.uk) and K.G.C.S. (kgcs2{at}cam.ac.uk)
Methods
PID cohort
The PID patients and their family members were recruited by specialists in clinical immunology across 26 hospitals in the UK, and one each from the Netherlands, France and Germany. The recruitment criteria were intentionally broad, and included the following: clinical diagnosis of common variable immunodeficiency disorder (CVID) according to internationally established criteria (Supplementary Table 1); extreme autoimmunity; or recurrent and/or unusual severe infections suggestive of defective innate or cell-mediated immunity. Patients with known secondary immunodeficiencies caused by cancer or HIV infection were excluded. Although screening for more common and obvious genetic causes of PID prior to enrolment into this WGS study was encouraged, it was not a requirement. Consequently, a minority of patients (16%) had some prior genetic testing, from single gene Sanger sequencing or MLPA to a gene panel screen.
To expedite recruitment a minimal clinical dataset was required for enrolment, though more detail was often provided. There was a large variety in patients’ phenotypes, from simple “chest infections” to complex syndromic features, and the collected phenotypic data of the sequenced individuals ranged from assigned disease category only to detailed clinical synopsis and immunophenotyping data. The clinical subsets used to subdivide PID patients were based on ESID definitions, as shown in Supplementary Table 1.
To facilitate analysis by grouping patients with a degree of phenotypic coherence while excluding some distinct and very rare clinical subtypes of PID that may have different aetiologies, a group of patients was determined to have antibody deficiency-associated PID (AD-PID). This group comprised 733 of the 886 unrelated index cases, and included all patients with CID, CVID or Antibody Defect ticked on the recruitment form, together with patients requiring IgG replacement therapy and those with specified low levels of IgG/A/M. SCID patients satisfying these criteria were not assigned to the AD-PID cohort.
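The AD-PID assignment rule described above can be written as a small predicate. The sketch below is an illustration under stated assumptions: the field names are hypothetical, and because the text does not give the immunoglobulin thresholds, "specified low levels of IgG/A/M" is represented as a pre-computed boolean flag.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Recruitment-form categories that qualify a patient for AD-PID (from the text).
AD_PID_CATEGORIES = frozenset({"CID", "CVID", "Antibody Defect"})


@dataclass(frozen=True)
class Recruit:
    """Hypothetical record for one index case; field names are illustrative."""
    ticked: FrozenSet[str] = frozenset()   # categories ticked on the form
    igg_replacement: bool = False          # requires IgG replacement therapy
    low_immunoglobulin: bool = False       # low IgG/A/M (threshold not stated)
    scid: bool = False                     # SCID patients are always excluded


def is_ad_pid(patient: Recruit) -> bool:
    """Apply the AD-PID inclusion criteria, with SCID as an overriding exclusion."""
    if patient.scid:
        return False
    return (
        bool(patient.ticked & AD_PID_CATEGORIES)
        or patient.igg_replacement
        or patient.low_immunoglobulin
    )


print(is_ad_pid(Recruit(ticked=frozenset({"CVID"}))))
print(is_ad_pid(Recruit(igg_replacement=True, scid=True)))
```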
WGS data processing
Details of DNA sample processing, whole genome sequencing, data processing pipeline, quality checks, alignment and variant calling, ancestry and relatedness estimation, variant normalisation and annotation, large deletion calling and filtering, and allele frequency calculations, are fully described in [NIHR BioResource, in preparation]. Briefly, DNA or whole blood EDTA samples were processed and quality checked according to standard laboratory practices and shipped on dry ice to the sequencing provider (Illumina Inc, Great Chesterford, UK). Illumina Inc performed further QC array genotyping, before fragmenting the samples to 450bp fragments and processing with the Illumina TruSeq DNA PCR-Free Sample Preparation kit (Illumina Inc., San Diego, CA, USA). Over the three-year duration of the sequencing phase of the project, different instruments and read lengths were used: for each sample, either 100bp reads on three HiSeq2500 lanes; or 125bp reads on two HiSeq2500 lanes; or 150bp reads on a single HiSeq X lane. Each delivered genome had a minimum 15X coverage over at least 95% of the reference autosomes. Illumina performed the alignment to GRCh37 genome build and SNV/InDel calling using their Isaac software, while large deletions were called with their Manta and Canvas algorithms. The WGS data files were received at the University of Cambridge High Performance Computing Service (HPC) for further QC and processing by our Pipeline team.
For each sample, we estimated the sex karyotype and computed pair-wise kinship coefficients using PLINK, which allowed us to identify sample swaps and unintended duplicates, assign ethnicities, generate networks of closely related individuals (sometimes undeclared relatives from across different disease domains) and a maximal unrelated sample set (for the purposes of allele frequency estimation and control dataset in case-control analyses). Variants in the gVCF files were normalised and loaded into an HBase database, where Overall Pass Rate (OPR) was computed within each of the three read length batches, and the lowest of these OPR values (minOPR) assigned to each variant.
Large deletions were merged and analysed collectively, as described in [NIHR BioResource, in preparation]. The analyses presented here are based on SNVs/InDels with OPR>0.98, and a set of deletions found through the SVH method to have high specificity after extensive manual inspection of individual deletion calls. Variants were annotated with Sequence Ontology terms according to their predicted consequences, their frequencies in other genomic databases (gnomAD, UK10K, 1000 Genomes), whether they have been associated with a disease according to the HGMD Pro database, and internal metrics (AN, AC, AF, OPR).
Diagnostic reporting
We screened all genes in the IUIS 2015 classification for potentially causal variants. SNVs and small InDels were filtered based on the following criteria: OPR>0.95; having a protein-truncating consequence, gnomAD AF<0.001 and internal AF<0.01; or present in the HGMD Pro database as a DM variant. Large deletions called by both Canvas and Manta algorithms, passing standard Illumina quality filters, overlapping at least one exon, and classified as rare by the SVH method were included in the analysis. To aid variant interpretation and consistency in reporting, phenotypes were translated into Human Phenotype Ontology (HPO) terms as far as possible. A Multi-Disciplinary Team (MDT) then reviewed each variant for evidence of pathogenicity and contribution to the phenotype, and classified them according to the American College of Medical Genetics (ACMG) guidelines38. Only variants classified as Pathogenic or Likely Pathogenic were systematically reported, but individual rare (gnomAD AF<0.001) or novel missense variants that BeviMed analysis (see below) highlighted as having a posterior probability of pathogenicity >0.2 were additionally considered as Variants of Unknown Significance (VUS). If the MDT decided that they were likely to be pathogenic and contribute to the phenotype, they were also reported and counted towards the overall diagnostic yield. All variants and breakpoints of large deletions reported in this study were confirmed by Sanger sequencing using standard protocols.
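The SNV and small InDel filter described above can be expressed as a predicate. The sketch below assumes one particular reading of the criteria, namely that OPR>0.95 applies to every variant and that a variant then qualifies either as a rare protein-truncating variant or as an HGMD Pro DM variant; the text's punctuation admits other readings. The record layout and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmallVariant:
    """Hypothetical annotation record for one SNV or small InDel."""
    opr: float                 # overall pass rate
    protein_truncating: bool   # predicted protein-truncating consequence
    gnomad_af: float           # allele frequency in gnomAD
    internal_af: float         # allele frequency within the cohort
    hgmd_dm: bool              # listed in HGMD Pro as a DM variant


def passes_diagnostic_filter(v: SmallVariant) -> bool:
    """Assumed logic: OPR gate, then (rare truncating) OR (HGMD DM)."""
    if v.opr <= 0.95:
        return False
    rare_truncating = (
        v.protein_truncating and v.gnomad_af < 0.001 and v.internal_af < 0.01
    )
    return rare_truncating or v.hgmd_dm


# A rare truncating variant passes; a common truncating variant does not
# unless it is also an HGMD DM entry.
print(passes_diagnostic_filter(SmallVariant(0.99, True, 0.0002, 0.001, False)))
print(passes_diagnostic_filter(SmallVariant(0.99, True, 0.02, 0.001, False)))
```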
BeviMed
We used BeviMed1 to evaluate the evidence for association between case/control status and rare variant allele counts in each gene. We inferred a posterior probability of association (PPA) under Mendelian inheritance models (dominant and recessive), and different variant selection criteria ("moderate" and "high" impact variants based on functional consequences predicted by the Variant Effect Predictor39). All genes were assigned the same prior probability of association with the disease of 0.01, regardless of their previously published associations with an immune deficiency phenotype. Genes for which BeviMed inferred a PPA >=0.18 when summed over all four combinations of inheritance model and variant selection criteria (each configuration being given a prior probability of association of 0.0025) are shown in Fig. 1e. Given each of the association models, the posterior probability that each variant is pathogenic is also computed. We used a variant-level posterior probability of pathogenicity >0.2 to select potentially pathogenic missense variants in known PID genes to report back.
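How the four model-specific results combine into a single PPA can be illustrated with Bayes' rule. The sketch below is not the BeviMed implementation: it assumes that a Bayes factor (marginal likelihood of each association model relative to the no-association model) is already available for each of the four configurations, and shows only the final step of weighting by the priors stated above (0.0025 per configuration, hence 0.99 for no association). The Bayes factor values in the example are invented.

```python
from typing import Mapping, Tuple

# (inheritance model, variant impact class); labels follow the text.
Model = Tuple[str, str]


def posterior_probability_of_association(
    bayes_factors: Mapping[Model, float], prior_per_model: float = 0.0025
) -> float:
    """Sum the posterior probabilities of all association models.

    Each Bayes factor is the marginal likelihood of an association model
    divided by that of the null (no association) model.
    """
    prior_null = 1.0 - prior_per_model * len(bayes_factors)
    evidence = sum(prior_per_model * bf for bf in bayes_factors.values())
    return evidence / (prior_null + evidence)


# Invented Bayes factors for one gene: strong support under one configuration.
example = {
    ("dominant", "high"): 400.0,
    ("dominant", "moderate"): 1.0,
    ("recessive", "high"): 1.0,
    ("recessive", "moderate"): 1.0,
}
print(posterior_probability_of_association(example))
```

With uninformative data (all Bayes factors equal to 1) the function returns the overall prior of 0.01, which is a useful sanity check on the weighting.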
Telomerecat
Average telomere length was calculated from whole-genome sequence data using Telomerecat, as reported previously22. Batch differences caused by changes in sequencing platform were normalised using a linear model, defined as $\mathrm{TL} = \beta_0 + \sum_{p} \beta_{p}\,\mathrm{batch}_{p} + \varepsilon$, where TL is the estimated telomere length, β are regression coefficients, batch represents a dummy variable denoting the plate a sample was sequenced on, and ε is the residual. For each plate the relevant coefficient was subtracted from all of the observed telomere lengths within that plate.
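The batch correction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it relies on the fact that, for a linear model containing only an intercept and plate dummy variables, each least-squares plate coefficient equals the difference between that plate's mean and the reference plate's mean. The choice of reference plate (the first in sorted order) is an assumption.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def batch_adjust(
    lengths: Sequence[float], plates: Sequence[str]
) -> Tuple[List[float], Dict[str, float]]:
    """Subtract each plate's fitted coefficient from its telomere lengths."""
    by_plate: Dict[str, List[float]] = defaultdict(list)
    for tl, plate in zip(lengths, plates):
        by_plate[plate].append(tl)

    reference = sorted(by_plate)[0]  # assumed reference level (coefficient 0)
    ref_mean = mean(by_plate[reference])

    # OLS coefficient of each plate dummy = plate mean minus reference mean.
    coef = {plate: mean(vals) - ref_mean for plate, vals in by_plate.items()}
    adjusted = [tl - coef[plate] for tl, plate in zip(lengths, plates)]
    return adjusted, coef


# Toy data: plate B reads 1,000 bp longer than plate A on average.
adjusted, coef = batch_adjust(
    [5000.0, 5200.0, 6000.0, 6200.0], ["A", "A", "B", "B"]
)
print(coef)
print(adjusted)
```

After adjustment every plate shares the reference plate's mean, while differences between samples within a plate are preserved.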
After adjusting for batch effects, telomere length was compared to age in 3,313 NBR-RD subjects. We obtained a strong negative correlation with age (r = −0.56, Pearson’s correlation), thus validating Telomerecat as a reliable method for estimating telomere lengths. We found that each year of additional age was equivalent to a 33bp deterioration in telomere length (Supplementary Fig. 6). Although this observed negative correlation is well established within the literature, we obtain a particularly high correlation with our method, which could be partly driven by the wide age range of our sample set.
To normalise telomere lengths for comparison of samples from disparate age and gender, the following linear model was fitted to the data using age as a continuous variable and gender as a dummy variable:
The relevant residuals produced by the cubic model were subtracted from the mean telomere length of the cohort. These adjusted telomere lengths were used in the GWAS analysis.
To assess for monogenic causes of telomere shortening, we identified subjects within the PID cohort who had telomere lengths below the 10th centile of age-adjusted values and carried hemizygous or homozygous SNVs with gnomAD AF<0.001 in the TERC, TERT, NHP2, TINF2, NOP10, PARN, ACD, WRAP53, CTC1, RTEL1 or DKC1 genes.
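The selection of short-telomere candidates described above can be sketched as follows. The record layout is hypothetical, and two details are assumptions: the 10th centile is estimated with the default method of Python's `statistics.quantiles`, and it is computed over whatever set of subjects is passed in, since the text does not state the estimator or the reference population.

```python
from dataclasses import dataclass, field
from statistics import quantiles
from typing import List

# Telomere deficiency genes listed in the text.
TELOMERE_GENES = frozenset({
    "TERC", "TERT", "NHP2", "TINF2", "NOP10", "PARN",
    "ACD", "WRAP53", "CTC1", "RTEL1", "DKC1",
})


@dataclass
class Variant:
    gene: str
    zygosity: str      # "hom", "hem" or "het" (illustrative encoding)
    gnomad_af: float


@dataclass
class Subject:
    subject_id: str
    adjusted_length: float              # age-adjusted telomere length
    variants: List[Variant] = field(default_factory=list)


def short_telomere_candidates(subjects: List[Subject]) -> List[str]:
    """IDs of subjects below the 10th centile carrying a qualifying variant."""
    cutoff = quantiles([s.adjusted_length for s in subjects], n=10)[0]
    hits = []
    for s in subjects:
        if s.adjusted_length >= cutoff:
            continue
        if any(
            v.gene in TELOMERE_GENES
            and v.zygosity in {"hom", "hem"}
            and v.gnomad_af < 0.001
            for v in s.variants
        ):
            hits.append(s.subject_id)
    return hits
```

Heterozygous variants are deliberately ignored here because the text restricts the search to hemizygous or homozygous SNVs.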
AD-PID GWAS
GWAS was performed both on the whole PID cohort (N cases = 886) and on a subset of AD-PID cases (N cases = 733); here we present the results of the latter analysis, which was less noisy despite a reduced sample size. We used 9,225 unrelated samples from non-PID NBR-RD cohorts as controls.
Variants selected from a merged VCF file were filtered to include bi-allelic SNPs with overall MAF>=0.05 and minOPR=1 (100% pass rate). We ran the PLINK logistic association test under an additive model using the read length, sex, and first 10 principal components from the ethnicity analysis as covariates. After filtering out SNPs with HWE p<10⁻⁶, we were left with a total of 4,993,945 analysed SNPs. There was minimal genomic inflation of the test statistic (lambda = 1.027), suggesting population substructure and sample relatedness had been appropriately accounted for. The only genome-wide significant (p<5×10⁻⁸) signal was at the MHC locus, with several suggestive (p<1×10⁻⁵) signals (Supplementary Fig. 7). We repeated the analysis with more relaxed SNP filtering criteria using MAF>=0.005 and minOPR>0.95. The only additional signals identified were the three TNFRSF13B variants shown in Extended Data Fig. 1.
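The genomic inflation factor quoted above is conventionally computed as the median of the observed 1-degree-of-freedom chi-square statistics divided by the median expected under the null (approximately 0.4549). The sketch below assumes that convention and that two-sided association p-values are the input; the text does not state which tool was used to compute lambda.

```python
from statistics import NormalDist, median
from typing import Iterable

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values: Iterable[float]) -> float:
    """Genomic inflation factor (lambda) from two-sided p-values in (0, 1]."""
    std_normal = NormalDist()
    # A two-sided p-value maps to |z| with lower-tail probability p / 2;
    # z squared is the equivalent 1-df chi-square statistic.
    chi2 = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# p-values whose median is 0.5 give lambda close to 1 (no inflation).
print(genomic_inflation([0.1, 0.3, 0.5, 0.7, 0.9]))
```

Values close to 1, such as the 1.027 reported here, indicate that the bulk of test statistics follow the null distribution.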
We obtained summary statistics data from the Li et al. CVID Immunochip case-control study9 and performed a fixed effects meta-analysis on 95,417 variants shared with our AD-PID GWAS. For each of the genome-wide and suggestive loci after meta-analysis, we conditioned on the lead SNP by including it as an additional covariate in the logistic regression model, to determine whether the signal was driven by a single hit or by multiple hits at those loci. Only the MHC locus showed evidence of multiple independent signals (Supplementary Fig. 8).
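A fixed effects meta-analysis of per-variant effect estimates is commonly carried out by inverse-variance weighting. The sketch below assumes that weighting scheme (the text does not specify which was used) and takes log odds ratios with their standard errors from each study; the function name is hypothetical.

```python
import math


def fixed_effects_meta(betas, ses):
    """Combine per-study log odds ratios by inverse-variance weighting."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p
```

For example, `fixed_effects_meta([0.2, 0.4], [0.1, 0.1])` weights both studies equally and returns a pooled estimate midway between the two inputs, with a smaller standard error than either study alone.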
MHC locus imputation
We imputed classical HLA alleles using the method implemented in the SNP2HLA v1.0.3 package40, which uses Beagle v3.0.4 for imputation and the HapMap CEU reference panel. We imputed allele dosages and best-guess genotypes of 2-digit and 4-digit classical HLA alleles, as well as amino acids of the MHC locus genes HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DQA1 and HLA-DQB1. We tested the association of both allele dosages and genotypes using the logistic regression implemented in PLINK, and obtained similar results. We then used the best-guess genotypes to perform the conditional analysis in PLINK, since conditioning is not implemented in a model with allele dosages.
Allele Specific Expression
RNA and gDNA were extracted from PBMCs using the AllPrep kit (Qiagen) as per the manufacturer’s instructions. RNA was reverse transcribed to make cDNA using the SuperScript™ VILO™ cDNA synthesis kit with appropriate minus reverse transcriptase controls, as per the manufacturer’s instructions. The region of interest in the gDNA and 1:10 diluted cDNA was amplified using Phusion (Thermo Fisher) and the following primers on a G-Storm thermal cycler with 30 seconds at 98°C then 35 cycles of 98°C 10 seconds, 60°C 30 seconds, 72°C 15 seconds.
ARPC1B
The region of interest spanning the frameshift variant was amplified using the following primers: Forward: GGGTACATGGCGTCTGTTTC / Reverse: CACCAGGCTGTTGTCTGTGA
PCR products were run on a 3.5% agarose gel. Bands were cut out and product extracted using the QIAquick Gel Extraction Kit (Qiagen), as per protocol. Expected products were confirmed by Sanger sequencing. 4 µl of fresh PCR product was used in a TOPO® cloning reaction (Invitrogen) and used to transform One Shot™ TOP10 chemically competent E. coli. These were cultured overnight then spread on LB agar plates. Individual colonies were picked and genotyped. ARPC1B mRNA expression was assessed using a TaqMan gene expression assay with 18S and EEF1A1 as control genes. Each sample was run in triplicate for each gene with a no template control. PCR was run on a LightCycler® (Roche) with 2 minutes at 50°C, 20 seconds at 95°C then 45 cycles of 95°C 3 seconds, 60°C 30 seconds.
PTPN2
The PTPN2 ASE protocol was modified from the one above. RNA and genomic DNA were extracted from PBMCs using the AllPrep Kit (Qiagen). RNA was treated with Turbo DNase (Thermo) and reverse transcribed to generate cDNA using the SuperScript IV VILO master mix (Thermo). The intronic region of interest in gDNA and cDNA was amplified by two nested PCR reactions using Phusion enzyme (Thermo). The primers (F1/R1) and nested primers (F2/R2) used were:
Forward_1: aaagtctggagcaggcagag / Reverse_1: tgggggaactggttatgctttc
Forward_2: ggagctatgatcacgccacatg / Reverse_2: atgctttctggttgggctgac
PCR products were run on a 1% agarose gel. Bands were cut out and product extracted using the QIAquick Gel Extraction Kit (Qiagen), as per protocol. Expected products were confirmed by Sanger sequencing. 5 ng of fresh PCR product was used in a TOPO® cloning reaction (Invitrogen) and used to transform One Shot™ TOP10 chemically competent E. coli. These were cultured overnight then spread on LB agar plates. Individual colonies were picked and genotyped. PTPN2 mRNA expression was assessed using a TaqMan SNP genotyping assay run on a LightCycler (Roche).
PAGE and Western Blot analysis
Samples were separated by SDS polyacrylamide gel electrophoresis and transferred onto a nitrocellulose membrane. Individual proteins were detected with antibodies against ARPC1b (goat polyclonal antibodies, ThermoScientific, Rockford, IL, USA), against ARPC1a (rabbit polyclonal antibodies, Sigma, St Louis, USA) and against actin (mouse monoclonal antibody, Sigma). Secondary antibodies were donkey-anti-goat-IgG IRDye 800CW, goat-anti-mouse-IgG IRDye 800CW or donkey-anti-rabbit-IgG IRDye 680CW (LI-COR Biosciences, Lincoln, NE, USA). Quantification of bound antibodies was performed on an Odyssey Infrared Imaging system (LI-COR Biosciences, Lincoln, NE, USA).
Phasing of SOCS1 variants
To phase the common variant rs2286974 with the novel stop-gain SOCS1 variant (chr16:11348854 T>TGCGGC) identified in the same patient, we performed long-read WGS with Oxford Nanopore Technologies (ONT). The sample was prepared using the 1D ligation library prep kit (SQK-LSK108), and genomic libraries were sequenced on R9.4 flowcells. Sequencing was carried out on a GridION system. Read sequences were extracted from base-called FAST5 files by Guppy (v0.5.1) to generate FASTQ files, which were then aligned against the GRCh37/hg19 human reference genome using minimap2 (v2.2). Four runs were performed in order to reach an average coverage of 14x, with a median read length of 5,006 ± 3,981 bp. Haplotyping and genotyping were performed with MarginPhase.
Structural deletion analysis
Structural (length >50bp) deletions (MAF>0.03) were called as previously described41. For all downstream analysis we used gencode v26 annotations downloaded from [ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/GRCh37_mapping/gencode.v26lift37.annotation.gtf.gz]. We defined promoters as a window of +/-500bp around any protein coding gene transcriptional start site (TSS). To associate cis regulatory elements (cRE) with putative target genes, we combined, by physical location overlap, super enhancer cRE annotations from Hnisz et al.18 with promoter capture Hi-C (pcHi-C) data from ref. 19, matching by tissue. We next computed the overlap of structural variants occurring in the PID cohort with cREs for which putative target genes were available. We classified overlaps between deletions and functional annotations into three non-mutually exclusive categories: ‘prom’ - overlaps focal gene promoter, ‘exon’ - overlaps focal gene exon, ‘pse’ - overlaps Hnisz et al.18 SE annotation linked to focal gene by pcHi-C. We compiled a catalogue of compound heterozygous deletions where there was evidence in the same individual for a damaging (CADD>20) rare (gnomAD AF<0.001) variant within the same gene.
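The three overlap categories can be expressed as simple interval tests. The sketch below is illustrative only: the dictionary layout for a gene (`tss`, `exons`, `linked_se`) and the function names are assumptions, and it presumes that the super enhancer to gene links have already been established from pcHi-C data.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def classify_deletion(deletion, gene, promoter_flank=500):
    """Return the set of categories ('prom', 'exon', 'pse') a deletion falls into.

    deletion: (chrom, start, end)
    gene: dict with 'chrom', 'tss', 'exons' and 'linked_se'; the last two are
          lists of (start, end) intervals on the same chromosome (assumed layout).
    """
    chrom, start, end = deletion
    if chrom != gene["chrom"]:
        return set()
    categories = set()
    tss = gene["tss"]
    # Promoter: window of +/- promoter_flank bp around the TSS.
    if overlaps(start, end, tss - promoter_flank, tss + promoter_flank):
        categories.add("prom")
    if any(overlaps(start, end, s, e) for s, e in gene["exons"]):
        categories.add("exon")
    # Super enhancers already linked to this gene by pcHi-C.
    if any(overlaps(start, end, s, e) for s, e in gene["linked_se"]):
        categories.add("pse")
    return categories
```

Returning a set rather than a single label reflects that the categories are not mutually exclusive: a large deletion can remove a promoter and an exon at once.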
AD-PID GWAS Enrichment
Due to the size of the AD-PID cohort, we were unable to use LD-score regression42 to assess genetic correlation between distinct and related traits. We therefore adapted the previous enrichment method ‘blockshifter’43 to assess evidence for the enrichment of AD-PID association signals in a compendium of 9 European-ancestry GWAS summary statistics datasets assembled from publicly available data. We removed the MHC region from all downstream analysis [GRCh37 chr6:25-45Mb]. To adjust for linkage disequilibrium (LD), we split the genome into 1cM recombination blocks based on HapMap recombination frequencies44. For a given GWAS trait, for n variants within LD block b we used Wakefield’s synthesis of asymptotic Bayes factors (aBF)45 to compute the posterior probability that the ith variant is causal (PPCVi) under single causal variant assumptions46:
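The expression is written here in the standard single causal variant form, as a reconstruction consistent with the symbols defined in the surrounding text. The unit term in the denominator is an assumption; it represents the possibility that the block contains no causal variant, which is what allows the block-level sum of posteriors to fall below 1.

```latex
\mathrm{PPCV}_i \;=\; \frac{\pi_i \,\mathrm{aBF}_i}{1 + \sum_{j=1}^{n} \pi_j \,\mathrm{aBF}_j}
```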
Here πi = πj are flat prior probabilities for a randomly selected variant from the genome to be causal, and we use the value 1×10⁻⁴ (ref. 47). We sum these PPCVs within an LD block b to obtain the posterior probability that b contains a single causal variant (PPCB).
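The calculation of PPCV and PPCB for one block can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the prior standard deviation on the log odds ratio (`prior_sd=0.2`) is an assumed value not given in the text, the denominator includes a unit term for the case of no causal variant, and the function names are hypothetical. Work is done on the log scale to avoid overflow for strongly associated variants.

```python
import math


def log_abf(beta, se, prior_sd=0.2):
    """Log of Wakefield's approximate Bayes factor in favour of association.

    prior_sd is the assumed prior standard deviation of the log odds ratio.
    """
    v = se ** 2
    w = prior_sd ** 2
    z2 = (beta / se) ** 2
    return 0.5 * math.log(v / (v + w)) + 0.5 * z2 * w / (v + w)


def block_posteriors(betas, ses, prior=1e-4, prior_sd=0.2):
    """Return (PPCV per variant, PPCB) for the variants in one LD block."""
    logs = [math.log(prior) + log_abf(b, s, prior_sd) for b, s in zip(betas, ses)]
    # 0.0 is log(1), the term for "no causal variant in this block".
    m = max(logs + [0.0])
    denom = math.exp(-m) + sum(math.exp(l - m) for l in logs)
    ppcv = [math.exp(l - m) / denom for l in logs]
    return ppcv, sum(ppcv)
```

With a flat prior of 1×10⁻⁴, a block of unassociated variants yields a PPCB near zero, whereas a single strongly associated variant drives both its own PPCV and the block's PPCB towards 1.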
To compute enrichment for trait t, we convert PPCBs into a binary label by applying a threshold such that PPCBt > 0.95. We apply these block labels for trait t to the PPCBs (computed as described above) for our AD-PID cohort GWAS, using them to compute a non-parametric Wilcoxon rank sum statistic, W, representing the enrichment. Whilst the aBF approach naturally adjusts for LD within a block, residual LD between blocks may exist. In order to adjust for this and other confounders (e.g. block size) we use a circularised permutation technique48 to compute Wnull. To do this, for a given chromosome, we select recombination blocks and circularise them such that the beginning of the first block adjoins the end of the last. Permutation proceeds by rotating the block labels while maintaining the AD-PID PPCB assignment. In this way many permutations of Wnull can be computed whilst conserving the overall block structure.
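The rotation scheme can be illustrated on a single chromosome with a short sketch. It is not the wgsea implementation: for simplicity it enumerates every non-zero rotation instead of sampling permutations, handles one chromosome only, and summarises the result as a z-score, which is one possible way to adjust W against Wnull. All function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values given the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum(ranks, labels):
    """Wilcoxon rank sum W: total rank of the blocks labelled True."""
    return sum(r for r, lab in zip(ranks, labels) if lab)


def circular_permutation_z(ppcb, labels):
    """Return (W observed, list of W under rotation, z-score).

    ppcb: AD-PID PPCB per block, in chromosomal order.
    labels: list of booleans, True where the other trait has PPCB > 0.95.
    Requires at least three blocks so that a null variance can be computed.
    """
    ranks = average_ranks(ppcb)
    w_obs = rank_sum(ranks, labels)
    n = len(labels)
    # Rotate the labels around the circularised chromosome; PPCBs stay in place.
    w_null = [rank_sum(ranks, labels[-k:] + labels[:-k]) for k in range(1, n)]
    mean = sum(w_null) / len(w_null)
    var = sum((w - mean) ** 2 for w in w_null) / (len(w_null) - 1)
    z = (w_obs - mean) / var ** 0.5 if var > 0 else float("nan")
    return w_obs, w_null, z
```

Because rotation moves runs of adjacent labelled blocks together, any clustering of trait-associated blocks is preserved in every null replicate, which is the property that a simple label shuffle would destroy.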
For each trait we used 10⁴ permutations to compute adjusted Wilcoxon rank sum scores using the wgsea [https://github.com/chr1swallace/wgsea] R package.
PID monogenic candidate gene prioritisation
We hypothesised, given the genetic overlap with antibody associated PID, that common regulatory variation, elucidated through association studies of immune-mediated disease, might prioritise genes harbouring damaging LOF variants underlying PID. Firstly, using summary statistics from our combined fixed effect meta-analysis of AD-PID, we compiled a list of densely genotyped ImmunoChip regions containing one or more variants with P<1×10⁻⁵. Next, we downloaded ImmunoChip (IC) summary statistics from ImmunoBase (accessed 30/07/2018) for all 11 available studies. For each study we intersected PID suggestive regions, and used COGS (https://github.com/ollyburren/rCOGS) in conjunction with promoter-capture Hi-C datasets for 17 primary cell lines19,43 in order to prioritise genes. We selected genes with a COGS score >0.5 (refs 19, 43) to obtain a list of 11 protein coding genes.
We further hypothesised that genes harbouring rare LOF variation causal for PID would be intolerant to variation. We thus downloaded pLI scores49 and took the product of these and the COGS scores to compute an ‘overall’ prioritisation score for each trait and gene combination. We applied a final filter, taking forward only those genes with an above-average ‘overall’ score, to obtain a final list of 6 candidate genes (Fig. 4d). Finally, we filtered the cohort for damaging rare (gnomAD AF<0.001) protein-truncating variants (frameshift, splice-site, nonsense) within these genes in order to identify individuals for functional follow up.
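The two-stage filter can be summarised in a few lines. The sketch below assumes that "above average" means greater than the arithmetic mean of the ‘overall’ scores among combinations passing the COGS threshold; the data layout and function name are hypothetical, and any gene names or scores used with it are placeholders, not results from this study.

```python
def prioritise(cogs, pli, cogs_threshold=0.5):
    """Return {(trait, gene): overall score} for combinations passing both filters.

    cogs: {(trait, gene): COGS score}
    pli:  {gene: pLI score}
    """
    # Stage 1: keep trait-gene pairs with COGS above the threshold,
    # then multiply by the gene's pLI to obtain the 'overall' score.
    overall = {
        key: score * pli[key[1]]
        for key, score in cogs.items()
        if score > cogs_threshold and key[1] in pli
    }
    if not overall:
        return {}
    # Stage 2: keep only pairs scoring above the mean 'overall' score.
    mean_score = sum(overall.values()) / len(overall)
    return {key: s for key, s in overall.items() if s > mean_score}
```

Multiplying the two scores means a gene must be both well supported by common-variant evidence and intolerant to loss of function to rank highly; a high value on one axis cannot compensate for a near-zero value on the other.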
Statistical analysis
Statistical analysis was carried out using R (version 3.3.3, “Another Canoe”) and Graphpad Prism (version 7) unless otherwise stated. R code for running the major analyses is available at https://github.com/ollyburren/pid_thaventhiran_et_al.
Acknowledgements
Funding for the NIHR-BioResource was provided by the National Institute for Health Research (NIHR, grant number RG65966). We gratefully acknowledge the participation of all NIHR BioResource volunteers, and thank the NIHR BioResource centre and staff for their contribution. J.E.D.T. is supported by the MRC (RG95376 and MR/L006197/1). AJT is supported by the Wellcome Trust (104807/Z/14/Z) and the NIHR Biomedical Research Centre at Great Ormond Street Hospital for Children NHS Foundation Trust and University College London. KGCS is supported by the Medical Research Council (program grant MR/L019027) and is a Wellcome Investigator. AJC was supported by the Wellcome [091157/Z/10/Z], [107212/Z/15/Z], [100140/Z/12/Z], [203141/Z/16/Z]; JDRF [9-2011-253], [5-SRA-2015-130-A-N]; NIHR Oxford Biomedical Research Centre and the NIHR Cambridge Biomedical Research Centre. EE has received funding from the European Union Seventh Framework Programme (FP7-PEOPLE-2013-COFUND) under grant agreement no 609020-Scientia Fellows.