Genome-wide association study of diabetic kidney disease highlights biology involved in renal basement membrane collagen

Diabetic kidney disease (DKD) is a heritable but poorly understood complication of diabetes. To identify genetic variants predisposing to DKD, we performed genome-wide association analyses in 19,406 individuals with type 1 diabetes (T1D) using a spectrum of DKD definitions basedon albuminuria and renal function. We identified 16 genome-wide significant loci. The variant with the strongest association (rs55703767) is a common missense mutation in the collagen type IV alpha 3 chain (COL4A3) gene, which encodes a major structural component of the glomerular basement membrane (GBM) implicated in heritable nephropathies. The rs55703767 minor allele (Asp326Tyr) is protective against several definitions of DKD, including albuminuria and end-stage renal disease. Three other loci are in or near genes with known or suggestive involvement in DKD (BMP7) or renal biology (COLEC11 and DDR1). The 16 DKD-associated loci provide novel insights into the pathogenesis of DKD, identifying potential biological targets for prevention and treatment.


KEYWORDS
Diabetic kidney disease, genome-wide association study, type 1 diabetes, genetics, diabetic complications, COL4A3 The devastating diabetic complication of DKD is the major cause of end-stage renal disease (ESRD) worldwide 1,2 . Current treatment strategies at best slow the progression of DKD, and do not halt or reverse the disease. Although improved glycemic control influences the rate of diabetic complications, a large portion of the variation in DKD susceptibility remains unexplained: one third of people with T1D develop DKD despite adequate glycemic control, while others maintain normal renal function despite long-term severe chronic hyperglycemia 3 .
Potential reasons for the limited success include small sample sizes, modest genetic effects, and lack of consistency of phenotype definitions and statistical analyses across studies.
Through collaboration within the JDRF Diabetes Nephropathy Collaborative Research Initiative (DNCRI), we adopted three approaches to improve our ability to find new genetic risk factors for DKD: 1) assembling a large collection of T1D cohorts with harmonized DKD phenotypes, 2) creating a comprehensive set of detailed DKD definitions, and 3) augmenting genotype data with low frequency and exome array variants.

Phenotypic comparisons
We investigated a broad spectrum of DKD definitions based on albuminuria and renal function criteria, defining a total of 10 different case-control comparisons to cover the different aspects of disease progression (Figure 1). Seven comparisons were based on albuminuria and/or ESRD (including diabetic nephropathy [DN], defined as either macroalbuminuria or ESRD); two were defined based on eGFR (used to classify severity of chronic kidney disease [CKD]); and one combined both albuminuria and eGFR data ("CKD-DN"). Each phenotypic definition was analyzed separately in GWAS; to account for the 10 definitions each analyzed under two covariate adjustment models, we estimated 14 the total effectively independent tests as 7.4, allowing us to compute a more conservative study-wide significance threshold (P<6.76×10 -9 ), based on genome-wide significance (P<5×10 -8 ) and Bonferroni correction for 7.4 effective tests.

Top genome-wide association results highlight COL4A3
GWAS meta-analysis included association results for up to 19,406 individuals with T1D of European descent from 17 cohorts for the 10 case-control definitions (Table S1). We identified 16 novel independent loci that achieved genome-wide significance (P<5×10 -8 ), in which four lead SNPs also surpassed our more conservative study-wide significance threshold (Table 1; Interestingly, we found that rs55703767 in COL4A3 was more strongly associated in men (OR=0.73, P=1.29×10 -11 ) than in women (OR=0.85, P=1.39×10 -3 ; Phet=1.58×10 -2 ). COL4A3 encodes the alpha 3 chain of collagen type IV, a major structural component of the GBM 15 .

COL4A3 variation and kidney phenotypes
In persons with T1D and normoalbuminuria, GBM width predicts progression to proteinuria and ESRD independently of glycated hemoglobin (HbA1c) 16 . We examined the influence of the COL4A3 variant on GBM width in 253 Renin-Angiotensin System Study (RASS) 17 participants with T1D who had biopsy and genetic data ( Table S2). The DKD-protective minor T allele was associated with 19.7 nm lower GBM width (standard error (SE) 8.2 nm, P=0.02), with the lowest mean GBM width among TT homozygotes (Figure 3; Table S3), after adjusting for age, sex,   and diabetes duration, and without detectable interactions with T1D duration or mean HbA1c. Furthermore, in a Pima Indian cohort of 97 subjects with DKD with morphometric and expression data from renal biopsies, COL4A3 expression was negatively correlated with the GBM surface density (filtration surface density) (β=-0.27, P=0.02), which is associated with eGFR in DKD in both T1D and type 2 diabetes (T2D) 18,19 .

Evidence for hyperglycemia specificity
Hyperglylcemia promotes the development of diabetic complications. If a genetic variant exerts a stronger effect in the setting of hyperglycemia, 1) it might not be detected in general CKD, 2) it may be detected whether hyperglycemia is conferred by T1D or T2D, 3) its effect may be stronger at higher glycemic strata, and 4) interventions that reduce glycemia may attenuate the association signal. COL4A3 rs55703767 was not associated with eGFR in a general population sample of 110,517 mainly non-diabetic participants of European ancestry 20 (Table S4).
However, in a smaller cohort of 5,190 participants with T2D and DKD phenotypes in the SUrrogate markers for Micro-and Macro-vascular hard endpoints for Innovative diabetes Tools (SUMMIT) consortium, we detected a directionally consistent suggestive association of COL4A3 rs55703767 with DN (2-tailed P=0.08; Table S5).
We further stratified the association analyses by HbA1c in the Finnish Diabetic Nephropathy (FinnDiane) Study, a cohort study with extensive longitudinal phenotypic data 21 . Based on the time-weighted mean of all available HbA1c measurements for each individual, 1,344 individuals had mean HbA1c <7.5% (58 mmol/mol), and 2,977 with mean HbA1c ≥7.5%. COL4A3 rs55703767 was nominally significant (P<0.05) only in individuals with HbA1c ≥7.5% (Figure 4, Table S6, Figure S3). However, in a similar study setting of individuals with T2D from the GoDARTS study (N=3226) 13,22 , no difference was observed for COL4A3 rs55703767 between HbA1c strata ( Figure S4).
We performed a similar HbA1c stratified analysis in the Diabetes Control and Complications Trial (DCCT), whose participants continue to be followed in the Epidemiology of Diabetes Interventions and Complications (EDIC) 23,24 . In DCCT-EDIC the effect of COL4A3 rs55703767 was stronger among those recruited in the secondary cohort (mild retinopathy and longer diabetes duration at baseline) who were originally randomized to conventional treatment and therefore had higher HbA1c than the intensive treatment group (Table S7). Taken together, these independent lines of evidence strongly suggest that the COL4A3 variant effect on DKD risk is amplified by poor glycemic control.

Other association signals
Two other genome-wide significant signals were near genes encoding proteins related to In addition to COL4A3 rs55703767, three other low-frequency variants associated with microalbuminuria achieved study-wide significance: rs142823282 (MAF=0.017), 22 kb upstream of TAMM41 encoding a mitochondrial translocator assembly and maintenance protein 25,26 (OR=6.75, P=1.13×10 -11 ), rs144434404 (MAF=0.011), in intron 1 of BMP7 encoding the bone morphogenetic protein 7 previously implicated in DKD 27  and four with MAF<0.01) and did not achieve study-wide significance.
As we had done for COL4A3 rs55703767, we tested whether the associations of the 15 other variants were amplified by hyperglycemia. None of the 15 variants were significantly associated with eGFR in the general population (Table S4). In the smaller SUMMIT T2D cohort 13 we were able to interrogate seven loci with comparable trait definitions. The odds ratios were directionally consistent in six of them (binomial sign test: P=0.0625, Table S5). In FinnDiane seven of the remaining 15 loci were observed with sufficient frequency (minor allele counts >10) to allow subgroup analysis. Two additional SNPs (rs149641852 in SNCAIP and rs12615970 near COLEC11) were nominally significant (P<0.05) only in individuals with HbA1c ≥7.5% (Table S6, Figure S3).

Variants previously associated with DKD
We investigated the effect of variants previously associated at genome-wide significance with renal complications in individuals with diabetes [8][9][10][11][12][13]28 . Across the ten sub-phenotypes in our meta-analysis, we found evidence of association for seven of nine examined loci (P<0.05, Figure S5): We replicated two loci that were previously discovered without overlapping individuals with the current study: SCAF8/CNKSR3 rs12523822, originally associated with DKD (P=6.8×10 -4 for "All vs ctrl") 8 ; and UMOD rs77924615, originally associated with eGFR in both individuals with and without diabetes (P=5.2×10 -4 for "CKD") 20 . Associations at the AFF3, RGMA-MCTP2, and ERBB4 loci, identified in the GEnetics of Nephropathy-an International Effort (GENIE) consortium 12 , comprised of a subset of studies included in this current effort, remained associated with DKD, though the associations were attenuated in this larger dataset P=3.53×10 -5 ; Figure S6). Associations were also observed at the CDCA7/SP3 (rs4972593, P=0.020 for "CKD-DN", originally for ESRD exclusively in women 11 ) and GLRA3 (rs1564939, P=0.016 for "CKD extremes", originally for AER 10,28 ), but these analyses also include individuals that overlap with the original studies. Apart from the UMOD locus, none of the 63 loci associated with eGFR in the general population 20 were associated with DKD after correction for multiple testing (P<7.0×10 -4 , Figure S7).
Additionally, we used MAGMA, PASCAL, DEPICT, and MAGENTA to conduct gene-set analysis in our GWAS dataset. The four methods identified 12 significantly enriched gene sets (Table S10). One gene set, "negative regulators of RIG-I MDA5 signaling" was identified in two different pathway analyses (MAGMA and PASCAL) of our fully adjusted GWAS of ESRD vs.
Macro. Several additional related and overlapping gene sets were identified, including "RIGI MDA5 mediated induction of IFN alpha beta pathways", "TRAF3 dependent IRF activation pathway", and "TRAF6 mediated IRF activation" (PASCAL) and "activated TLR4 signaling" (MAGENTA). RIG-I, MDA5 and the toll-like receptor TLR4 are members of the innate immune response system that respond to both cellular injury and infection 33,34 and transduce highly intertwined signaling cascades. These include the signaling molecules TRAF3 and TRAF6, which induce expression of type I interferons and proinflammatory cytokines implicated in the progression of DKD 35,36 . Specifically, the TLR4 receptor and several of its ligands and downstream cytokines display differential levels of expression in DKD renal tubules vs. normal kidneys and vs. non-diabetic kidney disease controls 37 , and TLR4 knockout mice are protected from DKD and display marked reductions in interstitial collagen deposition in the kidney 38 . Other pathways of interest include "other lipid, fatty acid and steroid metabolism", "nitric oxide signaling in the cardiovascular system", and "Tumor necrosis factor (TNF) family member", with both nitric oxide and TNF-implicated in DKD 39,40,41 .

Expression and epigenetic analyses
We interrogated gene expression datasets in relevant tissues to determine whether our top signals underlie expression quantitative trait loci (eQTL). We first analyzed genotype and RNAseq gene expression data from 96 whole human kidney cortical samples 42 and microdissected human kidney samples (121 tubule and 119 glomerular samples 43 from subjects of European descent without any evidence of renal disease ( Figure S8). No findings in this data set achieved significance after correction for multiple testing. In the GTEx and eQTLgen datasets, COL4A3 rs55703767 had a significant eQTL (P=5.63×10 -38 ) with the MFF gene in blood, and rs118124843 near DDR1 and VARS2 had multiple significant eQTLs in blood besides VARS2 (P=1.71×10 -5 ; Table S11). Interestingly, rs142823282 near TAMM41 was a cis-eQTL for PPARG (P=4.60×10 -7 ), a transcription factor regulating adipocyte development, glucose and lipid metabolism; PPARγ agonists have been suggested to prevent DKD 44 .
To ascertain the potential functional role of our top non-coding signals, we mined ChIP-seq data derived from healthy adult human kidney samples 45 . SNP rs142823282 near TAMM41 showed enrichment for the histone marks H3K27ac, H3K9ac, H3K4me1, and H3K4me3, suggesting that this is an active regulator of TAMM41 or another nearby gene ( Figure S9). Interestingly, in recent work we have shown that DNA methylation profiles in participants with T1D with/without kidney disease show the greatest differences in methylation sites near TAMM41 46 .
To establish whether the expression of our top genes shows enrichment in a specific kidney cell type, we queried an expression dataset of ~50,000 single cells obtained from mouse kidneys 47 .
Expression was detected for six genes in the mouse kidney atlas: three (COL4A3, SNCAIP, and BMP7) were almost exclusively expressed in podocytes (Figure 5), supporting the significant role for podocytes in DKD.

DISCUSSION
Our genome-wide analysis of 19,406 participants with T1D identified 16 genome-wide significant loci associated with DKD, four of which remained significant after a conservative correction for multiple testing. Four of the 16 genome-wide significant signals are in or near genes with known or suggestive biology related to renal function/collagen (COL4A3, BMP7, COLEC11, and DDR1), but this is the first time that naturally occurring variation (MAF > 1%) in these loci has been associated with DKD. Our most significant signal was a protective missense variant in COL4A3, rs55703767, reaching both genome-wide and study-wide significance with multiple definitions of DKD. Moreover, this variant demonstrated significant association with GBM width and with the degree of fibrosis in renal tubules and glomeruli, and its effect was dependent on glycemia.
COL4A3, with COL4A4 and COL4A5, make up the so-called "novel chains" of type IV collagen 48 , which together play both structural and signaling roles in the GBM. Specifically, COL4A3 is known to bind a number of molecules including integrins, heparin and heparin sulfate proteoglycans, and other components of the GBM such as laminin and nidogen. These interactions mediate the contact between cells and the underlying collagen IV basement membrane, and regulate various processes essential to embryonic development and normal physiology including cell adhesion, proliferation, survival and differentiation. Dysregulation of these interactions has been implicated in several pathological conditions including CKD 49 .
Mutations in COL4A3 are responsible for the autosomal recessive form of Alport syndrome, a progressive inherited nephropathy, as well as benign familial hematuria, characterized by thin (or variable width) GBM, and thought to be a milder form of Alport syndrome 50 . The rs55703767 SNP is predicted to alter the third amino acid of the canonical triple-helical domain sequence of Glycine (G)-X-Y (where X and Y are often proline (P) and hydroxyproline (Y), respectively) from G-E-D (D=Aspartic) to G-E-Y 51 , potentially impacting the structure of the collagen complex. In addition, a recent study 52 of candidate genes involved in renal structure reported rs34505188 in COL4A3 (not in linkage disequilibrium with rs55703767, r 2 =0.0006) to be associated with ESRD in African Americans with T2D (MAF=2%, OR=1.55, P=5×10 -4 ). Together with the trend towards association we have seen in SUMMIT and the glycemic interaction we have reported here, these findings suggest variation in COL4A3 may be associated with DKD in T2D as well.
Given its association as a protective SNP, we can speculate that the rs55703767 variant may confer tensile strength or flexibility to the GBM, which may be of particular relevance in the glomerular hypertension associated with DKD. Alternatively, COL4A3 may regulate the rates of production and/or turnover of other GBM components, affecting GBM width changes in diabetes. How these effects might confer protection in a manner dependent on ambient glucose concentrations is unknown. Future mechanistic studies will be required to determine the precise role of this variant in DKD; elucidation of its interaction with glycemia in providing protection might be relevant to other molecules implicated in diabetic complications.
In keeping with the collagen pathway, the synonymous exonic variant rs118124843, which reached genome-wide significance for the "Micro" phenotype, is located near DDR1, the gene encoding the discoidin domain-containing receptor 1. Based on chromatin conformation interaction data from Capture HiC Plotter (CHiCP), 53 the rs118124843 containing fragment interacts with six gene promoter regions, including DDR1, suggesting that the variant regulates DDR1 expression across multiple tissues (Table S11). DDR1 is a collagen receptor 54 shown to bind type IV collagen 55 , and is highly expressed in kidneys, particularly upon renal injury 56 . Upon renal injury, Ddr1-deficient mice display lower levels of collagen 57 , decreased proteinuria, and an increased survival rate compared to wild-type controls 58 , with Ddr1/Col4a3 double knockout mice displaying protection from progressive renal fibrosis and prolonged lifespan compared to Col4a3 knockout mice alone 57 . Thus, through its role in collagen binding DDR1 has been suggested as a possible therapeutic target for kidney disease 57 .
The association of rs12615970, an intronic variant on chromosome 2 near the COLEC11 gene, met genome-wide significance for the CKD phenotype, as well as nominal significance for multiple albuminuria-based traits. The rs12615970 containing fragment was found to interact with COLEC11, ALLC, and ADI1 transcription start sites in chromatin conformation data on GM12878 cell line (Table S11) 53,59 . Collectin-11 is an innate immune factor synthesized by multiple cell types, including renal epithelial cells with a role in pattern recognition and host defense against invasive pathogens, through binding to fructose and mannose sugar moieties 60,61 . Mice with kidney-specific deficiency of COLEC11 are protected against ischemiainduced tubule injury due to reduced complement deposition 62 , and mutations in COLEC11 have been identified in families with 3MC syndrome, a series of rare autosomal recessive disorders resulting in birth defects and abnormal development, including kidney abnormalities 63 .
The intronic variant rs144434404, associated at study-wide significance with the microalbuminuria phenotype, resides within the bone morphogenetic protein 7 (BMP7) gene.
BMP7 encodes a secreted ligand of the transforming growth factor-beta superfamily of proteins.
Developmental processes are regulated by the BMP family of glycosylated extracellular matrix molecules, via serine/threonine kinase receptors and canonical Smad pathway signaling.
Coordinated regulation of both BMP and BMP-antagonist expression is essential for developing tissues, and changes in the levels of either BMP or BMP-antagonists can contribute to disease progression such as fibrosis and cancer 64 . BMP7 is required for renal morphogenesis, and Bmp7 knockout mice die soon after birth due to reduced ureteric bud branching [65][66][67] .
Maintenance of Bmp7 expression in glomerular podocytes and proximal tubules of diabetic mice prevents podocyte loss and reduces overall diabetic renal injury 27 . More recently, we have identified a mechanism through which BMP7 orchestrates renal protection through Akt inhibition and highlights Akt inhibitors as potential anti-fibrotic therapeutics 68 . It is also noteworthy that the BMP7 antagonist grem-1 is implicated in DKD [69][70][71] and gremlin has been implicated as a biomarker of kidney disease 72 .
Strengths of this analysis include the large sample size, triple that of the previous largest GWAS; the uniform genotyping and quality control procedures; standardized imputation for all studies (1,000 Genomes reference panel); the inclusion of exome array content; the exploration of multiple standardized phenotype definitions of DKD; and supportive data from various sources of human kidney samples. Several of the loci identified have known correlations with kidney biology, suggesting that these are likely true associations with DKD. However, we acknowledge a number of limitations. First, nine variants have low MAF and were driven by only two cohorts, indicating that further validation will be required to increase confidence in these associations. Second, seven variants were significantly associated with microalbuminuria only, a trait shown to be less heritable in previous studies. Even though the gene-level, gene set and pathway analyses had limited power, these analyses identified several additional potential DKD loci and pathways, some with relevance to kidney biology, that require further follow-up.
Diabetic complications are unquestionably driven by hyperglycemia and partially prevented by improved glycemic control in both T1D and T2D, but there has been doubt over what contribution, if any, inherited factors contribute to disease risk. In line with previous genetic studies, this study with a markedly expanded sample size identified several loci strongly associated with DKD risk. These findings suggest that larger studies, aided by novel analyses and including T2D, will continue to enhance our understanding of the complex pathogenesis of DKD, paving the way for molecularly targeted preventive or therapeutic interventions. and included those with 2 to 20 years of diabetes and excluded those on any antihypertensive medications. Written informed consent was obtained from each participant and the study was approved by the relevant institutional review boards. RASS study participants were followed for 5 years with percutaneous kidney biopsy completed prior to randomization and at 5 years. Structural parameters measured by electron microscopy on biopsy included GBM width, measured by the electron microscopic orthogonal intercept method 17 .

Conceptualization
SUMMIT consortium. The SUMMIT consortium included up to 5193 subjects with type 2 diabetes, with and without kidney disease, of European ancestry. All studies were approved by ethics committees from relevant institutions and all participants gave informed consent 13 .
Complete list of SUMMIT Consortium members provided in Table S13.
Genotyping. Samples were genotyped on the HumanCore BeadChip (Illumina, San Diego, CA, USA), which contains 250,000 genome-wide tag SNPs (and other variants) and over 200,000 exome-focused variants. All samples were passed through a stringent quality control protocol.
Following initial genotype calling with Illumina software, all samples were re-called with zCall, a calling algorithm specifically designed for rare SNPs from arrays. Once calling was completed for all cohorts, our pipeline updated variant orientation and position aligned to hg19 (Genome Reference Consortium Human Build 37, GRCh37). Variant names were updated using 1000 Genomes as a reference. The data were then filtered for low quality variants (e.g. call rates <95% or excessive deviation from Hardy-Weinberg equilibrium) or samples (e.g. call rates <98%, gender mismatch, extreme heterozygosity). Principal Component Analysis (PCA) was performed separately for each cohort in order to empirically detect and exclude outliers with evidence of non-European ancestry. Genotypes were expanded to a total of approximately 49 million by imputation, using 1,000 Genomes Project (phase 3 version 5) as a reference. Genomic features of human kidney. Human kidney-specific chromatin immunoprecipitation followed by sequencing (ChIP-seq) data can be found at GEO: GSM621634, GSM670025, GSM621648, GSM772811, GSM621651, GSM1112806, GSM621638. Different histone markers were combined into chromatin states using ChromHMM 45 .
Glomerular basement membrane measurement in RASS cohort. RASS study participants were followed for 5 years with percutaneous kidney biopsy completed prior to randomization and at 5 years. Structural parameters measured by electron microscopy on biopsy included GBM width, measured by the electron microscopic orthogonal intercept method 17 .

RNA-sequencing of human kidney samples in the University of Pennsylvania cohort.
Human kidney tissue was manually microdissected under a microscope in RNAlater for glomerular and tubular compartments. The local renal pathologist performed an unbiased review of the tissue section by scoring multiple parameters, and RNA were prepared using RNAeasy mini columns (Qiagen, Valencia, CA) according to manufacturer's instructions. RNA quality was assessed with the Agilent Bioanalyzer 2100 and RNA integrity number scores above 7 were used for cDNA production. The library was prepared in the DNA Sequencing Core at University of Texas Southwestern Medical Center. One microgram total RNA was used to isolate poly(A) purified mRNA using the Illumina TruSeq RNA Preparation Kit. We sequenced samples for single-end 100bp, and the annotated RNA counts (fastq) were calculated by Illumina's CASAVA 1.8.2. Illumina sequence quality was surveyed with FastQC. Adaptor and lower-quality bases were trimmed with Trim-galore. Trimmed reads were aligned to the Gencode human genome(GRCh37) with STAR-2.4.1d. The readcount of each sample was obtained using HTSeq-0.6.1 (htseq-count) and then normalized fragments per kilobase million values were used to perform association analysis with fibrosis and sclerosis using linear regression.
Mouse kidney single cell RNA-sequencing. Kidneys were harvested from 4 to 8-week-old male mice with C57BL/6 background and dissociated into single cell suspension as described in our previous study 47 . The single cell sequencing libraries were sequenced on an Illumina HiSeq with 2x150 paired-end kit. The sequencing reads were demultiplexed, aligned to the mouse genome (mm10) and processed to generate gene-cell data matrix using Cell Ranger 1.3 (http://10xgenomics.com) 47 .
RNAseq and microarray profiling of human kidney samples from the Pima cohort. Kidney biopsy samples from the Pima Indian cohort were manually micro-dissected into 119 glomerular and 100 tubule-interstitial tissues to generate gene expression profiles 86  We defined a total of 10 different case-control outcomes to cover the different aspects of renal complications (Figure 1). Five comparisons ("All vs. ctrl", "Micro", "DN", "Macro", and "ESRD vs. macro") were based on albuminuria, measured by albumin excretion rate (AER) from overnight or 24-h urine collection, or by albumin creatinine ratio (ACR). Two out of three consecutive collections were required (when available) to classify the renal status of subjects as either normoalbuminuria, microalbuminuria, macroalbuminuria, or ESRD; for detailed thresholds, see Table S9. Controls with normal AER were required to have a minimum diabetes duration of 15 years; subjects with microalbuminuria/ macroalbuminuria/ ESRD were required to have minimum diabetes duration of 5/ 10/ 10 years, respectively, in order to exclude renal complications of non-diabetic origins. Two comparisons ("ESRD vs. ctrl" and "ESRD vs. non-ESRD") were based on presence of end-stage renal disease as defined by eGFR< 15 mL/min or dialysis or renal transplant. Two phenotypes ("CKD" and "CKD extreme") were defined based on estimated glomerular filtration rate (eGFR; evaluated with the CKD-EPI formula): Controls had eGFR ≥ 60ml/min/1.73m 2 for both phenotypes, and minimum of 15 years of diabetes duration; cases had eGFR <60ml/min/1.73m 2 for the "CKD" phenotype, and eGFR <15 ml/min/1.73m 2 or dialysis or renal transplant for the "CKD extreme" phenotype, and minimum of 10 years of diabetes duration. For the "CKD-DN" phenotype that combined both albuminuria and eGFR data, controls were required to have both eGFR ≥60ml/min/1.73m 2 and normoalbuminuria; cases had both eGFR <45ml/min/1.73m 2 and micro-or macroalbuminuria, or ESRD.
A genome-wide association analysis of each of the case-control definitions was performed using logistic regression under an additive genetic model, adjusting for age, sex, diabetes duration, study site (where applicable) and principal components. As disease onset and progression is also closely related to BMI and HbA1c levels, 89 we conducted a second set of analyses adjusting for BMI and HbA1c which we refer to as our fully adjusted covariate model. Allele dosages were used to account for imputation uncertainty. Inverse-variance fixed effects metaanalysis was performed using METAL and the following filters: INFO score >0.3, minor allele count >10, and presence of variant in at least two cohorts. The X chromosome was similarly analyzed for males and females both separately and in a combined analysis, with the exception of using hard call genotypes in place of allele dosages. The study-wide significance threshold (P<6.76×10 -9 ) was calculated by applying a Bonferroni correction to the traditional GWAS threshold (P<5.00×10 -8 ), based on the number of effectively independent tests, using methods previously described on the eigenvalues of the GWAS summary statistics correlation matrix 14 .

RNA-sequencing of human kidney samples (University of Pennsylvania data).
Normalized fragment per kilobase million values were used to perform association analysis with fibrosis and sclerosis using linear regression. eQTL analysis (Pima data). Analysis was performed with Robust Multi-array Average quantile normalization 90 after removing probes overlapping with variants identified by WGS. Batch effects between platforms were corrected using ComBat 91 and unknown batch effects were also adjusted using singular value decomposition with first four eigenvectors. eQTL mapping was performed using EPACTS (https://genome.sph.umich.edu/wiki/EPACTS) software tool using linear mixed model accounting for hidden familial relatedness, after inverse Gaussian transformation of expression levels, adjusting for age and sex.
Mouse kidney single cell RNA-sequencing. To calculate the average expression level for each cluster, a z-score of normalized expression value was first obtained for every single cell. Then, we calculated the mean z-scores for individual cells in the same cluster, resulting in 16 values for each gene.

DATA AND SOFTWARE AVAILABILITY
All cohorts can share genome-wide meta-analysis summary statistics. Individual level genotype data: due to restrictions set by the study consents and by EU and national regulations, individual genotype data cannot be shared for all cohorts. Table 1. Loci associated with DKD at study-wide (P<6.76×10 -9 , bold) and genome-wide (P<5×10 -8 ) significance. Notable genes from missense variant in the indicated gene (M); intronic, synonymous, or noncoding transcript variant within gene (G); gene nearest to index SNP (N); biological relevance to kidney biology (B). Chr, chromosome; pos, position; EAF, effect allele frequency; OR, odds ratio; min, minimally adjusted covariate model; full, fully adjusted covariate model.