High-throughput complement component 4 genomic sequence analysis with C4Investigator

The complement component 4 gene locus, composed of the C4A and C4B genes and located on chromosome 6, encodes for C4 protein, a key intermediate in the classical and lectin pathways of the complement system. The complement system is an important modulator of immune system activity and is also involved in the clearance of immune complexes and cellular debris. The C4 gene locus exhibits copy number variation, with each composite gene varying between 0-5 copies per haplotype, C4 genes also vary in size depending on the presence of the HERV retrovirus in intron 9, denoted by C4(L) for long-form and C4(S) for short-form, which modulates expression and is found in both C4A and C4B. Additionally, human blood group antigens Rodgers and Chido are located on the C4 protein, with the Rodger epitope generally found on C4A protein, and the Chido epitope generally found on C4B protein. C4 copy number variation has been implicated in numerous autoimmune and pathogenic diseases. Despite the central role of C4 in immune function and regulation, high-throughput genomic sequence analysis of C4 variants has been impeded by the high degree of sequence similarity and complex genetic variation exhibited by these genes. To investigate C4 variation using genomic sequencing data, we have developed a novel bioinformatic pipeline for comprehensive, high-throughput characterization of human C4 sequence from short-read sequencing data, named C4Investigator. Using paired-end targeted or whole genome sequence data as input, C4Investigator determines gene copy number for overall C4, C4A, C4B, C4(Rodger), C4(Ch), C4(L), and C4(S), additionally, C4Ivestigator reports the full overall C4 aligned sequence, enabling nucleotide level analysis of C4. To demonstrate the utility of this workflow we have analyzed C4 variation in the 1000 Genomes Project Dataset, showing that the C4 genes are highly poly-allelic with many variants that have the potential to impact C4 protein function.


Introduction 49
The C4 gene locus, composed of the C4A and C4B genes and located in human chromosomal region 50 6p21.33, encodes for complement component 4 (C4) protein, a key intermediate in the classical and 51 lectin pathways of the complement system(1). The complement system is an important modulator of 52 immune system activity, can activate the innate and adaptive immune response systems(2-4) and is 53 also involved in the clearance of immune complexes and cellular debris. The C4 gene locus exhibits 54 copy number variation (CNV), with each composite gene varying between 0-5 copies per haplotype, 55 and importantly, the gene copy number of C4A and C4B correlate to C4 protein levels(5). C4 genes 56 also vary in size depending on the presence of the HERV-K(C4) retrovirus in intron 9 ( Figure 1A), 57 denoted by C4(L) for long-form and C4(S) for short-form, which modulates expression and is found 58 in both C4A and C4B resulting in four distinct genomic forms of C4 (C4A(L), C4B(L), C4A(S), and 59 C4B(S))(5). 60 C4 is mainly expressed by liver cells, white blood cells, and intestinal epithelial cells(6), but also by 61 central nervous system cells(7). C4 is expressed as two isotypes, C4A and C4B, encoded by the C4A 62 and C4B genes, respectively. The isotypes have nearly identical sequence but are differentiated by a 63 short peptide sequence motif at positions 1120-1125 ( Figure 1B), which are PCPVLD for C4A and 64 LSPVIH for C4B. Additionally, human blood group antigens Rodgers (Rg) and Chido (Ch) are 65 located on the C4 protein at positions 1207-1210(8-10). The Rg epitope is generally found on C4A 66 protein, and the Ch epitope is generally found on C4B protein. The relative locations of the C4A/B 67 specific single nucleotide polymorphisms (SNPs) and the Rg/Ch major epitope encoding SNPs are 68 shown in Figure 1A. 69 C4 CNV has been implicated in the neurological diseases schizophrenia(11,12) and Alzheimer's(13), 70 and there is a large body of evidence connecting C4A deficiency and the development of systemic 71 lupus erythematosus (SLE)(14-16), an autoimmune disease. Additionally, while the role of C4 CNV 72 has yet to be studied in the context of COVID-19 pathology, recent studies have implicated 73 complement hyperactivation with severe SARS-CoV-2 complications (17)(18)(19). 74 Currently, interrogation of C4 CNV is accomplished through digital droplet polymerase chain 75 reaction (ddPCR) (11,20), which is capable of quantifying gene copy number for overall C4, C4A(L), 76 C4A(S), C4B(L) and C4B(S). While this method produces accurate results for C4A and C4B gene 77 copy number and phasing with long and short form, it is intractable for identifying additional 78 sequence variation at scale, including loss of function mutations(21,22) and recombinations(23,24), 79 and is completely blind to novel sequence variation. High-throughput genomic sequence analysis of 80 C4 variants has been impeded by the complex genetic variation exhibited by these genes. One recent 81 tool for assessing C4 sequence variation is the C4A/B analysis workflow hosted on Terra (25) The pipeline is available at: https://github.com/hollenbach-lab/C4Investigator. 129

C4 alignment workflow 130
The structural variation of the C4 gene locus and high-degree of sequence similarity between C4A 131 and C4B necessitates a custom alignment and processing workflow. The first step of the workflow is 132 a Bowtie2(39) alignment to a reference consisting of a short-form of C4B, the long-form of C4A, and 133 TNXB, which is used as a close proximity normalizer gene. Subsequently, the reads aligned to both 134 C4A and C4B are combined, formatted, and indexed according to the aligned read formatting 135 procedure outlined in Marin et al. (2021) to generate an overall C4 alignment used for downstream 136 analysis. The output of this workflow is a C4 depth table spanning from position -285 5'UTR to 137 position 341 3'UTR with depths marked independently for A, T, C, G, deletions, and insertions. 138

C4 copy number determination 139
The median depth of the overall C4 alignment is normalized by the median depth of TNXB to 140 determine the overall C4 gene copy number. Exon 29 TC insertion sequence depth ratio is multiplied by the overall C4 copy to determine the copy 148 of loss of function alleles, this value is subtracted from C4A gene copy number to give the functional 149 C4A copy number. While it is possible for the TC insertion to exist in a C4B sequence, this variant is 150 very rare(40) and there is no solid evidence of it in the datasets we analyzed. A similar approach is 151 utilized for the exon 13 C deletion in C4B to give the functional C4B copy number. 152

C4 sequence analysis 153
The overall C4 depth table is processed to generate a SNP table for positions passing a minimum  154 depth threshold (6 for whole genome sequence data and 20 for targeted sequence data). Heterozygous 155 positions are identified using a depth ratio of 0.5 normalized by the determined C4 gene copy 156 number. The output of this step is an overall C4 SNP C4Investigator was run over both targeted sequencing datasets using a minimum depth of 20 for 177 variant calling and a ratio of 0.50, normalized by the total copy of C4, for heterozygous position 178 identification. C4Investigator results were compared to ddPCR results to provide validation for C4 179 interpretation from targeted sequence data. 180

ddPCR genotyping 181
Gene copy number for C4A, C4B, C4(L) and C4(S) were determined by ddPCR as described 182 previously(11) for 38 samples of African ancestry and 37 samples of European ancestry to provide a 183 copy determination comparison dataset. 184

1000 Genomes Project analysis 185
Reads aligned to C4 and the nearby region were extracted from GRCh38 aligned CRAM files using 186 the coordinates outlined in Investigation into the discordant C4A and C4B samples showed the ratios of C4A were near the copy 222 thresholds for both methods (Figure S1A), further examination into the C4A/B Terra k-mer quality 223 scores showed the discordant samples had a median quality of 9, while concordant samples had a 224 median quality of 62.7 (Figure S1B). A similar analysis was performed for the C4(L) and C4(S) 225 discordant samples, which showed the C4Investigator ratios were near the copy thresholds, while the 226 C4A/B Terra workflow ratios were clustered near the center of the copy intervals ( Figure S2). 227

1000 Genomes Project -C4 copy number analysis 228
Analysis of C4 copy number variation across superpopulations showed most individuals across all 229 superpopulations had 4 copies of C4 overall, 2 copies of C4A, and 2 copies of C4B, and there were 230 very few individuals with 0 copies of C4A or C4B (Figure 2). Outside of these similarities there were 231 stark differences observed between the superpopulations. The African (AFR) and European (EUR) 232 superpopulations had much higher occurrences of 3 overall copies of C4, almost double that 233 observed in the other superpopulations, and much lower occurrences of 5 and 6 overall copies of C4. 234 In contract, the South Asian (SAS) superpopulation had the lowest occurrence of 3 overall copies of 235 C4, but the highest of 5 and 6. One of the largest differences observed was with C4L copy 2 for the 236 AFR superpopulation, which was observed at over double the rate of the other superpopulations; this 237 superpopulation also had substantially lower C4L copy 3 occurrence and virtually no occurrence of 4 238 copies. The C4S copy 0 occurrence for the AFR superpopulation was negligible, while other 239 superpopulations were over 20%. 240

1000 Genomes Project -SNP analysis 241
The SNP tables output by C4Investigator, which represent combined C4A and C4B sequence, were 242 parsed to identify sequence variation, and any identified exonic nucleotide variants are evaluated for 243 amino acid coding change. From these results we have summarized non-synonymous mutations in 244 Table 3, and SNP variation that is not represented in the main assembly of the GRCh38 reference in 245 Figure 3. 246 Analysis of allele frequencies for C4A and C4B non-synonymous exonic sequence variation showed 247 large variations in frequencies across populations ( were only found in the AFR superpopulation. 254 An analysis into non-reference SNVs, which are variants not represented in the main assembly of 255 GRCh38, for the 1KGP dataset across C4A and C4B showed 251 variant positions with total non-256 reference variant copy of at least 10 ( Figure 3A, Table S2 An examination of the proportion of the 1KGP dataset that carry rare variants showed that almost 262 25% of the samples carried exonic variants with global allele frequencies at or below 1% (Figure 3B, 263 Table S3), and about 50% carried intronic variants. Looking at the carrier distribution of more 264 common variants showed that about 70% of the samples carried exonic variants with global allele 265 frequencies below 5%, and about 85% carried intronic variants. 266

1000 Genomes Project -recombinant analysis 267
Analysis of carrier frequencies for C4A/C4B and Rodger/Chido recombinants, C4A-Ch and C4B-Rg, 268 showed higher overall frequencies of the C4A-Ch recombinant compared to C4B-Rg ( copy. An investigation into discordant C4A/B results showed that the discordant samples had lower 292 base quality scores on average ( Figure S1B), with neither method showing clear copy number results 293 for the discordant samples ( Figure S1A). In contrast, the investigation into discordant HERV results 294 showed a marked difference between the two methods, with the C4A/B Terra workflow showing 295 clear copy numbers for these samples while C4Investigator had unclear determinations ( Figure S2). 296 This is likely due to the additional structural variant processing of the C4A/B Terra workflow, which 297 incorporates Genome STRiP (36), a workflow specifically developed for identifying copy number 298 variation in WGS data. The C4A/B Terra is strictly focused on identifying copy number variation, a 299 task that it appears to perform very well. In contrast, C4Investigator takes a different approach, 300 focusing on identifying nucleotide variants in a copy variable system through the utilization of 301 custom alignment processing algorithms, which has enabled the identification and quantification of 302 SNP variation across the C4 genes. 303 An analysis into C4 copy number variation between superpopulations (Figure 2) demonstrated some 304 specific patterns, such as a median overall C4 copy number of 4, and a median copy number for C4A 305 and C4B of 2 each, but also important distinctions between populations, such as the strikingly high 306 number of C4L copy 2 genotypes in the AFR superpopulation, and the general imbalance between 307 overall C4 copy of 3 and 5, which was unique for each superpopulation. Differences of this nature 308 might suggest evolutionary pressure or unique genomic makeups that are specific to the different 309 superpopulations and modulate the fitness of different C4 gene structures. 310 An essential innovation of C4Investigator is demonstrated by its capacity to reveal important 311 differences in sequence variation between populations, with likely important functional implications. 312 An analysis of non-synonymous exonic sequence variants demonstrated that C4 sequence makeup 313 can differ greatly between populations, with some variants with seemingly rare global allele 314 frequencies showing high allele frequencies in specific populations. For example, the p.A1413P and 315 the p.P1530S mutations were absent in most populations, but both had 10.2% allele frequency in the 316 MSL population ( Table 3). The fact that both mutations have the same allele frequency raises the 317 question of if these mutations are in-phase, unfortunately, there is a 2046bp gap between these 318 variants which was outside the scope of our phasing approach. However, an examination of the 319 individuals that carried each mutation showed a high overlap, where 28 individuals carried both 320 mutations compared to total 33 individuals carrying the p.A1413P mutation and 31 individuals 321 carrying the p.P1530S mutation. A structural interrogation of C4·MASP-2 binding shows the 322 p.A1413P mutation occurs in the middle of a MASP-2 exosite(31) (Figure 1), while the change from 323 alanine to proline would not likely change the electrostatic interactions between C4 and MASP-2, it 324 could potentially alter the structure of the binding site. Another sequence variant with potential to 325 impact function is the p.P478L mutation, which causes severe reduction of hemolytic activity by 326 disruption of C5 binding(28). Similar analyses in the context of disease association studies are likely 327 to reveal important insights into immune-mediated pathogenesis. 328 An analysis into C4A and C4B non-reference variants demonstrated that the C4 genes are highly 329 poly-allelic across introns, exons and the HERV region ( Figure 3A). Further examination into rare 330 variant carrier frequencies demonstrated that exonic variants under 5% global allele frequency are 331 carried by around 70% of the 1KGP samples ( Figure 3B). This analysis demonstrates the value of 332 nucleotide level analysis of C4, which reveals important features of genomic variation not otherwise 333 evident with existing methods. 334 One important aspect of SNP variation identification is the ability to phase variants. However, 335 phasing high-copy variants (gene copy number > 2) is very complex and it is difficult to be certain of 336 phasing completeness due to the high potential for missing information. Due to the high sequence 337 similarity between C4A and C4B, the alignments must be treated as a single gene, exacerbating the 338 high-copy phasing problem. We have implemented read-backed phasing that enables us to determine 339 whether two variants in proximity are in-phase, but the potential for missing information means in 340 many cases we cannot make the determination that two variants are not in-phase; essentially, we can 341 make more confident true positive phasing calls than true negative. Because of the distance between 342 the C4A/C4B SNPs and the Rg/Ch SNPs, 440bp, we can determine presence of recombinants 343 between the two SNP groups. An estimate of phasing completeness between C4A-Rg and C4B-Ch 344 showed this phasing approach only missed a small percentage of samples. Utilization of this phasing 345 approach to identify C4A-Ch and C4B-Rg recombinants showed high C4A-Ch carrier frequencies 346 across the AFR superpopulation (Table 4), and appreciable carrier frequencies for the C4B-Rg 347 recombinant and the AMR and SAS superpopulations. 348 In conclusion, C4Investigator fills a critical role in the investigation of C4 variation, processing WGS 349 data to provide C4 copy number variation and full genomic sequence information.    overall copy represents the total copy number of C4A and C4B, C4S represents the total copy number 564 for the short-forms of C4A and C4B, and C4L represents the total copy number for the long-forms of 565 C4A and C4B. AFR = African, AMR = Admixed American, EAS = East Asian, EUR = European, 566 SAS = South Asian. 567 global allele frequency thresholds from 0.00-0.05 for introns, exons, and the HERV region. The y-574 axis represents the total proportion of carriers that carry a non-reference allele that is at or below the 575 global allele frequency threshold on the x-axis. For example, nearly 25% of the 1KGP dataset carried 576 exonic variants with a global allele frequency of 1% or lower. 577