Limited contribution of rare, noncoding variation to autism spectrum disorder from sequencing of 2,076 genomes in quartet families

Genomic studies to date in autism spectrum disorder (ASD) have largely focused on newly arising mutations that disrupt protein coding sequence and strongly influence risk. We evaluate the contribution of noncoding regulatory variation across the size and frequency spectrum through whole genome sequencing of 519 ASD cases, their unaffected sibling controls, and parents. Cases carry a small excess of de novo (1.02-fold) noncoding variants, which is not significant after correcting for paternal age. Assessing 51,801 regulatory classes, no category is significantly associated with ASD after correction for multiple testing. The strongest signals are observed in coding regions, including structural variation not detected by previous technologies and missense variation. While rare noncoding variation likely contributes to risk in neurodevelopmental disorders, no category of variation has impact equivalent to loss-of-function mutations. Average effect sizes are likely to be smaller than that for coding variation, requiring substantially larger samples to quantify this risk.


Introduction
The rapid progression of genomics technologies, coupled with expanding cohort sizes, has led to significant progress in characterizing the genetics of autism spectrum disorder (ASD). To date, studies of ASD cohorts have included genotyping array technologies to survey large copy number variations (CNVs) 1-6 and common variants, 7,8 exome sequencing to scan the protein coding genome, 1,9-16 and long-insert sequencing to identify large chromosomal abnormalities. 17,18 While genetic variation across the allele frequency spectrum influences ASD risk, 19 robust discovery of specific genetic loci has been driven by the identification of extremely rare de novo mutations that are predicted to disrupt protein coding genes. Since these mutations are newly arising in the child, they receive limited exposure to natural selection and can therefore exert considerable risk for ASD, given the well documented reduction in fecundity in ASD cases. 20 Two factors have driven locus discovery in ASD: the presence of critical sites in coding genes that, when mutated, severely disrupt gene function leading to dramatic biological consequences, and the ability to predict such disruption based on gene models, either through large-scale deletion or the annotation of point mutations using the triplet genetic code.

Most ASD subjects do not carry either gene disrupting point mutations or large de novo CNVs, 1 hence assaying de novo noncoding mutations could identify uncharacterized reservoirs of genetic risk. Yet, while the vast majority of de novo mutations (97%) arise outside the coding genome, they present an interpretive challenge. Unlike the coding region, we do not have the same cipher, the triplet code, to predict which nucleotides will critically alter gene function when mutated and which will be functionally inert.
Association of noncoding variation with complex traits is well-documented, with the overwhelming majority being common variants mapping outside of gene regions and often in proximity to putative regulatory domains. These common variant associations typically have modest effect sizes. While the impact on gene expression levels, splicing events, or other regulatory processes is defined for some noncoding [...] (RR) of rare noncoding variants will be modest, they will be distributed widely across the genome, and sample sizes required to identify them will need to be substantially larger.

Cohort selection and characteristics
All 519 cases were selected from the SSC based on the absence of de novo loss-of-function mutations or large de novo CNVs in prior data, with the objective of enriching for undiscovered de novo variation. The majority of cases (92%, N=480/519) were selected randomly after this exclusion; however, the remaining 8% were selected for a pilot study 25 to increase the representation of older fathers, female cases, and cases with comorbid intellectual disability (ID; defined here as nonverbal IQ ≤70), all of which have been associated with increased rates of protein-damaging mutations. 1 Of the 519 WGS cases, 10.6% are female, which is lower than the 15.0% (p = 0.02) in cases excluded due to known de novo mutations and the 14.1% (p = 0.04) in the remainder of the SSC without WGS data. No significant differences were observed in the fraction of cases with ID, which were 25.8%, 26.0% and 25.2%, respectively.

The contribution of coding de novo mutations to neurodevelopmental disorders is a continuum ranging from severe intellectual disability, with de novo loss-of-function mutations contributing risk in 18% of cases in the Deciphering Developmental Disorders (DDD) cohort, 26 to later-onset disorders, such as schizophrenia in which de novo loss-of-function mutations are unlikely to contribute to more than 2% of cases. ASD falls between these two extremes, with about 7% of SSC cases carrying such mutations. The contribution of inherited (largely common) variation appears to run in the opposite direction, as reflected by the high sibling recurrence rates in ASD 27 and schizophrenia 28 compared to ID cases. 29 Given this relationship, we predicted common variant ASD burden from microarray data of the 1,631 families in the SSC of European ancestry (Extended Data Fig. 1). As expected, we observed a lower burden of common variant risk in cases excluded due to known de novo mutations than in our WGS cohort and the remainder of the SSC (p=0.03, one-sided t-test), but no difference between our cohort and the remainder of the SSC.

Single nucleotide variants and insertion-deletions
Single nucleotide variants (SNVs) and small insertion-deletions <50 bp (indels) were discovered in the new WGS subset using the Genome Analysis ToolKit (GATK), 30 and family structure was leveraged to define high quality calls (Extended Data Fig. 2-5). Overall, we identified 3.7 million high quality, autosomal variants per individual, including 3.4 million SNVs and 0.3 million indels. From these variants, de novo SNVs and indels were predicted using multiple detection algorithms and excluding low complexity regions. These predictions were ensured to be of high confidence by tuning and subsequent validation (Extended Data Fig. 5 [...]

The sheer diversity and complexity of noncoding functional annotations necessitate a strategy to interpret the multiple parallel hypotheses. We first assessed whether there was evidence of an excess of variants in cases within regions of the genome defined by genes. As noted, the cohort included only cases that did not carry a de novo loss-of-function coding mutation in prior analyses by WES. 1 Using Gencode gene definitions, we surveyed four coding categories, e.g. missense, and seven noncoding categories, e.g. UTRs (Fig. 1). In all analyses, we tested for an enrichment of mutations mapping to these regions in cases compared to their sibling controls, and then assessed the significance of this enrichment using 10,000 case/control label-swapping permutations comparing the number of de novo mutations corrected for paternal age and sequencing quality metrics. This analytical approach is used throughout the manuscript, unless otherwise noted. After correcting for multiple comparisons, no significant excess of de novo variants in any gene-defined category was observed. We repeated the analysis considering SNVs and indels separately (Extended Data Fig. 9-10), and considering only variants within or near to one of 179 genes associated with ASD at a liberally defined false discovery rate (FDR < 0.3). 1 Only an excess of de novo missense mutations is apparent (Fig. 1) [...]

We next designed an unbiased WGS-association framework for the noncoding genome in ASD. We integrated five approaches to annotation: 1) ASD-associated gene lists (e.g., targets of FMRP); 2) functional annotation (e.g., chromatin state); 3) conservation across species; 4) type of variant (SNV, indel); and 5) gene-defined categories described above. In total we surveyed 51,801 non-redundant annotation categories derived from combinations of these five annotation approaches. In the absence of a clear a priori hypothesis, we treated all of these category comparisons equally and compared the burden of de novo mutations in cases vs. controls (Fig. 2a) in a category-wide association study (CWAS). The most strongly associated categories were from coding variants, while the top noncoding category was from mutations underlying H3K36me3 peaks that were nearer to lincRNAs than to other transcripts (Table 1).
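The case/control label-swapping permutation used for these burden comparisons can be illustrated with a minimal sketch. It assumes, as a simplification not confirmed by the text, that labels are swapped within each case-sibling pair and that the per-child counts have already been adjusted for paternal age and sequencing quality. The function name `label_swap_pvalue` and its arguments are hypothetical.

```python
import random

def label_swap_pvalue(case_counts, sibling_counts, n_perm=10_000, seed=0):
    """One-sided permutation p-value for an excess of de novo variants in cases.

    case_counts / sibling_counts: per-family counts of de novo variants in one
    annotation category (assumed already corrected for covariates).
    """
    rng = random.Random(seed)
    observed = sum(case_counts) - sum(sibling_counts)
    hits = 0
    for _ in range(n_perm):
        diff = 0
        for c, s in zip(case_counts, sibling_counts):
            # Swap the case/control labels within a family with probability 0.5.
            diff += (c - s) if rng.random() < 0.5 else (s - c)
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy data: 20 families in which every case carries more variants than the sibling.
p_excess = label_swap_pvalue([5] * 20, [0] * 20, n_perm=1_000)
# Toy data: identical counts, so no permutation can be less extreme than observed.
p_null = label_swap_pvalue([1, 2, 3], [1, 2, 3], n_perm=100)
```

In a category-wide setting, the same routine would be run once per annotation category, with the resulting p-values then subject to multiple-testing correction.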

Table 1. Burden results for most significant or previously implicated annotation categories
While no single category met this threshold, we considered whether there was evidence of a tendency towards enrichment of categories in cases, suggesting an underlying signal. We therefore counted the number of nominally significant categories and compared this to expectation based on permutation and controls (Fig. 2b). We observed more significant tests than expected in cases in coding regions (p = 0.01) but not noncoding regions (p = 0.21), both overall and near ASD genes. This result gives important insight into genomic architecture; as cohort size increases we should anticipate that noncoding signal will remain weaker than the coding signal, unless annotation approaches improve dramatically. Moreover, since cases with known loss-of-function coding mutations were excluded from this sample, this suggests that the noncoding signal will likely be more modest than the signal from missense coding mutations. Interestingly, tests of annotation categories for de novo indels separate from SNVs showed a greater number of significant results than expected, and this enrichment was stronger for noncoding (p = 0.04) than coding indels (p = 0.10). Indels may represent a sweet spot for statistical power in interrogating the noncoding genome; they can disrupt regulatory elements to a greater degree than SNVs by virtue of their size while being detected in considerably greater numbers than SVs.

To further assess the role of rare noncoding variation for ASD we developed a polygenic risk score based on de novo variants, akin to similar scores developed previously for common and rare variants. 34,35 The rate of de novo mutations in cases and controls was weighted based on the category RR and adjusted for p-value correlation structure (Fig. 3). Cross validation was used to select annotation categories that best predicted case-control status.
In keeping with the modest differences observed between cases and controls, the derived score was not able to accurately predict case status, further supporting a limited role for rare noncoding mutations in this cohort. Of note, this model did not explicitly highlight the contribution of coding mutations, with the majority of selected categories relating to overall de novo burden (e.g. all variants, all intronic variants, and all intergenic variants). However, the model did highlight the role of two other functional annotations: conservation scores across vertebrate species and variants near long intergenic noncoding RNAs (lincRNAs, Fig. 3). Though neither finding is significant after correcting for multiple comparisons (Fig. 3)

Structural variation
Though no definitive noncoding signal was observed for small mutations, the strongest trends were observed in indels, in keeping with their larger size and presumed greater disruption to regulatory elements than SNVs (Fig. 2b). Following this logic, we assessed whether structural variants (SVs), which can rearrange and potentially disrupt large segments of the genome, might demonstrate a noncoding signal. We integrated the results of seven prediction algorithms to capture both changes in read-depth (three algorithms) and clusters of anomalously pairing reads indicating an SV breakpoint (four algorithms; see Online Methods). We then developed a series of post hoc algorithms, called RdTest, to correct for the limited concordance among individual algorithms (Extended Data Fig. 22). The method jointly tests for a significant difference in the read-depth signal supporting each predicted CNV against the normalized cohort background, and performs local k-means clustering to predict the likely presence of multiple copy states. We next integrated the statistically significant CNV segments with predicted balanced events using a series of breakpoint linking methods to identify signatures of 10 canonical balanced and complex SV classes, 36 of which 64.5% altered copy number (e.g., paired-duplication inversion 37) and 35.5% were copy number neutral. [...]

We compared standard WGS to 1,332 high quality CNVs previously reported from microarray data in the SFARI cohort (Extended Data Fig. 24), 1 and observed an overall sensitivity of >99% and a 5.2% false discovery rate (FDR). We relied on long-insert WGS (liWGS; 3.5 kb inserts, median physical coverage of 102x) to validate SVs undetected with microarray (including small CNVs, copy-neutral balanced SV, and complex SV) and found a 4.3% overall FDR for 2,238 SV calls (Extended Data Fig. 24).
Consistent with the comparisons to microarray and liWGS, cross-site validation using PCR and Sanger sequencing confirmed 92.3% of our predictions, suggesting high specificity from these analyses, very likely at the cost of sensitivity for small variants (see Methods), though we have no gold standard to determine this with certainty.

These analyses predicted 105 de novo SVs in the cohort, including 92 germline and 13 apparent mosaic SVs (Extended Data Figs. 25-27). In addition, we found that five subjects had sex chromosome aneuploidies (0.7% of SSC probands, 0.2% of siblings; Extended Data Fig. 28), and discovered nine SVs initially predicted to arise de novo that demonstrated evidence of germline mosaicism in a parent. Given the rarity of de novo SVs, there were limited data to derive insights comparable to those from de novo SNVs and indels. There was no significant difference in de novo SV burden between cases and controls (see Methods for sibling comparisons), though we did observe a small increase in risk among cases (RR = 1.53, p = 0.07). There was also a non-significant enrichment in ASD cases for de novo SVs localized to exons (2.3% versus 0.6%; RR = 3.7; p = 0.06), suggesting that there is a slightly increased burden of previously undetected SVs that disrupted protein coding sequence in ASD cases, and this result was more pronounced if we excluded multi-allelic (n = 20) and mosaic (n = 13) SVs (RR = 9, p = 0.02). There were de novo SVs that represented potential loss-of-function variants within ASD-associated genes, which included an exonic deletion of CHD2 and a balanced translocation that disrupted GRIN2B (Fig. 4).
Several other genes were disrupted by SVs in cases that were predicted to be intolerant to loss-of-function mutations (pLI ≥ 0.9 32), but not associated with ASD from TADA analyses (LNPEP, PAK7, SAE1, ZNF462, DMD), while one such disruption occurred in a sibling (USP34). Overall, these analyses suggest that de novo loss-of-function SVs that were intractable to microarray may translate to a 1. [...]

Finally, we identified signatures of large SVs that were not detected by microarray in the SSC, revealing that 0.9% of ASD cases (N=5) harbored a large balanced chromosomal abnormality (>3 Mb), and 429 CNVs >40 kb were detected by WGS but not microarray (Extended Data Fig. 24). Despite this improved power and resolution for SV detection, we found no significant differences in the rate of rare inherited SV as a mutational class in ASD, nor did we observe any evidence of biased transmission of any class of SV from either parent (Extended Data Fig. 29). [...] Fig. 30). Similarly, no excess of mutations was observed by further filtering to variants in proximity to 179 ASD genes defined by WES at a false discovery rate of 0.3 1 (Extended Data Fig. 30-31). Contrary to previously published analyses, we also find no evidence of enrichment for disruption of DHS sites in proximity to all genes, or ASD-associated genes, at any sliding window distance extending up to 1 Mb (Extended Data Fig. 32), nor did we observe enrichment of paternally inherited SV disrupting any class of functional annotation in proximity to all genes, constrained genes, or those genes previously associated with ASD.

Integration and estimation of noncoding risk in ASD
An excess of de novo loss-of-function mutations and of de novo missense mutations has previously been described in WES data with RRs of 1.75 and 1.15, respectively. 14 Resampling these WES data finds that about 300 families are required to observe the de novo loss-of-function burden (80% power, alpha = 0.05), while over 1,500 families would be necessary to observe the de novo missense burden (Fig. 3). If we count the number of de novo missense mutations in cases versus controls in the current WGS sample, the RR is only slightly inflated in cases (414/404 = 1.02) and it is not significantly different from 1.00, as expected from this power calculation. If, with the benefit of hindsight, we consider only 179 genes previously associated with ASD at a liberal false discovery rate of 0.3 1 as a sole endpoint of our analyses, we find a much higher RR of 2.6 (21/8), which is significantly different from 1.0 (p = 0.01, one-sided binomial test, Fig. 2). As noted, however, this result does not survive correction for multiple comparisons and it is probably somewhat biased by the inclusion of these 519 families in the original WES analyses that defined the 179 genes. Moreover, filtering missense mutations instead by conservation, constrained genes, or brain-expressed genes, does not yield nominally significant evidence for risk.

These results give important context to interpreting the WGS data for 519 families and for the larger sample sets of the future. At 519 families, we should expect a noncoding signal equivalent to de novo loss-of-function to be nominally significant (alpha = 0.05), but not expect this of a signal equivalent to de novo missense until the sample size exceeds 1,500 families. As noted (Fig. 2), the noncoding signal we observe is weaker than that seen for de novo missense mutations.
Furthermore, the best chance of achieving a significant test lies in integrating data that enriches for ASD-associated signal, such as proximity to ASD genes. Yet, when we searched over the space of de novo SNVs, indels, SVs, and rare homozygous variants, they showed no detectable concentration near bona fide or even likely ASD genes. Nor did these variants concentrate in any particular region of the genome, as could occur if disruption of a particular noncoding region were associated with large relative risk. Finally, they did not concentrate notably in any annotation category that we tested.

Without the triplet genetic code of the protein coding sequence we could not have distinguished loss-of-function, missense, and silent variants in the exome data and would expect a RR of 1.12 for all de novo mutations in coding regions. We would require 1,000 families to detect this burden (80% power, alpha = 0.05), over three-fold more than required to detect loss-of-function alone. This analogy represents the challenge of assessing noncoding regulatory risk from WGS data, exacerbated by the likelihood that regulatory variants are, as a group, unlikely to confer the same level of risk as loss-of-function variation. Moreover, because we have yet to discover the functional elements critical for disease risk and cannot specify them a priori, the analysis requires a search over a large number of putatively functional elements and mandates far more stringent thresholds for statistical association, as we have used.

To estimate the sample sizes required to discover annotation categories enriched for noncoding variation, we performed a power calculation across estimates of RR and numbers of variants per annotation category.
Because these categories show complex correlation structure, and therefore simple corrections for multiple testing are inappropriate, we used eigenvector analysis to estimate the effective number of tests conducted. As sample size increases, the correction for number of categories becomes somewhat larger due to increased likelihood of observing a total number of de novo mutations in any given annotation category that is sufficient to achieve significance: the number of effective tests increases from ≈4,200 at 519 families to ≈7,600 at 4,000 families and approaches an asymptote of ≈10,000 (Fig. 3). The multiple testing burden produces a threshold for statistical significance on the order of 5 × 10⁻⁶. In this setting, over 4,000 families would be necessary to discover a noncoding element equivalent to missense variation.
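The arithmetic quoted in this section can be checked with a short sketch. It assumes that the one-sided binomial test compares 21 case mutations against 8 control mutations under a null probability of 0.5, and that the significance threshold is a Bonferroni-style division of alpha = 0.05 by the effective number of tests. Both are readings of the text rather than confirmed details of the analysis, and the function names are hypothetical.

```python
from math import comb

def one_sided_binomial_p(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def bonferroni_threshold(n_effective_tests, alpha=0.05):
    """Per-test significance threshold for a given effective number of tests."""
    return alpha / n_effective_tests

# De novo missense counts, cases versus sibling controls.
rr_all_missense = 414 / 404          # all genes
rr_asd_genes = 21 / 8                # 179 ASD-associated genes
p_asd_genes = one_sided_binomial_p(21, 21 + 8)

# Thresholds at the effective test counts reported for 519 families,
# 4,000 families, and the asymptote.
thresholds = {n: bonferroni_threshold(n) for n in (4_200, 7_600, 10_000)}

print(f"RR (all missense) = {rr_all_missense:.2f}")
print(f"RR (ASD genes)    = {rr_asd_genes:.1f}, one-sided p = {p_asd_genes:.3f}")
for n, t in thresholds.items():
    print(f"{n:>6} effective tests -> threshold {t:.1e}")
```

Under these assumptions the asymptotic figure of about 10,000 effective tests gives a threshold of 5 × 10⁻⁶, matching the order of magnitude stated above.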

Conclusion
Refinements in DNA sequencing, computing capability, and statistical analyses now permit simultaneous evaluation of the coding and noncoding genome in many thousands of individuals. This eventually will precipitate a sea change in how we interpret the impact on ASD risk of rare variation throughout the genome. Yet, the complexity of the noncoding genome complicates interpretation for both de novo and inherited variation, and there are perils in underestimating its complexity. A priori prediction by experts of which regulatory elements of the noncoding genome should be important will limit the number of tests evaluated, and one could argue this limits the required correction for multiple testing. We find this argument wanting in terms of establishing a robust, unbiased framework to interpret disease association. Perhaps the simplest way to understand why is by analogy to common variants and a comparison of current-day genome-wide association studies (GWAS) versus the candidate gene tests of a previous era. GWAS results have a good record for replication, in large part because the field requires, for any study, large samples and appropriate correction for multiple testing. By contrast, despite investigator intuition about what genes are important to disease risk, candidate gene studies have had a miserable record regarding replication. This history of candidate gene studies, with a plethora of false positive and a paucity of true results, 41 should make us highly skeptical of methods based on investigator-selected a priori hypotheses in the noncoding genome. Continuing the analogy, instead of candidate genes, the field would be substituting "candidate annotations", with all likelihood of worse outcomes, due to myriad combinations of annotation, cell type, brain region, and developmental stage.

We anticipate that large-scale functional assays will continue to provide increasingly insightful annotation of the regulatory genome enabling future studies to better characterize and quantify the precise contribution of noncoding regulatory variation to ASD. In addition, high-throughput methods to validate noncoding variant function, such as STARR-Seq, 42 for which there is no equivalent for coding missense mutations, could refine noncoding signals, potentially to the degree of implicating specific noncoding loci. Until that time, we recommend the GWAS path for WGS studies: rigorous evaluation of multiple hypotheses and appropriate correction for that multiplicity, as we have outlined here. If we hold to these standards, it will require very large sample sizes to make headway, but we predict that the ensuing inferences will be sound and replicable.

Detection of high quality SNVs and indels
As we had no established best practices or predetermined filtering criteria available for rare variants in WGS data, we developed an optimized set of thresholds for various quality metrics to detect rare SNVs and indels. For this, we compared two sets of rare variants which have the most distinct quality metrics: 1) private transmitted variants (only observed in one family and no frequency given in the 1000 Genomes Project or ExAC database), which are likely true variants, and 2) variants that are Mendelian violations in at least one child but are also observed in an unrelated individual, which are likely false positive calls. The ability of individual quality metrics obtained from the final VCFs to distinguish these true variants from false variants was assessed using receiver operating characteristic (ROC) curves. The metric and threshold that yielded the maximum increase of specificity and the minimum decrease of sensitivity was selected, after which the training set was filtered by these criteria and the process repeated. This sequential ROC analysis was repeated until we no longer observed improvement in sensitivity and specificity. [...]

To identify high quality de novo variants from the call set, we applied the same sequential ROC approach as above with true positive calls defined by PCR and Sanger validated de novo mutations from prior work (1,302 selected SNVs; 95 selected indels). Sequential ROC curve analyses were applied to all variant- and individual-level quality metrics for the child and both parents. This analysis predicted 87.3% sensitivity and 98.8% specificity for SNVs using 3 additional metrics, and 86.3% sensitivity and 93.0% specificity for indels using 4 additional metrics.
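The sequential ROC procedure described above can be sketched as a greedy loop: pick the metric and cut-off that best separate likely-true from likely-false variants, filter both sets, and repeat. The sketch below is an illustration under stated assumptions rather than the pipeline itself: it uses Youden's J (sensitivity + specificity − 1) as a stand-in selection criterion, assumes higher metric values indicate better calls, and uses invented metric names (`qual`, `depth`) and toy values.

```python
def best_cut(true_vals, false_vals):
    """Return (threshold, J) for the rule 'keep value >= threshold' maximizing Youden's J."""
    best = (None, 0.0)
    for t in sorted(set(true_vals) | set(false_vals)):
        sens = sum(v >= t for v in true_vals) / len(true_vals)
        spec = sum(v < t for v in false_vals) / len(false_vals)
        j = sens + spec - 1
        if j > best[1]:
            best = (t, j)
    return best

def sequential_roc(true_set, false_set, metrics, min_gain=0.01):
    """Greedily choose (metric, threshold) filters until no metric adds separation."""
    filters = []
    remaining = list(metrics)
    while remaining and true_set and false_set:
        scored = [
            (best_cut([v[m] for v in true_set], [v[m] for v in false_set]), m)
            for m in remaining
        ]
        (thr, j), metric = max(scored, key=lambda s: s[0][1])
        if thr is None or j < min_gain:
            break
        filters.append((metric, thr))
        # Filter both training sets by the chosen rule before the next round.
        true_set = [v for v in true_set if v[metric] >= thr]
        false_set = [v for v in false_set if v[metric] >= thr]
        remaining.remove(metric)
    return filters

# Toy training sets (hypothetical values).
likely_true = [
    {"qual": 90, "depth": 30}, {"qual": 80, "depth": 28},
    {"qual": 85, "depth": 10}, {"qual": 95, "depth": 35},
]
likely_false = [
    {"qual": 20, "depth": 29}, {"qual": 30, "depth": 5},
    {"qual": 85, "depth": 8}, {"qual": 25, "depth": 31},
]
chosen = sequential_roc(likely_true, likely_false, ["qual", "depth"])
```

On the toy data the first round selects the `qual` cut-off, and the second round uses `depth` to remove the one false call that survived it, which mirrors the idea of adding metrics only while they still improve separation.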

Validation of high quality de novo SNVs
From the 66,366 high quality de novo SNVs, 250 mutations were selected at random (based on available DNA) for validation in the child and both parents using PCR amplification and high-throughput sequencing on an Illumina MiSeq. We examined PCR products from all 250 child reactions on a gel and 13 (5%) failed to make a product and were excluded from the analysis. Of the remaining 237 putative mutations, we observed an overall mean coverage of 26,818X. Based on investigation of off-target coverage, we determined that a depth of coverage ≥ 50X was required to ensure an accurate genotype and any samples that failed to achieve this coverage were considered sequencing failures due to insufficient depth. All putative mutations in the child met this threshold; however, for 7 of these, no variant was detected in the child. In the remaining 230 putative mutations, 18 had insufficient coverage in one or more parents and were excluded from the analysis. The remaining 212 putative mutations with sufficient coverage in the child and both parents all validated as de novo; no inherited variants were observed. Our overall confirmation rate for de novo SNVs was therefore 96.8% (212/219; 212 validated versus 7 with sufficient coverage but no variant in the child).
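The bookkeeping behind the SNV confirmation rate can be followed step by step in a short sketch. The counts are taken from the paragraph above; the helper name `confirmation_rate` is hypothetical.

```python
def confirmation_rate(validated, inherited, absent_in_child):
    """Fraction of assessable candidates that were confirmed as de novo."""
    assessable = validated + inherited + absent_in_child
    return validated / assessable

selected = 250
after_pcr = selected - 13        # 13 child reactions gave no PCR product
absent = 7                       # adequate depth, but no variant seen in the child
after_child = after_pcr - absent
validated = after_child - 18     # 18 lacked sufficient coverage in a parent

snv_rate = confirmation_rate(validated, inherited=0, absent_in_child=absent)
print(f"{validated}/{validated + absent} = {snv_rate:.1%}")
```

Candidates excluded for technical reasons (no PCR product, insufficient parental coverage) do not enter the denominator, which is why the rate is computed over 219 rather than 250.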

Validation of high quality de novo indels
From the 9,961 high quality de novo indels, 250 indels (125 non-coding deletions and 125 non-coding insertions) were selected at random for validation using PCR amplification and high-throughput sequencing on an Illumina MiSeq. Of these, 16 were larger than 50 bp and were excluded from the analysis (de novo confirmation rate of 6%). We examined PCR products from all of the remaining 234 child reactions on a gel and 7 (3%) failed to make a product and were excluded from the analysis. Of the remaining 227 putative mutations, we observed an overall mean coverage of 19,461X; however, 7 failed to meet our threshold of ≥ 50X coverage in the child and were excluded from the analysis. Of the remaining 220 putative mutations, 75 failed to identify a variant in the child despite adequate coverage. In the remaining 145 putative mutations, 8 had insufficient coverage in one or more parents and were excluded from the analysis. Of the remaining 137 putative mutations with sufficient coverage in the child and both parents, 131 validated as de novo and 6 were inherited from one parent. Our overall confirmation rate for our first round of de novo indels <50 bp was therefore 61.8% (131/212; 131 validated versus 6 inherited indels and 75 with sufficient coverage but no variant in the child). [...]

We also attempted validation for four putative mutations in known ASD-associated genes: one SNV in ADNP, chr20:49548007; two SNVs in GABRB3, chr15:26327365 and chr15:26327513; and one indel in NRXN1, chr2:51259257. All four mutations were validated as de novo.

Detection of high quality de novo structural variants
Algorithm integration and variant adjudication: We used a two-tier SV detection pipeline, in which we integrated four paired-end/split-read (PE/SR) algorithms and three read-depth (RD) algorithms to discover a maximal list of candidate SV loci, then adjudicated each predicted variant with a joint analysis of the cohort that included a statistical test for likely de novo status of each alteration. Our pipeline incorporated PE and SR calls from Delly v0.7.3, 61 Lumpy [...] the SSC, as we have done previously in this cohort with large SVs. 36 We applied the algorithm integration pipeline for PE/SR calls described above to obtain a set of candidate inversion and translocation breakpoints. We first used bedtools to overlap these breakpoints with the CNV loci predicted to be significant by RdTest to identify complex SV with large associated CNV, then to identify candidate pairs within the remaining breakpoints that could constitute a resolved SV. We resolved the variant structure at each of these loci by matching the ordering of breakpoints to complex SV signatures previously identified by Collins et al., 36 and used RdTest to evaluate read-depth support at novel CNV sites associated with complex inversions. We identified 19,342 observations of 127 such inversion-associated CNV between 300 bp and 4 kb that were not found with the CNV discovery pipeline, as they lacked canonical PE/SR evidence and were below RD-only algorithm resolution. In total, we identified 38,658