De novo indels within introns contribute to ASD incidence

Copy number profiling and whole-exome sequencing has allowed us to make remarkable progress in our understanding of the genetics of autism over the past ten years, but there are major aspects of the genetics that are unresolved. Through whole-genome sequencing, additional types of genetic variants can be observed. These variants are abundant and to know which are functional is challenging. We have analyzed whole-genome sequencing data from 510 of the Simons Simplex Collections quad families and focused our attention on intronic variants. Within the introns of 546 high-quality autism target genes, we identified 63 de novo indels in the affected and only 37 in the unaffected siblings. The difference of 26 events is significantly larger than expected (p-val = 0.01) and using reasonable extrapolation shows that de novo intronic indels can contribute to at least 10% of simplex autism. The significance increases if we restrict to the half of the autism targets that are intolerant to damaging variants in the normal human population, which half we expect to be even more enriched for autism genes. For these 273 targets we observe 43 and 20 events in affected and unaffected siblings, respectively (p-value of 0.005). There was no significant signal in the number of de novo intronic indels in any of the control sets of genes analyzed. We see no signal from de novo substitutions in the introns of target genes.


Introduction
We have made great strides in our understanding of the genetic determinants of autism over the 26 past decade. These come largely from the search for new germ line (de novo) mutations in simplex 27 families, that is, those with a single affected child. The major signal comes from exome sequence data, 28 and in particular from the mutations that disrupt protein coding sequences [1,2]. The best estimate of 29 the contribution from de novo mutation derives from the observed differential incidence rates in 30 affected and unaffected siblings, and extrapolates to about 30%. Using a variety of methods for analysis 31 of the number of recurrent gene targets, we can further estimate that the number of strongly penetrant 32 causal targets for de novo mutation is on the order of 500 genes [1]. Using the observation that target 33 genes, and especially recurrent target genes, are enriched for genes under strong negative selective 34 pressure in humans, we can now identify on the order of 200 excellent candidate target genes, those 35 that are both targets and under strong selective pressure [3]. 36 Potentially, we can learn more from whole genome sequencing data, although the rules for 37 interpreting such data are not yet clear. Two recent reports that studied the relationship between non-38 coding variants and autism demonstrate these difficulties and the need for analysis of whole-genome 39 data from large collations [4,5]. In this comparatively large study, we focus on mutations within introns. 40 Several observations show that abnormal splicing is a major mechanism for damaging alleles. About 50% 41 of the genetic variants underling NF1 [6] and ATM [7] result in abnormal splicing. Also, more than 50% of 42 the variants associated with human phenotypes in the GWAS catalog [8] are within introns. With the 43 whole genome sequencing data, we are for the first time able to systematically examine the 44 contribution to autism from intronic mutations. 45 In this study, we compare the incidence of de novo mutation within the introns of affected and 46 unaffected children from the SSC, within all genes, and within target genes. Although we see no 47 significant differences over all genes, we find a statistically significant excess of de novo intronic indels in 48 suspected autism target genes. We see no signal from de novo intronic substitutions. We estimate by 49 extrapolation of the known target gene class size that de novo indels in introns of target genes 50 contribute to about 10% of the affected within simplex families. In the Discussion, we further revise 51 upwards our estimate of the total contribution of de novo events to autism. 52 53

Counts and significance of intronic events 55
We have whole genome sequencing from 510 quad families from the Simons Simplex Collection 56 (SSC) [9]. The first 510 families were chosen to have no de novo LGDs or CNVs in the exomes of the 57 children. We catalogued for all de novo substitutions and indels (of size not exceeding 50 bp) using the 58 multinomial genotyper we have previously employed [10]. All ~2000 de novo intronic indels (DIIN) and 59 all ~20,000 de novo intronic substitutions (DISB) are listed in Supplementary Tables I and 2 by event, and  60 by gene in Supplementary Table 3. We did not validate any of the DISB, as previous experience indicates 61 that almost all would be confirmed. We validated several dozen of the DIIN using previous methods [10], 62 and only 4% were false positives, similar to our rates from whole exome sequencing [1], and not 63 sufficiently large to cast doubt on the findings we now describe. 64 The counts of de novo intronic events are summarized in Table 1. These are separated into DIIN 65 (top half of Table 1) and DISB (bottom half of Table 1), as 'events in affected' or 'events in unaffected' 66 siblings. The counts are for events in 'all genes' or divided into classes of genes by the type of target (the 67 rows defined in column 'gene set'), with the 'number of genes' in a target type as tabulated. The first 68 sub-type is called 'affected LGD targets' contains the 546 genes that have been targeted by de novo LGD 69 mutations in 5,000 affected children. We further divide the 'affected LGD targets' in two equal halves 70 based on 'protection'. Protection is the extent to which each of the genes is under purifying selection 71 reflected by the extent of damaging mutations found in the human population [3]. The first half contains 72 the more protected LGD targets ('affected LGD targets, protected') and the second half contains the less 73 protected LGD targets ('affected LGD targets, unprotected'). We analyzed five additional control gene 74 sets defined based on observed de novo missense and synonymous mutation in the ~5,000 affected 75 children or based on observed de novo LGD, missense, and synonymous mutations in ~2,000 unaffected 76 children. The difference in counts of events between discordant siblings is called 'delta'. 77 The remaining columns reflect three distinct methods for determining the significance of the delta. 78 The first method (column 'chi2 p-value') is based on a chi-square test. The second and third methods are 79 based on 10,000 permutations to develop empirical distributions on delta for each row. The p-value is 80 the proportion of permuted deltas that were greater or equal to the empirically observed delta. For the 81 column 'status perm. p-value' in each permutation we randomly assign the affected and unaffected 82 status labels of sibling pairs. In the column 'gene perm. p-value', we randomly select genes with similar 83 cumulative intron length. The second and third methods are meant to guard against outlier families and 84 outlier genes, respectively, which could give rise to spurious statistical significance in the first method. 85 All three methods are in good agreement. See Table 1 legend and methods for additional details. 86

Signal from indels in likely autism genes 87
The counts for DISB in all genes are 10,301 and 10,465 for affected and unaffected, respectively, 88 with a delta of -164. Clearly, these are not significantly different. The rates average to 1.2*10 -8 per highly 89 covered base pair per child, a number in keeping with previous rates for de novo mutation over the 90 whole-genomes [11][12][13][14][15][16]. The counts for DIIN in all genes are 1006 and 945, with a delta of 61, also 91 without statistical significance ( Table 1). The ratio of de novo indels to substitutions, about 1:10, is 92 similar to the ratio we had previously observed over exomes [1]. In sharp contrast to LGD exon targets in affecteds, we observe no consistent signal for DIIN within 108 gene subsets comprised of de novo LGDs exon targets in siblings, or de novo missense or synonymous 109 substitutions in affected or unaffected siblings. These results are consistent with the hypothesis that 110 there will be little enrichment for autism target genes in these sets. We also observe virtually no signal 111 for DISB for any subset. 112

Searching for explanation 113
None of the events were close to the canonical splice sites: the minimum distance to the site for 114 the de novo indels in affected LGD targets of affected children was 83bp and the majority of events 115 were many kilobases inside the introns (see Table 2). We should note here that the 510 affecteds were 116 chosen to have no mutations of the canonical splice sites that would be observable by exome 117 sequencing. Otherwise we would expect an additional delta of ten de novo events hitting the canonical 118 sites. 119 Almost all the observed indels in affected LGD targets are quite small (see Table 2), with most being 120 of length 1 or 2 nucleotides. The proportion of DIINs with size larger than 2bp in the autism target genes 121 in affected children (25/63 = 40%) is larger than the proportion of such events in the unaffected children 122 (12/37 = 32%) but the difference is not significant by Fisher exact test. 123 About 10% percent of intronic space falls within 5'UTRs or 3'UTRs. The rest of the introns are 124 between protein coding exons (CDintrons). Significant difference in the delta for DIINs was only seen in 125 the CDintrons, perhaps because of the small size of the former. In the hope of finding clues to their mechanism of action, we further searched properties of the 128 DIINs. We examined several numerical properties that could reasonably be hypothesized to point to 129 contributory events. These properties were related to the lengths of the affected introns, the proximity 130 of the mutation site to consensus splice sites, the degree of conservation at the mutated site, the 131 likelihood of creation of a new splice site, and the length of the largest open reading frame at that site. 132 The latter might indicate the possibility that the mutation affected an unannotated exon. We associated 133 all de novo intronic events (both indels and substitutions) with each of the above properties, and then 134 asked if the distributions of these properties differed significantly among subsets of the de novo events. 135 These subsets included type (indel or substitution), the affected status of the child, and the target gene 136 class (e.g., 'all genes' and 'affected LGD targets'). None of our efforts were rewarded with a statistically 137 significant signal, but our observations, some positive, are reported in the Supplement. 138

139
Once it was shown that germline copy number variation contributes to autism, exome studies 140 became the method of choice to explore germline contribution in greater detail. From exome 141 sequencing, many excellent candidate autism genes have been identified. On the order of 30% of 142 simplex autism is caused in whole or in part by missense, nonsense, splicing or frameshift mutations and 143 large copy number events. Whole genome studies were delayed in part by expense, in part because we 144 cannot predict which noncoding variants alters gene function. However, now that we have good lists of 145 likely autism genes WGS has been performed, in the hopes that statistical signal would emerge by 146 restricting attention to just those genes. There is, moreover, the hope that we will learn which and how 147 noncoding variants alter gene function. 148 We focused first on intron mutations as there is precedent from previous work that disruption of 149 splicing is frequently a cause for genetic disorders. Although we can infer that the great majority of 150 events within the introns of target genes appear harmless, especially substitutions, we observed a 151 significant excess of de novo indel mutations in affected compared to unaffected siblings. We do not see 152 significant signal for the remainder of the genome, an indication that restricting to likely autism genes 153 matters, and secondarily that the lists of autism genes are good. Autism gene lists further pruned by 154 evidence of negative selective pressure are better still. 155 Many of the observed de novo indels are only a single nucleotide shift (median = 2, maximum = 47). 156 We see an increase in the indel size in affecteds vs unaffected, but it is not significant. Given the small 157 size of indels, we were a little surprised to see no significant signal coming from de novo substitution 158 events in those introns. However, de novo substitutions are ten times more common than indels, and a 159 larger proportion of substitutions are likely to be harmless, so signal from them is more likely to be 160 hidden in noise. Additionally, an indel could potentially cause a substantial alteration in the 161 conformation of RNA or DNA that may propagate for several nucleotides, or perhaps longer, creating a 162 structure that might not be recognized by a binding protein, whereas the effect of a substitution is more 163 likely to be very local. 164 Our entire signal falls within the introns between coding exons. We infer from this that they do 165 indeed disrupt splicing, but we have no direct demonstration of this. All of our attempts to find 166 statistical evidence for known molecular mechanisms yielded nothing of significance. The indels are 167 generally deep within the introns. Not only do they not occur at the consensus splice sites, but they are 168 far clear of them. They do not appear to create new 3' or 5' splice sites, nor disrupt cryptic open reading 169 frames, nor disrupt any of the highly conserved elements within introns identified through comparative 170 genomics . So, although the introns appear to be full of sensitive "targets", we fail to see a predominant 171 explanation, one that yields statistical significance. We feel that how these mutations act is now an open 172 question. Are they interfering with splicing, or targeting control regions? This uncertainty invites future 173 attention as we try to understand the molecular biology of the gene. 174 We are also now in a position to better estimate the overall contribution of germline mutation to 175 autism diagnosis. 26 more intronic indels occur within the 546 LGD target genes (Table 1) in the affected  176 vs unaffected. There are 510 discordant siblings, so we infer that as many as 5% (26/510) have a 177 diagnosis of autism in part due to de novo intronic indels. From the whole-exome studies we have 178 estimated that only about half of the affected LGD targets are true autism genes and that the number of 179 true autism genes is about 500. These enable us to extrapolate as many as ~10% of the SSC children 180 would have autism due to de novo intronic indels in autism genes. The observed delta of 61 of de novo 181 intronic events in all genes supports that extrapolation. It is almost assured that other de novo intronic 182 events like substitutions, microsatellite expansions, and indels of sizes larger than we can presently 183 detect also contribute to the disorder. If such presently cryptic events contributed in an amount about 184 equal to small de novo indels in introns, the total contribution would be about ~20%. This figure is only 185 slightly less than our estimates of the contribution from de novo missense, nonsense, and frame-shifts 186 combined. substitutions that fall in introns separating coding exons. These numbers are tabulated separately for de novo intronic indels (DIIN) and 198 substitutions (DISB), by affected and unaffected children, and by nine subsets of genes. Column 'gene set' lists the nine gene sets, six of which 199 have been defined based on de novo LGD, missense, and synonymous mutations detected in ~5,000 children with autism and ~2,000 unaffected 200 siblings. We analyzed the set of all human genes ('all genes'). 'Affected LGD targets' refers to the genes targeted by de novo LGD mutation in 201 the ~5,000 affected children. We further split these into two halves, based the degree to which each gene tolerates damaging mutation [3]: the 202 more protected LGD targets ('affected LGD targets, protected') and the less protected LGD targets ('affected LGD targets, unprotected'). Column 203 'number of genes' indicates the number of genes in each set. Columns 'number in affected' and 'number in unaffected' show the number of de 204 novo intronic events that fall in the row-specific gene set in affected and unaffected children, respectively, and 'delta' shows the difference 205 between these two numbers. 206 The last three columns show p-values by three different methods for testing if the number of events in affected and unaffected children is 207 significantly different than the expectation of equality. 'chi2 p-value' is the result of a chi-square test comparing the two event numbers in each 208 row to the two event numbers for 'all genes' in DISB. The 'status perm. p-value' and 'gene perm. p-value' columns show the results of two 209 permutation tests. The first based is based on random swapping of the affected and unaffected labels for the discordant sibling pairs. The 210 second is based on the replacement of each gene in the set with a selection from all genes one with a similar cumulative length of introns. 211 However, to control for coverage fluctuation, we actually used the cumulative number of ultra-rare substitutions in parents (see Supplementary  212 Methods for more details). 213

Chi square test 237
This test compares the two de novo intronic event numbers in affected vs. unaffected children for a 238 given target gene class (e.g., 'affected LGD targets') to the two event numbers for 'all genes' in DISB. 239

Status permutation method. 240
It is a permutation test based on random swapping of the number of de novo intronic events for 241 the discordant sibling pairs (affected vs. unaffected) for a given target gene class. 242

Gene permutation method. 243
It measures the significance of observed difference in the number of de novo intronic events in 244 affected and in unaffected children. In this method, we select genes with similar intron lengths as the 245 genes in the analyzed gene set. As a measure of intronic lengths we used the number of ultra-rare 246 substitutions (variants seen only once in the 1020 parents). The total length of the introns in a gene 247 (measured using RefSeq gene model databases) and the number of ultra-rare intronic substitutions are 248 linearly related, but we chose to use the number of intronic substitutions because it accounts for the 249 coverage in the whole-genome data (Table S3 shows the intron lengths and the number of ultra-rare 250 substitutions for each gene). 251 To select random gene set of genes with similar number of ultra-rare intronic substitutions as the 252 analyzed set, we first sorted all the genes based on the number of ultra-rare intronic substitutions. Then 253 for each of the analyzed genes we selected randomly either the previous or the following gene from the 254 sorted list of genes. 255

Searching for explanation 256
We observed that in the affected children there were significantly more de novo intronic indels in 257 the autism targets genes than in the unaffected children. We inferred that the increase is due to the 258 indirect ascertainment of intronic indels that contributed to diagnosis of autism in the affected children 259 and we asked the natural question if the contributory de novo intronic indels could be distinguished 260 from the non-contributory events by some of their properties. We examined 15 numerical properties 261 (see the detailed list and description below) that could reasonably be hypothesized to point to 262 contributory events. We associated all de novo intronic events (both indels and substitutions) with each 263 of We also performed the corresponding tests for the de novo intronic substitutions and the six p-values 274 computed using ranksum tests for all properties are shown in Table S5. More detailed view of the 275 distributions of each of the properties over the various classes of events can be seen in the 276 Supplementary Figures 3-17. 277

Properties 278
Intron length and distance to the nearest splice-site 279 For every de novo intronic variant we identified the shortest intron covering the variant. We 280 recorded the length of the shortest intron ('intron length' property; see Table S4). We also recorded the 281 distance between the de novo event and the splice-sites of the shortest intron that was closest to the 282 observed event ('distance from splice-site' property). We assigned positive number if the closer splice-283 site was the donor splice-site and negative number if the closer splice-site was the acceptor splice-site. 284 We tested if the absolute value of the distance from splice-site was different between the various 285 classes of the de novo mutations ( Figure S3). 286

Open Reading Frame length 287
To test if the de novo intronic events fall in and disrupted cryptic coding exons, we looked for a bias 288 in the size of the largest open reading frame in the direction of transcription (see 'ORF length' property') 289 among the difference lasses of de novo events ( Figure S5). 290

Conservation scores 291
We used two methods for measuring conservation: phastCons [1] and phyloP [2]. The two methods 292 compute a conservation score for each genomic location based on a given phylogenetic three. We 293 downloaded the computed scores from the two methods over three different phylogenetic trees: 294 vertebrate, placental, and primates from UCSC genome browser. (Figures S12-S17). 295

Novel splice site scores 296
To test if the de novo intronic mutations created novel splice sites we developed a donor and an 297 acceptor splice-site sequence scores for a given short sequence (see below for detailed definition of the 298 scores). We computed these two scores for the reference sequence around (5 bases up and  299 downstream) the location where the de novo event occurred ('ref' scores) and separately for the local 300 sequence after the de novo event was introduced ('alt' scores). We also computed the differences 301 between the 'alt' scores and the 'ref'. Thus, every de novo intronic mutation was associated with six 302 splice-site sequence scores: 'ref', 'alt', 'alt-ref' for both donor and acceptor splice-site scores (Tables S2  303 and S3) and we tested each of the six scores for their ability to separate de novo intronic events in 304 affected children in target genes (Supplementary Table 5 and Supplementary Figures 6-11). 305 Definition of the donor and acceptor splice-site sequence scores 306 We defined a position-specific sequence models for donor and acceptor splice sites based on 20bp 307 sequence context (10bp upstream and 10bp downstream of the splice site). We measured the frequency 308 of the four nucleotides at each of the 20 positions independently using the ~200,000 annotated donor 309 and acceptor sites in the RefSeq database: and , where is for donor, is for acceptor, p is 310 index for the position and n is A, C, G, or T. We also measured the frequency of the random intronic 311 nucleotides, ℛ and defined the position specific donor and acceptor splice-site scores as log-likelihood 312 ratios: 313 DS(context) = log  Table S1 and S2: Lists of de novo intronic indels (S1) and substitutions (S2) 327 The two tables S1 (Supp-T1-DN-indel.xlsx data file) and S2 (Supp-T2-DN-sub.xlsx data file) list all 328 analyzed de novo intronic events, 2,231 indels and 23,715 substitutions, respectively. For each event the 329 tables lists: the 'family' and the child ('in child) where the de novo events are found (prb -is the 330 proband or affected child, sib is for the unaffected sibling, M for male and F for female; some events are 331 shared between the two siblings); the detail description of the variant using VCF conventions ('variant' 332 with <chr>:<pos>:<reference allele>:<alternative allele> format, the location <chr>:<pos> in hg19 333 coordinates) and the 'variant size' (0 for substitutions, negative number for deletion and positive 334 number for insertions); the 'gene' affected by the variant and the 'variant effect' (CDintron for coding  335 introns, 5Uintrons or 3Uintrons). The table also shows if the affected gene is a member of one of the 8 336 analyzed gene classes (the purple columns) and the 15 analyzed properties of de novo intronic events 337 (blue columns). See Supplementary methods for a description of those properties. 338 Table S3: Gene Table  339 This table is in the Supp-T3-genes.xlsx data file and shows information about the 23,953 annotated 340 human genes. For each gene, the table lists the 'gene' name, gene protection information as reported in 341 [3] (red columns); lengths of the intronic space for each of the three classes of introns computed from 342 the RefSeq gene model database (blue columns); the number of ultra-rare (UR) events by type of the 343 events (sub for substitution, del for deletion, ins for insertion) and by the type of the affected intron 344 (CDintron, 5Uintron, or 3Uintron) (yellow columns); the number of de novo intronic events by the 345 affected status of the child, the type of de novo event and by the type of the affected intron (green 346 columns); and the membership of the gene in each of the 8 genes sub-classes defined by the affected 347 'status' of the child carrying the de novo events (affected or unaffected), by the effect of the de novo 348 event, and based on the degree of protection of the affected gene (purple columns). 349    Each of the Figures S3 to S17 corresponds to a property of de novo intronic events (see Table S5  See the legend of Figure S3. 483