A systematic screen of breast cancer patients’ exomes for retrotransposon insertions reveals disease associated genes

Background Retrotransposons are genetic elements that jump within the genome via an RNA intermediate. Although they had a strong impact on human genome evolution, only a very tiny fraction of them can be reactivated nowadays, most often with neutral or detrimental consequences. The pathological outcomes associated with such genetic alterations are poorly investigated in the clinic, merely due to their difficult detection. Results We developed a strategy to detect rare retrotransposon mediated insertions in Whole Exome Sequencing data from 65 familial breast cancer patients. When restricting our search to high confidence retrotransposition events occurring in less than 10% of the samples, we identified only ten different Alu elements, two L1 elements, one SVA and two processed pseudogenes. Only two of these insertions occurred within protein coding sequences and interestingly, several of the targeted genes have been previously linked to cancer, in three cases even to increased breast cancer risk (GHR, DMBT1 and NEK10). When investigating the molecular consequences of four Alu insertions at the mRNA level, we found that the element present in the 3’UTR of GHR repressed expression of the corresponding allele. oMreover, the analysis of a near exonic Alu insertion in PTPN14 (a mediator of P53 tumor suppressor activity) revealed that this gene was imprinted and that the presence of an intronic Alu element can lead to loss of imprinting. Conclusions Our data underline the relevance of incorporating the search for uncommon retrotransposition events in Next Generation Sequencing pipelines when analyzing patients with a suspected genetic disease.


IN SILICO screen for retrotransposition mediated insertions 93
Because of their shared mechanisms of transposition using an RNA molecule as intermediate, 94 retrotransposons all bear a long polyA tail at their 3' extremity. Also, their specific mechanism 95 for insertion into the target DNA results in the duplication of a short sequence at the position 96 where the retrotransposon has integrated. This duplicated sequence (called Target Site 97 Duplication or TSD) is usually between 4 and 20 nucleotides long and flanks the retrotransposon 98 specific sequences. We exploited those two characteristics to identify retrotransposon insertions 99 not present in the human reference genome (hg19) using WES data generated with the SeqCap 100 EZ Exome v3.0 kit from Roche. Paired-end sequencing usually generates read pairs that 101 perfectly map in close proximity on a reference genome. However, the presence of a non-102 reference retrotransposon insertion will result in read pairs that do not map closely (Figure1). We  High confident retrotransposition mediated insertions 115 When performing a manual selection (using the Integrative Genomics Viewer: IGV [19]) for the 116 presence of a TSD on the 32 polyA containing insertions we had identified, we ended up with 18 117 high confidence candidates retrotransposon insertions (Table 1). The further characterization of 118 these retrotransposition mediated insertions (12 Alu, 2 L1, 1 SVA and 3 pseudogenes, see Table   119 1) was made possible by the presence of a stretch of nucleotides (maximum ~50bp long) in a 120 fraction of the aligned read pairs that do not match with the reference sequence, but  Table 1) which might be surprising since WES probes are expected to 127 be specifically designed towards exonic sequences. Indeed, according to the kit provider (Roche) 128 only protein coding parts of the transcripts are targeted, but for exons that are smaller than 100 129 bp, the target region is extended to 100 bp. Nevertheless, we expect that the large majority of 130 retrotransposition mediated insertions occurring in the well covered protein coding regions will 131 be picked up with our screen, their low incidence being in agreement with the literature [5,20,132 21]. Conversely, we are convinced that a significant fraction of the polymorphic retrotransposons 133 located in the vicinity of protein coding regions (regions often poorly or not covered at all by 134 WES) is missed with our WES based approach, although the large majority of identified 135 insertions (15/17, positive control excluded) are located in intronic, 3'UTR, promoter or even 136 intergenic sequences. As a sufficient coverage at sequencing is essential to allow the 137 identification of polymorphic retrotransposons in the non-coding regions, we wanted to estimate 138 the number of false negative carriers of such highly confident retrotransposition mediated 139 insertions among the 65 samples we investigated. Therefore, all samples were rescreened for the 140 presence of the 18 IN SILICO identified insertion types using IGV (see Table 1). All insertions 141 identified with the IN SILICO pipeline could be confirmed with IGV analysis (no false 142 positives). As expected, IGV inspection revealed higher rates of carriers for 9 insertions (Table   143 1). In two cases, insertion carriership was even observed in 50% of the samples, with IGV and 144 IN SILICO data discrepancies being well explainable by the low coverage at sequencing. Of 145 note, about half of the high confident retrotransposon insertions we identified are reported in the 146 database of retrotransposon insertion polymorphisms in human (dbRIP, see Table 1).

148
Retrotransposon insertions in non-coding regions can affect allelic expression 149 To investigate the potential consequences of Alu insertions located in non-coding gene regions at 150 the molecular level[22], we collected blood samples from available carriers (and non-carriers for 151 control) and extracted the corresponding RNA material. Since allele specific expression levels 152 can be monitored by Sanger sequencing only when the Alu carrier is also heterozygous for a 153 coding SNP in the same gene, the number of suitable insAlu carriers among the 65 investigated 154 patients was not always sufficient and additional familial BC patients were genotyped using 155 mutation specific hemi-nested PCRs (Table 2). Sufficient blood samples could be collected to 156 initiate the analysis of 5 Alu targeted genes (GHR, GSTA5, NEK10, PTPN14 and UPF2).

157
Sequence analyses of the cDNA regions flanking the allele discriminating SNPs revealed that the 158 intronic Alu insertions in UPF2 and NEK10 do not result in an obvious decrease or increase of 159 the mRNA levels generated from the retrotransposon targeted allele (Figure 3). In sharp contrast, 160 the Alu insertion in the 3'UTR of GHR clearly resulted in allele silencing ( Figure 4A) suggesting 161 that this Alu sequence may contain sites for microRNA directed degradation, a mechanism 162 previously proposed to explain the occurrence of evolutionary stabilized Alu derived microRNA 163 binding sites in the 3'UTRs of specific gene [23]. Conversely, the two available carriers of the 164 intronic Alu insertion in PTPN14 seem to express both alleles equally well but among the three 165 non-carriers, one expresses virtually only one allele while the two others show a reduced 166 expression of this same allele ( Figure 4B). Partial or near absolute mono-allelic expression of 167 PTPN14 is in agreement with a study listing this gene as a high-confidence imprinted human 168 gene candidate with maternal expression [24]. On the other hand, the apparent loss of imprinting 169 (LOI) observed in the Alu carriers suggests that the integration of one single Alu element into the 170 gene sequence of the imprinted allele would be sufficient to fully reactivate this allele, which 171 matches with earlier observations reporting a sharp inflection in SINE content at transitions from 172 imprinted to non-imprinted genomic regions [25]. Unfortunately, differential expression of the 173 GSTA5 Alu containing allele (insertion occurred in the promoter region) could not be 174 investigated due to the very low expression level of GSTA5 in leucocytes compared to GSTA1 175 and GSTA2 (see GeneCards.org), three homologues that are hard to discriminate when using 176 PCR based assays (only GSTA1 and/or GSTA2 sequences could be detected after Sanger 177 sequencing even by using gene discriminating primers for PCR, results not shown). In addition to the 18 high confident retrotransposition mediated insertions for which we could 181 identify a TSD (described above), our IN SILICO pipeline also identified 14 candidate insertion 182 sites for which a TSD could not be deduced using IGV (see Additional file 1 and Table 3). A 183 summary of each of the targeted gene's function is presented in Additional file 2 (see also Table   184 4 for PubMed hits). A lower coverage at sequencing most probably explains the problematic 185 discovery and characterization of these polyA containing insertions. However, for eight of them 186 we found parts of retrotransposon sequences (7x Alu, 1x L1) by analyzing the mates of the 187 discordantly mapped mate pairs (Table 3). In addition, 7 out of these 14 expected  (Table   192 3). The high rate of confirmed retrotransposon mediated insertions observed among the 193 candidate polyA containing insertions we identified (28 out of 32) is a good indicator for the 194 high specificity of the followed screening strategy.  Table 2) detected this variant in only 205 one out of 710 familial BC patients. Since this patient has North African roots, we cannot 206 exclude that the mutation is more prevalent in that region. ZNF442 has been poorly investigated 207 till now (see outcomes of PubMed searches for the different targeted genes in Table 5) although 208 the gene has been reported as a good candidate driver gene for the neoplastic process of breast 209 and colorectal cancer [27]. The second identified retrotransposition mediated event that destroys 210 protein integrity is the UQCR10 transcript inserting into the C1orf194 coding sequence (creating 211 a UQCR10 processed pseudogene). This polymorphic insertion has been previously observed[7,  (Table 1). Little is known about these two genes, but none 217 has been linked to breast cancer yet. Interestingly, two seemingly dominant heterozygous  The large majority of high confident retrotransposon insertions we could identify were located in 225 non-coding regions (15 out of 17). However, such regions are not the primary targets when 226 performing WES and consequently their coverage at sequencing will often be (very) low, or they 227 will not be covered at all. Therefore, we believe much more intronic (near exonic) and UTR  Our study indicates that raw WES data obtained from clinical samples (whose availability is 277 exponentially growing) hide a manageable number of retrotransposition mediated polymorphic 278 mutations that can be dug up when using appropriate IN SILICO tools. A significant fraction of 279 these mutations will affect gene function, even when located outside protein coding regions, and 280 consequently may be (one of) the pathogenic factor(s) responsible for a patient's disease. Our 281 investigation was restricted to 65 non BRCA1/2 familial BC patients, but much more patients   (57) belongs to families with at least two first 320 degree blood relatives with BC. The FastQ files generated during this previous study were reused 321 for the present study. The genomic DNA from the positive control was resubmitted for WES 322 analysis using the SureSelect Human All Exon V6 kit from Agilent (performed by BGI 323 Genomics Co, China) to compare both WES approaches. In addition, blood samples obtained 324 from 710 probands from independent non BRCA1/2 BC families (including the 65 probands 325 used for the WES analysis) were used for a PCR based genomic screen to identify additional 326 carriers of Alu element insertions in case insufficient samples were available for RNA analysis.

376
To identify and in a second step discard the clusters (genomic regions) picked up in a large 377 fraction of individuals (e.g. false positives resulting from technical artefacts, or highly recurrent 378 polymorphic insertions not present in the reference genome), the Excel outputs obtained from 379 each BC patient were pooled (keeping patient ID tracked) and clusters were again generated.

380
Candidate genomic insertion sites detected in more than 10% of the samples (corresponding to 381 clusters with more than 6 units for the BC cohort) were identified and subtracted from the patient 382 specific output data. After this filtering step, on average 9 clusters with 5 or more reads are 383 obtained for each individual. Visual inspection using the IGV software revealed that the majority 384 of the remaining candidate insertion sites are generated by the presence of a polyA stretch in the 385 reference sequence that resulted in DNA polymerase slippage during the NGS process.

386
Consequently, these candidate genomic insertion sites were manually tracked and listed, and  Remove all lines not referring to an autosome or to chX (all samples are from females).

421
Sort (A>Z) according to sign in last (6 th ) column.

422
Select/copy all filled cells and paste in Sheet 1 ("calculate cluster") of the preformatted Excel file 423 for cluster calculation (Additional file 5).

424
In this Sheet 1, select for "TRUE" in column I (OR) and select/copy all filled cells in columns A 425 to J.

426
Paste the data in Sheet 2 ("simplify cluster"), and in column A, order by "color". Delete the 427 entire rows corresponding to the cells with pink background (this will remove duplicates).

428
In column O (chr), deselect "blank" and select/copy all informative cells in columns O to U.

429
Paste these data in Sheet 3 ("all clusters"), and deselect "2" in column G (to keep clusters with 430 minimum 3 reads).

431
Select/copy all filled cells and paste in Sheet 4 (annotation list, use the first column for sample 432 ID). The positive control sample will generate 641 clusters (see Additional file 6). to identify and discard the less relevant clusters from the patient specific cluster lists (as 441 described above). This list is provided with the only intension to familiarize with the described 442 procedure. The content of this list can depend on the type and number of samples used for its 443 generation (type of disease, ethnicity…) but also on the followed wet and dry bench procedures.

445
Select/copy all filled cells of Sheet 4 (annotation list) obtained from a patient specific cluster 446 calculation (see above).

447
Open Sheet 1 (comm65>=6 + sample) of the preformatted Excel file for cluster filtering 448 (Additional file 7) and paste the data (in column A) down the preexisting list.

449
Apply Sort (A>Z) on column C (start) and thereafter on column B (chr).

453
Select/copy/paste clusters with five or more reads to the "results" Sheet 3. within or close to exonic gene sequences or promoter sequences (7 in total, see Table 1)

487
To increase the number of samples that could be enrolled in the allele expression study, the 488 patients whose genomic DNA revealed an interesting Alu insertion upon mutation specific PCR 489 analysis were genotyped for the corresponding polymorphic site. PCR primers are presented in 490        All Exon V6 kit (exome capture kit) from Agilent. Group alignment is by chromosome of mate.

760
Note that this exome capture kit generates much less discordantly mapped read pairs than the kit 761 from Roche (2 versus 22, compare with Figure 2). Accordingly, the required number (five) of 762 discordantly mapped read pairs containing a long polyA track will not be reached and the 763 insertion will not be revealed with our detection strategy. Note that all other characteristics 764 observed for a retrotransposon insertion when using the exome capture kit from Roche (see 765 Figure 2) are also observed when using the kit from Agilent.   The primers used for mutation specific PCR amplification are presented in Table 6 787 788 789 790 791 792 Note that for 8 of the 14 polyA containing insertion sites, a partial retrotransposon sequence 794 could be traced in the mate of at least one discordantly mapped read upon IGV inspection (  This table shows the number of hits obtained in PubMed when using as search term a specific 801 gene or its synonym (= "all gene names") in combination with "cancer" or "breast cancer" and  This table shows the number of hits obtained in PubMed when using as search term a specific 808 gene or its synonym (= "all gene names") in combination with "cancer" or "breast cancer" and 809 when restricting the search to specific domains (title or abstract). a gene names generating