Extensive gene duplication in Arabidopsis revealed by pseudo-heterozygosity

Background It is apparent that genomes harbor massive amounts of structural variation, and that this variation has largely gone undetected for technical reasons. In addition to being inherently interesting, structural variation can cause artifacts when short-read sequencing data are mapped to a reference genome. In particular, spurious SNPs (that do not show Mendelian segregation) may result from mapping of reads to duplicated regions. Calling SNP using the raw reads of the 1001 Arabidopsis Genomes Project we identified 3.3 million heterozygous SNPs (44% of total). Given that Arabidopsis thaliana (A. thaliana) is highly selfing, we hypothesized that these SNPs reflected cryptic copy number variation, and investigated them further. Results The heterozygosity we observed consisted of particular SNPs being heterozygous across individuals in a manner that strongly suggests it reflects shared segregating duplications rather than random tracts of residual heterozygosity due to occasional outcrossing. Focusing on such pseudo-heterozygosity in annotated genes, we used GWAS to map the position of the duplicates, identifying 2500 putatively duplicated genes. The results were validated using de novo genome assemblies from six lines. Specific examples included an annotated gene and nearby transposon that, in fact, transpose together. Finally, we use existing bisulfite sequencing data to demonstrate that cryptic structural variation can produce highly inaccurate estimates of DNA methylation polymorphism. Conclusions Our study confirms that most heterozygous SNPs calls in A. thaliana are artifacts, and suggest that great caution is needed when analyzing SNP data from short-read sequencing. The finding that 10% of annotated genes exhibit copy-number variation, and the realization that neither gene- nor transposon-annotation necessarily tells us what is actually mobile in the genome suggest that future analyses based on independently assembled genomes will be very informative.

used these data to identify large numbers of structural variants using split reads (Göktay, Fulgione, and Hancock 2020;Zmienko et al. 2020;D.-X. Liu et al. 2021). Here we approach this from a different angle. Our starting point is the startling observation that, when calling SNPs in the 1001 Genomes data set, we identified 3.3 million (44% of total) putatively heterozygous SNPs. In a highly selfing organism, this is obviously highly implausible, and these SNPs were flagged as spurious: presumably products of cryptic CNV, which can generate "pseudo-SNPs" (Ranade et al. 2001;Hurles 2002) when sequencing reads from non-identical duplicates are (mis-)mapped to a reference genome that does not contain the duplication. Note that allelic SNP differences are expected to exist ab initio in the population, leading to instant pseudoheterozygosity as soon as the duplicated copy recombines away from its template. In this paper we return to these putative pseudo-SNPs and show that they are indeed largely due to duplications, the position of which can be precisely mapped using GWAS. Our approach is broadly applicable, and we demonstrate that it can reveal interesting biology.

Results
Massive pseudo-heterozygosity in the 1001 Genomes data Given that A. thaliana is highly selfing, a large fraction (44%) of heterozygous SNPs is inherently implausible. Two other lines of evidence support the conclusion that they are spurious. First, genuine residual heterozygosity would appear as large genomic tracts of heterozygosity in individuals with recent outcrossing in their ancestry. Being simply a random product of recombination and Mendelian segregation, there is no reason two individuals would share tracts unless they are very closely related. The observed pattern is completely the opposite. While a small number of individuals do show signs of recent outcrossing, this is quite rare (as expected given the low rate of outcrossing in this species, and the fact that the common (shared) transposon insertions near centromere (Quadrana et al. 2016). As we shall see below, it is likely that transposons play an important role in generating cryptic duplications leading to pseudo-heterozygosity (although we emphasize again that the heterozygous SNPs were called taking known repetitive sequences into account).
Despite the evidence for selection against these putative duplications, we found 2570 genes containing 26647 pseudo-SNPs segregating at 5% or more in the population (Supplemental Figure 3). Gene-ontology analysis of these genes reveals an enrichment for biological processes involved in response to UV-B, bacteria or fungi (Supplemental Figure 4). In the following sections, we investigate these putatively duplicated genes further.
Mapping common duplications using genome-wide association If heterozygosity is caused by the presence of cryptic duplications in non-reference genomes, it should be possible to map the latter using GWAS with heterozygosity as a "phenotype" (Imprialou et al 2017). We did this for each of the 26647 SNPs exhibiting shared heterozygosity within genes (Supplemental Figure 3).
Of the 2570 genes that showed evidence of duplication, 2511 contained at least one major association (using significance threshold of p < 10 -20 ; see Methods). For 708 genes, the association was more than 50 kb away from the pseudo-SNP used to define the phenotype, and for 175 it was within 50 kb. We will refer to these as trans-and cis-associations, respectively.
The majority of genes, 1628, had both cis-and trans-associations (Figure 2). phenotype, heterozygosity at the position of interest is coded as 1 (present) or 0 (absent). As genotype, the SNPs matrix of the 1001 genome dataset was used (with heterozygous SNPs filtered out). Color gradients represent the strength of LD around the two loci. In this example the reference genome does not contain locus2. (B) GWAS results for three different genes with evidence of duplication, for illustration.
The red lines indicate the position of the pseudo-SNP used for each gene/GWAS and the thick grey lines indicate the centromeres. The top plot shows a trans-association, the bottom a cis-association, and the middle shows a case with both (cis plus two trans). (C) Summary of all 26647 GWAS results.
To validate these results, we assembled 6 non-reference genomes de novo using long-read PacBio sequencing. The GWAS results provide predicted locations of the duplications (the putative causes of pseudo-heterozygosity). We identified the homologous region of each nonreference genome, then used BLAST to search for evidence of duplication. For 84% of the 403 genes predicted to have a duplication present in at least one of the six non-reference genomes, evidence of a duplication was found; for 60%, the occurrence perfectly matched the pattern of 6 120 heterozygosity across the six genomes. For the remaining 16%, no evidence of a duplication was found, which could be due to the stringent criteria we used to search for evidence of duplication (see Methods). The distribution of fragment sizes detected suggests that we capture a mixture of duplicated gene fragments and full genes (Supplemental Figure 5).

Rare duplications
The GWAS approach has no power to detect rare duplications, which is why we restricted the analysis above to pseudo-heterozygous SNPs seen in five or more individuals. Yet most are rarer: 40% are seen only in a single individual, and 16% are seen in two. As it turns out, many of these appear to be associated with more common duplications. Restricting ourselves to genes only, 11.4% of the singleton pseudo-heterozygous SNPs are found in the 2570 genes already identified using common duplications, a significant excess (p = 2.5e-109). For doubletons, the percentage is 11.1% (p = 1.9e-139). Whether they are caused by the same duplications, or reflect additional ones present at lower frequency is difficult to say. To confirm duplications more directly, we took the reads generating the singleton and doubleton pseudo-heterozygotes, and compared the result of mapping them to the reference genome, and to the appropriate genome (derived from the same inbred line). One predicted consequence of the reads mapping at different locations is that mapping coverage around the pseudo-SNPs will be decreased when mapping to the newly assembled PacBio genomes rather than the reference genome. As expected, a high proportion of the SNPs tested have lower coverage when mapping to the PacBio genomes (Supplemental Figure 6-7). In addition to a decrease in coverage, we were also able to detect reads mapping to multiple locations in the right genomes, as well as the corresponding disappearance of the pseudo-SNPs. For example, 41.5% of the doubletons tag regions that map to more regions in the PacBio genomes than in the reference genome (Supplemental Figure 6-8).

Local duplications
If duplications arise via tandem duplications, they will not give rise to pseudo-SNPs until the copies have diverged via mutations. This is in contrast to unlinked copies, which will lead to pseudo-SNPs due to existing allelic variation as soon as recombination has separated copy from original. We should thus expect the approach taken here to be biased against detecting local duplications. Nonetheless, GWAS revealed 175 genes with evidence only for a cis duplication. 28 of these were predicted to be present in at least one of the six new genomes, and 14 could be confirmed to have local variation of copy number relative to the reference.
( Figure 3A). The local structure of the duplications can be complex. An example is provided by the gene AT1G31390, annotated as a member of MATH/TRAF-domain genes, and which appears to be present in 4 tandem copies in the reference genome, but which is highly variable between accessions, with one of our accessions carrying at least 6 copies ( Figure 3B). However, there are no copies elsewhere in any of the new genomes for this gene (Supplemental Figure 9).

Transposon-driven duplications
Transposons are thought to play a major role in gene duplications, capturing and moving genes or gene fragments around the genome (Woodhouse, Pedersen, and Freeling 2010;Lisch 2013).
While confirming the trans duplications in the PacBio genomes, we found a beautiful example of this process. The gene AT1G20400 (annotated, based on sequence similarity, to encode a myosin heavy chain-like protein) was predicted to have multiple trans-duplications. The 944 bp coding region contains 125 putatively heterozygous SNPs with striking haplotype structure characteristic of structural variation ( Figure 4C). We were able to identify the duplication have only one copy of the gene, just like the reference genome, but one has 3 copies and one has 4 copies. However, none of the 6 new genomes has a copy in the same place as in the reference genome (Supplemental Figure 10).
In the reference genome, AT1G20400 is closely linked to AT1G20390, which is annotated as a Gypsy element. This element also contains many pseudo-SNPs, and GWAS revealed duplication sites overlapping those for AT1G20400 ( Figure 4B). This suggested that the putative gene and putative Gypsy element transpose together, i.e. that both are misannotated, and that the whole construct is effectively a large transposable element. Further analysis of the PacBio genomes confirmed that AT1G20400 and AT1G20390 were always found together, and we were also able to find conserved Long Terminal Repeat sequences flanking the whole construct, as would be expected for a retrotransposon (Supplemental Figure 11-12). We did not find any evidence for expression of AT1G20400 in RNAseq from seedlings in any of the accessions. Available bisulfite sequencing data (Kawakatsu et al. 2016) showed that the whole region is heavily methylated, as expected for a transposon (Figure 4). We tried mapping the bisulfite reads to the appropriate genome for the respective accesions, but the coverage was too low and noisy to observe a difference in methylation between the multiple insertions (Supplemental Figure 13). Having located precise insertions in the six new genomes, we attempted to find them using short-read data in the 1001 Genomes dataset. Except for one insertion that was shared by 60% of accessions, the rest were found in less than 20%, suggesting that this new element has no fixed insertions in the genome -including the insertion found in the TAIR10 reference genome, which was only found in 17.4 % of the accessions (Supplemental Figure 14). We also looked for the element in the genomes of A. lyrata (two different genomes), A. suecica (a tetraploid containing an A. thaliana and an A. arenosa subgenome; see Burns et al. 2021), and Capsella rubella (Slotte et al. 2013). The gene and the Gypsy element were only found together in A. thaliana (including the A. thaliana sub-genome of the allopolyploid A. suecica). The Gypsy element alone is present in the other Arabidopis species, and the gene alone is present in A.
lyrata, but only in one of two genomes. In Capsella rubella, neither the transposon nor the gene could be detected (Supplemental Figure 15). Thus the transposon and gene appears to be specific to the genus Arabidopsis, while their co-transposition is specific to A. thaliana, suggesting that the new transposable element evolved since divergence of A. thaliana from the other member of the genus.

Spurious methylation polymorphism
Just like cryptic duplications can lead to spurious genetic polymorphisms, they can lead to spurious cytosine methylation polymorphisms. Indeed, given the well-established connection between gene duplication and gene silencing (e.g., Melquist, Luff, and Bender 1999), they may be more likely to do so. To investigate this, we re-examined the methylation status of genes previously reported by the 1001 Genomes Project (Kawakatsu et al. 2016) as having complex patterns of methylation involving both CG and CHG methylation. In our six sequenced accessions, we found 19530 genes that had been reported as having CG methylation (in at least one accession) and 2556 genes that had been reported as having CHG methylation (in at least one accession). 2473 genes were part of both sets. Out of these, 619, or 24%, had been detected as duplicated in the analyses presented above (a massive enrichment compared to the genome-wide fraction of roughly 10%). To understand these patterns better, we mapped the original bisulfite data to the appropriate genome as well as to the reference genome. In any given accession, roughly 7% of the 2473 genes could not be compared because the largely concordant (roughly 2.5% disagreement), whereas for genes with copy number variation, roughly one third of calls were wrong. As an illustration for why this occurs, consider the methylation status of AT1G30140 ( Figure   5). When mapped to the reference genome, 5 out of 6 accessions were found to be both CG and CHG methylated, with accession 6021 having no methylation. When mapped to the appropriate genome, we see that this pattern can be quite misleading. In accession 1254, for example, we found three apparent copies of the gene, only two of which are methylated, neither of which is the copy corresponding to the copy present in the reference genome. In accession 5856, the copy corresponding to the reference genome cannot be identified, but a copy on a different chromosome is identified, and it is methylated. In both cases, mapping to the reference genome leads to incorrect methylation status for AT1G30140. 14 264 265 Discussion A duplication can lead to pseudo-SNPs when SNPs are identified by mapping short reads to a reference genome that does not contain the duplication. Typically pseudo-SNPs have to be identified using non-Mendelian segregation patterns in families or crosses, but in inbred lines they can be identified solely by their presence. The overwhelming majority of the 3.3 million heterozygous SNPs (44% of total) identified by our SNP-calling of the 1001 Genomes Project (2016) data are likely to be pseudo-SNPs. Assuming this, we used (pseudo-)heterozygosity as a "phenotype", and tried to map its cause, i.e. the duplication, using a simple but powerful GWAS approach. Focusing on annotated genes, we find that over 2500 (roughly 10% of total) harbor pseudo-SNPs and show evidence of duplication. Using 6 new long-read assemblies, we were able to confirm 60% of these duplications using conservative criteria (see Methods). Most of the remaining duplications are located in pericentromeric regions where SNP-calling has lower quality, and which are difficult to assemble even with long-read (Supplemental Figure 16).
These numbers nearly certainly underestimate the true extent of duplication, which has been known to be common in A. thaliana for over a decade Gan et al. 2011;Schneeberger et al. 2011). While unlinked trans-duplications are fairly likely to give rise to pseudo-SNPs, local cis-duplications will only do so once sufficient time has passed for substantial sequence divergence to occur, or if they arise via non-homologous recombination in a heterozygous individual (which is less likely in A. thaliana). As for the GWAS approach, it lacks statistical power to detect rare duplications, and can be misled by allelic heterogeneity (due to multiple independent duplications). Finally, duplications are just a subset of structural variants, and it is therefore not surprising that other short-read approaches to detect such variants have identified many more using the 1001 Genomes data ( you think. One possible consequence of this is incorrect methylation polymorphism calls, as we demonstrate above, but essentially any methodology that relies on mapping sequencing data to a reference genome could be affected (e.g. RNA-seq).
Time (and more independently assembled genomes) will tell how significant this problem is, but the potential for artifactual results is clearly substantial, and likely depends on the amount of recent transposon activity (Morgante et al. 2005). It is also important to realize that the artefactual nature of the 44% heterozygous SNPs was only apparent because we are working with inbred lines. Other researchers working on inbred lines have reached similar conclusions, and used various methods to eliminate them e.g. Zea (Chia et al. 2012;Lu et al. 2015;Bukowski et al. 2018) and Brachypodium (Stritt et al. 2021). In human genetics, SNP-calling relies heavily on family trios, but in outcrossing organisms where this is not possible, there is great cause for concern. The increasing ease and ability to sequence more and more complex genomes, such as projects associated with the 1001G+ and Tree of Life, will allow population analyses to avoid the use of a single reference genome and reveal new mechanisms of gene lyrata reference genome (Hu et al. 2011) and the A. arenosa subgenome of A. suecica (Burns et al. 2021) as a guide followed by manual inspection of regions. The assembly size for 11B02 was 213Mb and 11B21 was 202Mb. Genome size was estimated using findGSE (Sun et al. 2018) with a resulting estimated genome size of ~256Mb for 11B02 and ~237Mb for 11B21.

SNP filtering
From the raw VCF files SNP positions containing heterozygous labels were extracted using GATK VariantFiltration. From the 3.3 millions of heterozygous SNPs extracted, two filtering steps were then applied. Only SNPs with a frequency of at least 5% of the population and located in TAIR10-annotated coding regions were kept. After those filtering steps a core set of 26647 SNPs were retained for further analysis (see Supplemental Figure 3). Gene names and features containing those pseudo-SNPs were extracted from the TAIR10 annotation.

GWAS
The presence and absence of pseudo-heterozygosity (coded as 1 and 0 respectively) was used as a phenotype to run GWAS. As a genotype the matrix published by the 1001 Genomes Consortium containing 10 million SNPs was been used (1001 Genomes Consortium 2016). To run all the GWAS, the pygwas package (https://github.com/timeu/PyGWAS) with the amm (accelerated mixed model) option was used. The raw output containing all SNPs was filtered, removing all SNPs with a minor allele frequency below 0.05 and/or a -log10(p-value) below 4.
For each GWAS performed, the p-value as well as the position was used to call the peaks using the Fourier transform function in R (filterFFT), combined with the peak detection function (peakDetection), from the package NucleR 3.13, to automatically retrieve the position of each peak across the genome. From each peak, the highest SNPs within a region of +/-10kb around the peak center were used (see the example in Supplemental Figure 17). Using all 26647 sequence identity as well as 70% of the initial sequence length was used. The presence of a match within 20kb of the predicted peak position was interpreted as confirmation.

Gene ontology
Out of the 2570 genes detected to be duplicated, 2396 have a gene ontology annotation.
PLAZA.4 ( Van Bel et al. 2018) was used to perform a gene enrichment analysis using the full genome as background. Data were then retrieved and plotted using R.

Coverage and Methylation analysis
Bisulfite reads for the accessions were taken from 1001 methylomes (Kawakatsu et al. 2016).
They methylation levels were calculated either on the gene-body or on 200bp windows using custom python scripts following guidelines from Schultz et al. (2012). Weighted methylation levels were used, i.e. if there are three cytosines with a depth of t1, t2 and t3 and number of methylated reads are c1, c2 and c3, the methylation level was calculated as (c1+c2+c3)/(t1+t2+t3). We called a gene "differentially methylated" if the difference in weighted methylation level was more than 0.05 for CG and 0.03 for CHG.
The sequencing coverage for each accession was extracted using the function bamCoverage (windows size of 50bp) from the program DeepTools (Ramírez et al. 2016). The Bigwig files generated were then processed in R using the package rtracklayer. No correlation between the mean sequencing coverage and the number of pseudo-SNPs detected was observed (Supplemental Figure 18).

Multiple sequence alignment
For each insertion of the AT1G20390-AT1G20400 (Transposon+gene) fragment, a fasta file including 2kb on each side of the fragment was extracted from each genome, using the getfasta function from bedtools (Quinlan and Hall 2010

Structural variation analysis
To control the structure of the region around duplicated genes, the sequence from 3 genes upstream and downstream of the gene of interest was extracted. Each sequence was then BLAST to each of the genomes and the position of each BLAST result was retrieved. NCBI BLAST (Altschul et al. 1990) was used with a percentage of identity threshold of 70% and all other parameters as default. From each blast results fragments with at least 50% of the input sequence length have been selected and plotted using R.

Frequency of the insertions in the 1001 Genomes dataset
The same sequences used for the multiple alignment were used to confirm presence or absence of each insertion in the 1001 Genomes dataset. We used each of those sequences as reference to map short reads using minimap 2 (H. Li 2018). For each insertion, only paired-end reads having both members of the pair mapping to the region were retained. An insertion was considered present in an accession if at least 3 pairs of reads spanned the insertion border (see Supplemental Figure 11).

Multiple species comparison
We used the Capsella rubella and A.arenosa genomes (Slotte et al. 2013;Burns et al. 2021) to search for the new Transposon+gene element, just like in the A. thaliana genomes. For A.
arenosa we used the subgenome of A. suecica. We located the transposon+gene fragments, extracted from the TAIR10 annotation, using NCBI BLAST as above. For A.lyrata two newly assembled genomes were assembled using MinION sequencing.
Additional file 2-8.csv Methylation value per gene of all accessions mapped to the corresponding genome.
CG and CHG weighted average per genes of the 6 accessions analyzed. Row names correspond to the gene ID. (the "_" corresponds to the multiple copies detected). The column name to the CG and CHG for each accession.