Variant calling using NGS and sequence capture data for population and evolutionary genomic inferences in Norway Spruce (Picea abies)

Advances in next-generation sequencing methods and the development of new statistical and computational methods have opened up possibilities made for large-scale, high quality genotyping in most organisms. Conifer genomes are large and are known to contain a high fraction of repetitive elements and this complex genome structure has bearings for approaches that aim to use next-generation sequencing methods for genotyping. In this chapter we provide a detailed description of a workflow for variant calling using next-generation sequencing in Norway spruce (Picea abies). The workflow that starts with raw sequencing reads and proceeds through read mapping to variant calling and variant filtering. We illustrate the pipeline using data derived from both whole-genome resequencing data and reduced-representation sequencing. We highlight possible problems and pitfalls of using next-generation sequencing data for genotyping stemming from the complex genome structure of conifers and how those issues can be mitigated or eliminated.


Introduction
Conifers were one of the last plant groups lacking genome assemblies, but recently several draft genomes have become available for a number of conifers such as Norway spruce ( Picea abies , Nystedt et al. 2013), Loblolly pine ( Pinus taeda , Zimin et al 2014Zimin et al , 2017, Sugar pine (Stevens et al 2016) and Douglas Fir . This has opened up new possibilities to assess genome-wide levels of genetic diversity in conifers. Earlier studies of genetic diversity in Norway spruce has either been limited to coding regions (e.g Huertz et al 2006, Chen et al. 2012) or have used various complexity reduction methods, such as genotyping by sequencing, restriction site associated sequencing, or targeted capture sequencing  to estimate levels of genetic diversity within species. While we have learned a lot about levels of genetic diversity in Norway spruce from such studies, we still lack detailed information on, for instance, levels of nucleotide polymorphism and linkage disequilibrium in non-genic regions. However, with the availability of a reference genome sequences (Nystedt et al 2013), whole genome re-sequencing is now also possible in conifers such as Norway spruce.
Conifer genomes are large (20-40Gb) and have high repetitive content and current draft genome assemblies in conifers are therefore often fragmented into many, relatively short scaffolds. In addition, large fractions of the predicted genome sizes are also missing from reference genomes. The fragmented nature of conifer reference genome assemblies, combined with the high repetitive content make variant calling in conifers difficult. This is true regardless of what techniques have been used to generate sequencing data but perhaps more so for whole-genome re-sequencing data that can be expected to provide a relatively unbiased coverage of the target genome. In this chapter we review methods available for variant calling using NGS data and outline some of the issues one may face when performing analyses of data from whole-genome re-sequencing (WGS) in Norway spruce. In particular we discuss the performance of variant calling across different genomic contexts, such as coding and non-coding regions and regions known to be composed of repetitive elements. We also compare variant calling using WGS data with data derived from sequence capture probes, designed to target non-repetitive sequences in the P. abies genome and discuss how collapsed genomic regions in the assembly complicates the task of filtering for good reliable variant-and genotype calls.
Having access to robust variant calls is important for downstream analyses, such as population genomic analyses or inferences of the demographic history of individuals, populations or the species as a whole. To highlight these issues, we end by assessing how different approaches to variant calling alter the site frequency spectrum of variants and hence possible evolutionary inferences drawn from the data.

Sample collection
We sampled 35 individuals of Norway spruce ( Picea abies ) spanning their natural distributions, mainly from Russia, Finland, Sweden, Norway, Belarus, Poland and Romania for use in whole genome re-sequencing. Individuals Pab001-Pab015 were all derived from unique populations and no specific measurements were taken when they were collected. Samples were taken from newly emerged needles or dormant buds for each individual and stored in -80 ºC until DNA extraction. In contrast, individuals Pab016-Pab035 were sampled from two different areas, one in the eastern and one in the western part of Västerbotten province in northern Sweden. Two different populations were sampled in each area, one old and untouched forest (>100 years old) and one young planted population (<20 years old). For every population a transect was made and five trees were sampled from each population along the transect. From each tree, a number of fresh shoots was broken off and put into pre-labeled zip lock bags before being taken back to the lab for DNA extractions.
Genomic DNA was extracted using Qiagen plant mini kit following manufacturer's instructions. All sequencing was performed at the National Genomics Initiative platform (NGI) at the SciLifeLab facilities in Stockholm, Sweden, paired-end libraries with an insert size of 500bp on different Illumina HiSeq platforms (Pab001-Pab006 on HiSeq 2000, remaining individuals on HiSeq X). The original location, estimated coverage from raw sequencing reads and coverage of mapped reads for each individual are given in Table 1.
Samples analysed using sequence capturing methods were obtained from Bernhardsson et al. ( , 1997 haploid samples) and Baison et al. (2018, 526 diploid samples). For further information on sample collection, DNA extraction and sequencing please refer to the original publications.  (Nystedt et al. 2013); ** Down-sampled to match remaining samples

SNP and genotype calling pipeline
Over the recent decades, Next-Generation Sequencing (NGS) technologies have developed at an extraordinary rapid pace and have led to ever decreasing costs per mega-base sequence generated. This has in turn led to rapid increase in the number and diversity of sequenced genomes ( Goodwin et al. 2016 ). These new sequencing technologies have already facilitated and improved many research avenues in biology, such as transcriptomics, gene annotation and RNA splice identification (Bao et al. 2011). Moreover, meta-genomic ( Schuster 2007) and genome methylation analysis (Morozova and Marra 2008) have also benefited from NGS technologies. However, NGS platforms provide higher associated error rates and generally shorter read lengths compared with traditional Sanger sequencing platforms and therefore require much more careful examinations of the results, particularly for variant discovery and downstream applications (Liu et al. 2012;Goodwin et al. 2016 ). Thus, meaningful analysis of NGS data relies crucially on accurate calling of SNPs and genotypes ( Nielsen et al. 2011 ). Here we first highlight some of the issues that may contribute to ambiguities in SNP and genotype calling, e.g. repetitive DNA, or misalignment of sequence reads, and then review some recent algorithms and tools that are capable of improving the sensitivity and specificity of SNP detection and genotype determination from NGS data. The pipeline for SNP and genotype calling applied in this chapter includes initial reads mapping, post-alignment processing, pre-calling processing, final SNP and genotype calling and subsequent variant filtering ( Figure   1). The steps of this pipeline are exemplified by using whole-genome re-sequencing data from 35 Norway spruce ( Picea abies ) individuals generated by different Illumina HiSeq platforms.

Initial reads mapping (step 1)
Mapping is the most fundamental step in variant detection using NGS and involves aligning raw or pre-processed sequence reads to a reference genome of either the same or a closely related species to reveal the location(s) of reads within the reference genome Flicek and Birney 2010;Wang et al. 2015 ). The accuracy of mapping plays a crucial role in variants calling ( Nielsen et al. 2011 ) as incorrectly aligned reads may lead to errors in SNP and genotype calling, and therefore will restrict the accuracy of all downstream analyses. Although progressively more advanced NGS technologies feature several advantages such as high throughput, lower cost per base and higher accuracy when compared with traditional Sanger-based sequencers ( Bao et al. 2011 ), several obstacles stemming from the inherent characteristics of NGS data need to be considered when performing this step.

Short-read mapping
The first challenge is that reads from NGS platforms are usually short (most delivered read lengths fall in the range of 35 -400 bp), when compared with the reads from Sanger sequencers (600 -800 bp) ( Bao et al. 2011;Mardis 2017 ). In order to use these short reads efficiently, the initial process, before actual alignment, involves the construction of indices for the reference sequence and/or for the raw short-read sequence data known as 'indexing', is intended to speed up subsequent mapping steps. The two most commonly used indexing algorithms have been incorporated in a large number of mapping software tools: (i) indexing algorithms based on hash tables, and (ii) indexing algorithms based on Burrows-Wheeler transform (BWT) . A 'Hash table' is a common data structure which makes it possible to search rapidly through complex and non-sequential data using an index. One option is to scan the hash  ( Flicek and Birney 2010 ), they are also more memory-efficient and are par ticularly useful for aligning repetitive reads ( Nielsen et al. 2011 ). Another advantage of the BWT approach is the ability to store the complete reference genome index on disk and load it completely into memory on almost all standard bioinformatics computing clusters ( Flicek 2009;Flicek and Birney 2010 ).
Currently, the most widely used software tools based on the BWT algorithms include Bowtie

Extremely large amounts of data
Another challenge for working with NGS data is that the amount produced by most NGS methods are orders of magnitude greater than that generated by earlier techniques (Flicek and Birney 2010). For example, Illumina (HiSeq 2000) and Applied Biosystems (ABI SOLiD 4) can deliver hundreds of millions of sequences per run, while Sanger-style sequencer only produces thousands of reads per run ( Hung and Weng 2017 ). Therefore, in order to map a vastly greater numbers of reads (millions or billions) to a reference genome, any algorithm must be optimized for speed and memory usage, especially when reference genomes are very huge. Accordingly, several very memory efficient short-read alignment programs have been developed. For example, Bowtie (http://bowtie-bio.sourceforge.net/index.shtml; Langmead et al. 2009 ) and BWA , as examples of the two most efficient short-read aligners, achieve throughputs of 10-40 million reads per hour on a single computer processor (Treangen and Salzberg 2012).

Repetitive DNA mapping
Repetitive DNA sequences, which are similar or identical to sequences elsewhere in the genome, are abundant across a broad range of organisms, from bacteria to mammals (Treangen and Salzberg 2012). Mapping repetitive DNA sequences to a reference genome create ambiguities of deciding what to do with reads that map to multiple locations, which, in turn, can produce errors when interpreting results. Mapping of repetitive sequencing reads is therefore one of the most commonly encountered problem in read mapping. One possible solution to this problem is to simply remove all multiple mapped reads. However, this discarding strategy may complicate the calculation of coverage by reducing coverage in an uneven fashion ( Li et al. 2008a ), and may result in many undetected biologically important variants when analyses are performed across unique regions involved in the repetitive contents in the genome. Some repeats have already proved to play important roles in, for instance, human evolution ( Jurka et al. 2007;Britten 2010) , sometimes creating novel functions while sometimes acting as independent 'selfish' genetic elements ( Kim et al. 2008;Hua-Van et al. 2011;Treangen and Salzberg 2012 ).
An alternative option is to only report the region with the fewest mismatches (e.g. as done in BWA and Bowtie ). Specifically, when there are multiple equally good best alignment matches, the aligner can either randomly choose one or choose to report all such alignments (e.g. SOAP2 ).
Among those aligners reporting all matches, another choice is to report up to a maximum number ( m ) by ignoring multiple reads that align to >m locations. To deal with these issues, the concept of 'mapping quality' was introduced in several software tools to enable evaluation of the likelihood of correct mapping of reads by considering a number of factors, such as base qualities, the number of base mismatches and/or the existence and size of gaps in the alignment (Li et al. 2008a; Wang et al. 2015 ). For example, in order to evaluate the reliability of alignments, MAQ assigns a Phred-scaled quality score which measures the probability that the true alignment is not the one found by MAQ for each individual alignment (Li et al. 2008a). However, MAQ 's formula overestimates the probability of missing the true hit, which results in an underestimation of mapping quality ( Li and Durbin 2009 ). BWA was developed with a similar algorithm as MAQ but has been modified by assuming the true hit can always be found. Simulation reveals that BWA may therefore overestimate mapping quality, although the deviation is relatively small ( Li and Durbin 2009 ).
In general, the choice of alignment tool and the corresponding parameter settings are very important because the outcome will significantly influence the accuracy of variant calling and further downstream analysis. BWA , which is based on the BWT algorithm, is much faster compared to many other programs based on hash table algorithm at the same sensitivity level, has more efficiently memory usage, which is par ticularly useful for aligning repetitive reads, and has a smaller deviation of mapping quality. This is therefore often the best choice for mapping raw sequence reads to a reference genome. BWA provides three different mapping methods: 1) BWA-MEM that is usually suggested for 70bp or longer Illumina, 454, Ion Torrent and Sanger reads, and is also generally recommended for high-quality queries as it is faster and more accurate; 2) BWA-backtrack is especially suited for short sequences and 3) BWA-SW which may have better sensitivity when alignment gaps are frequent. Consequently, in our whole genome re-sequencing project of Norway spruce, we have employed BWA-MEM to align all paired-end reads to the reference genome.

Post-alignment processing (step 2)
After mapping reads to the reference, various post-alignment data processing are usually suggested to facilitate further analytical steps. The most common tasks include output file manipulation (e.g. format converting, indexing) and the creation of summary reports from the alignment process.  (Li et al. 2009c) . Moreover, a detailed summary statistics of aligned output are necessary for evaluating the overall quality and correctness of alignment, which benefits for following steps such as SNP detection. Some aligners usually generate a simple summary describing the alignment process (e.g. Bowtie2 , SOAP2 ), while some are not available, such as BWA . Thus, it is possible to generate a summary report by using other software, e.g. the SAMtools flagstat (Li et al 2009c) and CollectAlignmentSummaryMetric s module of Picard ( http://broadinstitute.github.io/picard/ ).
After mapping raw sequencing reads to the reference genome using BWA-MEM , we have employed S AMtools to manipulate output files, for instance for sorting and indexing. Besides, we also used flagstat in SAMtools to generate summary statistic reports including e.g. the total number of reads that pass or fail QC (quality controls), in order to evaluate the quality of the alignments produced during read mapping. In addition, considering the characteristics of Norway spruce genome which has an extremely large genome size (~20Gb) and very high repetitive content (~70% ) ( Nystedt et al. 2013), it is challenging to accurately call variants both from a computational perspective and from the large data volumes that need to be handled and stored in this project. B efore SNP and genotype calling, we therefore performed several steps to reduce the computational complexity of the SNP calling process. First we split both the output file from the mapping step and the reference into smaller subsets so that multiple data sets can be manipulated in parallel. Also, smaller data sets (in this case number of scaffolds) are crucial to allow software tools, such as GATK, to function properly.
We first reduced the reference genome by only keeping genomic scaffolds greater than 1kb in length using bioawk ( https://github.com/lh3/bioawk ). All BAM files were then subsetted using the reduced reference genome with the view module in SAMtools . All reduced BAM files for each individual, were then merged into a single BAM file using the merge module in SAMtools (Li et al. 2009c). The final step was to split the merged BAM files of each individual into 20 genomic subsets, with each subset containing roughly 100,000 scaffolds, using SAMtools view . We simultaneously subdivided the reference genome into the corresponding 20 genomic subsets by keeping exactly same scaffolds by using bedtools . Indexing was performed for both BAM subsets and reference subsets to make them available for subsequent analyses.

Pre-calling processing (step 3)
Additional processing steps are usually recommended before proceeding to the actual variant calling, so that variant detection results can be more reliable (Li et al. 2009c DNA from a gel slice to get uniform fragment lengths (Li et al. 2009c). Consequently, the same read pairs may occur many times in the raw data and this will result in unexpectedly high depths in some regions , resulting in a skewed coverage distribution (Mielczarek and Szyda2016) .
Further, excessively high read depth could potentially lead to large frequency differences between the two alleles of a heterozygous site (Li et al. 2009c), which may subsequently bias the number of variants and may substantially influence the accuracy of the variant detection (Wang et al. 2005). Removal of these artifacts is thus an essential pre-calling processing step, in particular on applications based on re-sequencing data. Picard MarkDuplicates ( http://broadinstitute.github.io/picard /) and Samtools Rmdup (Li et al. 2009c) are two commonly used software which allow the user to either mark these duplicate reads or completely remove them from the alignment. Both methods rely on similar approaches, however, compared with rmdup , MarkDuplicates is likely a better choice as it also handles inter-chromosomal read pairs, considers the library for each read pair, keeps a read pair from each library and does not actually remove reads but rather flags the duplicates by setting the SAM flag to 1024 for all but the best read pair (Ebbert et al. 2016).

Local realignment
Another important preprocessing step involves alignment corrections. Artefacts in alignment, usually occurring in regions with insertions and/or deletions (indels) during the mapping step, can result in many mismatching bases relative to the reference in regions of misalignment. Such misaligned regions can be easily mistaken as SNPs during variant calling.
Read mapping proceeds by independently mapping each read, and reads covering an indel at the start or end of the read can often be incorrectly mapped even when other reads are correctly In this project, we first marked potentially PCR duplicates by using MarkDuplicates in Picard . We then performed local realignment by first detecting suspicious intervals using RealignerTargetCreator , followed by realignment of those intervals using IndelRealigner , both implemented in GATK ( DePristo et al. 2011 ). Both of these steps of pre-call processing were run separately on the 20 genomic subsets of each individual (700 subsets in total across all 35 individuals).

SNP and genotype calling (step 4)
The development of NGS has made it possible to identify a large number of variants in almost any organism. Sequencing entire genomes potentially allows for the discovery of all existing polymorphisms, and can thus identify not only common but also rare SNPs which have been implicated in controlling many complex phenotypes ( Mielczarek and Szyda 2016 ). This means that it should be possible to, at least in theory, identify the true causal mutations directly from resequencing data rather than relying on using linkage disequilibrium between unknown causal mutations and marker SNPs. However, to achieve this, SNP detection and genotype calling need to be performed across polymorphic sites in the NGS data. SNP calling, also known as variant calling, is aimed at detecting sites which differ from a reference sequence, while genotype calling is the process of determining the actual genotype for each individual based on positions in which a SNP or a variant has already been called (Nielsen et al. 2011;Li et al. 2013 However, t hese methods can be improved by using more empirically determined cutoffs ( Nielsen et al. 2011 ).

Probabilistic methods
More recent approaches integrate several sources of information within a probabilistic framework. One of the advantages of a probabilistic framework is that it facilitates SNP calling in regions with medium to low coverage compared with h euristic methods ( Altmann et al. 2012 ).
For moderate or low sequencing depths, fixed cutoffs based genotype calling will lead to under-calling of heterozygous genotypes and some information will inevitably be lost when using static filtering criteria (Nielsen et al. 2011). Another advantage of probabilistic methods is that it can account for uncertainty in genotype call, making it possible to monitor the accuracy of genotype calling ( Altmann et al. 2012;Mielczarek and Szyda 2016 ). Additional information concerning allele frequencies and/or patterns of linkage disequilibrium (LD) can thus be incorporated in downstream analysis (Nielsen et al. 2011;Mielczarek and Szyda 2016 ).
SNP and genotype calling based on probabilistic methods usually involve the calculation of genotype likelihood which can be used to evaluate the quality scores for each read. To improve the accuracy of calculation of genotype likelihoods, the following parameters can be considered: 1) a weighting scheme should be used that takes correlated errors into account (Li et al. 2008a), 2) recalibrating the per-base quality scores by using empirical data will also improve genotype likelihoods ( Nielsen et al. 2011 ) and 3) information from the SNP-calling step should be incorporated into the genotype-calling algorithm, leading to genotype likelihoods that are calculated by conditioning on the site containing a polymorphism ( Nielsen et al. 2011 ). By adopting a Bayesian approach for variant calling, prior genotype probabilities can be combined with the estimated genotype likelihood to calculate the posterior probability of a particular genotype. The genotype with the highest posterior probability will then generally be chosen and this probability or the ratio between the highest and the second highest genotype probabilities will be used as a measure of confidence in the variant call (Nielsen et al. 2011;Li et al. 2013; Mielczarek and Szyda 2016 ).
A prior probability of genotype must be assumed because it is a prerequisite to be able to calculate the posterior probability for a genotype. When data is analysed from a single individual, either a uniform prior probability is chosen that assign equal probability to all genotypes or a non-uniform prior can be based on external information, such as dbSNP (SNP database) entries, the reference sequence or an available population sample (Nielsen et al. 2011;Li et al. 2013 ) . Jointly analyzing multiple individuals will improve the prior probabilities by considering allele or geno type frequencies, using for example maximum likelihood (Li et al. 2009b;Martin et al. 2010 ) and then by using a Hardy-Weinberg equilibrium (HWE) assumption or other assumptions that relate allele fre quencies to genotype frequencies to calculate genotype probabilities (Nielsen et al. 2011 ) . Moreover, a significant improvement in genotype-calling accuracy can be achieved by considering linkage disequilibrium (LD) information ( 1000 Genomes Project Consortium 2010). However, this approach is not very efficient for rare mutations.

Commonly used software tools based on probabilistic methods
Several software tools have been developed that combine probabilistic framework with Bayesian analysis for variant calling, such as SAMtools (Li et al. 2009c), GATK (DePristo et al.

2011) and
FreeBayes (Garrison and Martin 2012). All these software also support the use of multiple sample SNP calling. Samtools perform SNP and genotype calling in two steps:1) mpileup in Samtools computes the possible genotype likelihood and stores those information in BCF (Binary call formats), and 2) BCFtools from the SAMtools packages applies the prior and does the actual variant calling based on genotype likelihoods information calculated in the previous step (Li et al. 2009c;Wang et al.2015). Compared with Samtools , GATK , based on similar algorithms, features an advantage of automatically applying several filters before processing variant calling or other pre-and post-processing steps, e.g. filtering out reads that fail quality checks and reads with a mapping quality of zero. HaplotypeCaller, the most popular SNP and genotype calling module in GATK , c an discard the existing mapping information and completely reassembles reads in the region whenever a region show signs of variation. It thus allows for more accurate calling in regions that are traditionally difficult to call, such as regions containing different types of variants close to each other. Four steps will be performed by using HaplotypeCaller in GATK : 1) determining the regions of the genome that it needs to operate on by detecting significant evidence for variation ; 2) determining haplotypes by local assembly of the active region by building a De Bruijn-like graph and by identifying potential variant sites by realigning each haplotype against the reference haplotype, 3) determining likelihoods of the haplotypes given the read data by a pairwise alignment of each read against each haplotype using the PairHMM algorithm and 4) assigning sample genotypes based on Bayes' rule . Liu et al. (2013) (Figure 1). The raw variant calls are likely to contain many false positives arising from errors in the genotyping step or form incorrect alignment of the sequencing data and the called variants therefore needs to be subjected to a number of subsequent filtering steps (e.g. missing data, depth and mapping quality) to produce data that is of sufficient quality for answering the biological questions of interest. In addition to the usual problems of calling variants from data with low coverage, SNP calling in conifers face the extra problem of possible collapsed repetitive regions in the assembly. The spruce genome, as most conifers, is highly repetitive and the v1.0 P. abies assembly is missing approximately 30% of the predicted genome size (Nystedt et al 2013). These genome regions are either completely absent from the reference assembly or present as collapsed regions of high sequence similarity. This introduces problems for read mapping and variant calling as it increase the probability of calling false SNPs in collapsed regions since reads that in reality derive from different genomic regions are mapped at a single region in the assembly. In order to reduce the impact of these issues on variant calling, we performed a second filtering step so that genotype calls with a depth outside the range 6-30 and a GQ < 15 were re-coded to missing data. We then filtered each SNP for being variable with an overall average depth in the range of 8-20 and a 'maximum missing' value of 0.8 (max 20% missing data). All info tags were then recalculated with bcftools fill-tags . In order to evaluate the filtering parameters, we extracted summary statistics for each of the 20 genomic subsets and filtering steps using bcftools stats (Li et al. 2009 vcf=pysam.VariantFile("Subset.vcf.gz") out=open("Subset_Allele_stats.txt","w") out.write("{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\n".format ("Chrom","Start","End","HWE","ExcHet","Called_samples","Allele_frequency" ,"Heterozygotes","Alt_read_proprotion","Total_Het_DP","Total_DP")) for record in vcf. For analyses of different genomic regions separately (ie. repeat regions, outside repeat regions, genic regions and exonic regions), we subset the allele statistics file (above) to corresponding regions in files Pabies1.0-all.phase.changed.gff3 and

Results
The raw unfiltered VCF files contain a total of 750 million records of which 710 million were single nucleotide polymorphisms (SNPs). After hard-filtering using GATK best-practice recommendations ( https://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set ), 545 million SNPs (76.8%) remained. These variants were distributed over 94.7% of the 1,970,460 scaffolds in the assembly that are longer than 1 Kb (Table 2). When comparing the alternative allele frequency with the proportion of heterozygous calls per SNP, it is obvious that we have a high proportion of SNPs that do not follow Hardy-Weinberg expectations (Figure 2A).
SNPs with either too high or too low fraction of heterozygous calls most often also show a greater deviation in allele ratio in their heterozygous calls compared to SNPs following H-W expectations ( Figure 2B-D).
In genotyping by sequencing (GBS) data, often applied to species that lack a reference filtering for sequencing depth, so that all accepted calls have a coverage within the expected span (based on the estimated sequencing depth). We therefore re-coded individual genotypes to missing data if they had a depth that was too low to reliable call both alleles (< 6 reads covering a site), or if their depth was too high compared to the overall coverage (> 30 reads covering a site), which makes it likely that reads are derived from multiple genomic positions. Calls with a genotype quality (GQ) less than 15, which indicates that the difference in likelihood between the best and second best genotype is small, were also re-coded to missing data. We then filtered on an overall average depth per SNP of 8-20 to reduce the proportion of possible erroneous calls further and also added a threshold of maximum 20% missing data in order to keep a SNP for downstream analysis. 296 million SNPs, corresponding to 41.7% of the raw unfiltered SNPs, remained after these filtering steps and were distributed over 63.2% of the >1Kb scaffolds (Table 2). When comparing the proportion of heterozygous calls to the alternative allele frequency, a noticeable amount of the SNPs with heterozygous deficiency has been filtered out ( Figure 2E). The deviation from a balanced allele ratio of heterozygous calls is also visibly lower ( Figure 2F-H) even though the data still suffers from SNPs showing heterozygous excess at intermediate alternative allele frequencies ( Figure 2H at ~0.5). Such excess of heterozygous could be explained by real biological phenomena, such as balancing selection (Charlesworth 2006), or arise due to artefacts caused by collapsed/duplicated regions in the assembly as discussed above (McKinney et al. 2017). Regions under balancing selection should be highly localised in the genome while the risk of collapsed regions in the spruce assembly is a priori expected to be high since we know that the genome contains a huge proportion of retrotransposon-derived sequences (Nystedt et al. 2013, Cossu et al 2017. To filter out SNPs with a significant excess of heterozygotes therefore seems justifiable for the data at hand. Heterozygous deficiency, on the other hand, can be both explained by errors in variant calling in low-depth regions or by population structure (Hartl and Clark 1989). The latter is a much more likely scenario for our data set, since samples are derived from all over the distribution range of P. abies and we also expect this pattern to manifest itself across most of the genome. We therefore chose not to filter on this criteria. By removing all SNPs showing a significant excess of heterozygotes ( p -value < 0.05), we ultimately retained 294 million SNPs (corresponding to 41.4% of the unfiltered SNPs) from 63.2% of the > 1kb scaffolds (Table 2, Figure 2I-M).

Comparison of WGS and reduced representation sequencing data
Even with a number of conifer draft genomes available (Nystedt et al. 2013, Stevens et al 2016, the method of choice for analysing sequence diversity in conifers has been different kinds of reduced representation sequencing methods, such as genotyping by sequencing (GBS), restriction site associated DNA sequencing (RadSeq) or capture probe sequencing (Syvänen 2005, Andrews et al. 2016. These methods all have in common that they target a small fraction of the target genome, the only thing that differ is how this fraction is selected. In order to compare the behaviour of our spruce WGS data with data derived from a reduced representation sequencing technique, we analysed at set of 526 samples that had been genotyped using a set of 40,018 sequence capture probes that had been designed to target regions within genic regions of the v1.1 P. abies genome assembly . The same filtering steps as described in the preceding section were used for the capture probe data set, although the thresholds for individual depth range (6-40), overall average depth range (10-30) and significance level for excess of heterozygotes ( p -value <1e-10) were altered to fit the size of data set (the number of samples and estimated sequencing depth).
Even though the probes were designed to target unique regions in the assembly , the GATK SNP quality filter show a remarkably high level of SNPs with deviations from Hardy-Weinberg equilibrium ( Figure 3A) with large deviations from balanced allele ratios in heterozygous calls ( Figure 3B-D).  Figure 3). However, the capture probe data set only contain plus trees sampled from Southern/Central Sweden  and it is therefore not surprising we observe less effects of population structure in this data, since these trees cover a much smaller geographic region than the samples used in the WGS data. Nevertheless, even after removing SNPs showing an excess of heterozygotes, we see a bias towards the reference allele in the sequence capture data that is not visible in the WGS data (bias towards allele frequencies <0.5 in Figure 3J). This is most probably due to an artefact of using probes, since the probes were designed against the reference alleles of the genome and likely work better in regions with low to moderate amounts of variation .
To evaluate the proportion of collapsed regions within genic regions even further, we analysed 1997 haploid sequence captured samples that had previously been mapped to the whole genome, called using a diploid ploidy setting and subset to the probe regions ± 100 bp . These samples, which should only have homozygous calls, showed an average heterozygosity level of approximately 3.7% (obviously contaminated samples with heterozygosity levels > 10% removed). 114K SNPs remained after GATK hard filtering and with no more than 30% missing data, and these showed a median heterozygosity level of 0.01 (1st and 3rd quartile of 0.004 and 0.017, respectively). However, 10% of the SNPs experienced a heterozygosity level > 0.05 (corresponding to 70-98 heterozygous calls depending on call rate), with a maximum of 0.95 (data not shown).

Comparisons of genic, inter-genic and repetitive regions
In order to analyse how different regions of the spruce genome behave with regards to deviations from Hardy-Weinberg equilibrium, we subdivided the GATK SNP quality + depth and genotype quality filtered WGS data set into four sets: inside repeat regions (  In order to understand how the two filtering steps, GATK best practice SNP quality filters (hereafter called GATK ) and GATK practice SNP quality filters and depth and genotype quality (hereafter called GATK+GT ), change the SNPs retained across the different genomic subsets, we analysed summary statistics from all subsets. The fraction of SNPs retained in subsest after the GATK filter were strongly negatively correlated with fraction of sites covered by repeats in the subsets ( Figure 5A, correlation -0.92,p-value=1.4e-8). This correlation was reduced by using the GATK+GT filtering criteria to the subsets ( Figure 5A, correlation between fraction retained SNPs and repeat coverage of -0.77, p-value=8.1e-5), but remained negative resulting in fewer SNPs being called in genomic subsets containing higher fraction of repetitive sequences. The median physical distance between SNPs increased from 16.5 bp for the GATK filtered subsets (ranging from 15.0 to 55.7 bp per subset, with an overall average of 17.3 bp) to 36.3 bp for the GATK+GT filtered subsets (ranging from 23.2 to 512.8 bp per subset with an overall average of 32.0 bp) ( Figure 5C, outliers not shown). Even though this suggest a high level of nucleotide diversity in Norway spruce, most alleles are rare and as many as 26% of the SNPs appear as singletons in the GATK+GT filtered data set, compared to 21% in the GATK filtered data set. The GATK+GT filter has a large impact on the fraction of scaffolds that retain any SNPs, decreasing from 94.7% for the GATK filtered data set to only 63.2% under GATK+GT ( Figure 5B). The transition-transversion ratio similarly decreased with the GATK+GT filter ( Figure 5D). alternative, a genotype which should be impossible in this individual, in 0.07% of the calls using the GATK filtering criteria. This fraction was reduced to 0.02% when using the GATK+GT filtering criteria, suggesting that the depth filter do improve the quality of the genotype calls and gives an overall higher average depth in comparison to the GATK filtered data set ( Figure 6C).
Although the average sequencing depth of called genotypes increase following GATK+GT filtering, the proportion of missing calls also increase, and reaches 30% for samples with the lowest estimated sequencing coverage ( Figure 6E). The increase in missingness do not affect the proportion of singletons per sample, which stays roughly constant under both the GATK (average of 0.8%) as well as for the GATK+GT (average of 0.9%) filtering criteria ( Figure 6F).

Effects of filtering on estimates of the site frequency spectrum
To analyse how different summary statistics regarding the site frequency spectrum (SFS) is affected by filtering parameters ( GATK and GATK + GT + ExcHet ), we analysed Tajima's D (Tajima 1989) and pairwise nucleotide diversity (π, Hartl and Clark 1989) on a per scaffold basis for genomic subset 6, for inside repeat regions, outside repeat regions, genic regions and exonic regions separately. The GATK filtered data showed an overall higher estimate of Tajima's D than the GATK + GT + ExcHet filtered data for all four genomic regions (average of -0.35 --0.13 and -0.57 --0.53 for GATK and GATK + GT + ExcHet , respectively). These estimates were however highly correlated between the filtering parameters in all four genomic regions (correlation of 0.77-0.82 with p-value < 2.2e-16 for all four genomic regions, Figure   7A-D). The GATK filtered data also show an overall slightly higher nucleotide diversity level (π) compared to the GATK + GT + ExcHet filtered data (average of 3.7e-4 -1.6e-3 and 2.1e-4 -1.0e-3 for GATK and GATK + GT + ExcHet , respectively), with lower diversity level for the same amount of analysed variants in the fully filtered data set ( Figure 7E-H). Smaller differences regarding the SFS between genomic regions could be found in the fully filtered data in comparison to the GATK filtered data, indicating that we managed to remove false SNPs due to collapsed genomic regions in the assembly without altering the SFS by filter for minor allele frequencies.

Conclusions and recommendations
Having high-quality variant calls is essential for downstream analyses such as population genomic studies, inferences of demographic history or complex trait dissection using, for instance, genome wide association studies (Nielsen et al 2011). Variant calling in conifers pose a number of challenges arising from the complex highly repetitive nature of conifer genomes.
Even in the most complete reference genomes of conifers, large regions of the genome are still lacking and this cause problems for mapping of short (<250bp) sequencing reads, since reads from such regions will not match any region in the reference genome. Such reads may fail to map altogether and will be eliminated from further steps in the variant calling pipeline. However, if they are derived from repetitive regions, highly similar, genomic regions can be represented in the assembly allowing read mapping. To a lesser degree these issues are true also for sequencing reads from repetitive regions that are represented in the assembly as these reads may match equally well to multiple regions in the reference genome. The end result in both cases is thus that some regions are being covered by sequencing reads originating from different genomic regions which may ultimately result in the calling of false genetic variants as nucleotide substitutions that differentiate the paralogous regions are called as true SNPs (Gayral et al., 2013). Such variants often can have high quality scores, making it hard to filter them out using even the best practice procedures for variant filtering (Nielsen et al 2011).
These issues are apparent in our data even when using what is considered rather stringent criteria for variant filtering (Figure 2). To alleviate these issues we have implemented further filtering on sequencing depth and on excessive heterozygosity. Filtering on too high depth is expected to target primarily false variants that are derived from possible repetitive regions where as filtering on too low depth targets variants where the low sequencing depth make calling of both alleles in heterozygous individuals unlikely. Filtering on low depth should thus target called variants that may have deflated levels of heterozygosity where as filtering on high depth primarily should target variants with excessive heterozygosity. However, even very stringent filtering on high depth does not completely eliminate problems the excess heterozygosity and we therefore have included a final filtering step aimed specifically at excess heterozygosity ( Figure   2).
To a certain extent, having access to better and more contiguous reference genomes will alleviate some of these issues but it will likely not eliminate them completely. So even as conifer reference genomes improve due to, for instance, the use of long-read sequencing technologies re-sequencing and variant calling from short read data will still be difficult and fraught with both false positive and false negative SNPs. One possible solution to these issues is to base re-sequencing on haploid tissue, e.g from megagametophytes. Using haploid tissue has the benefit that heterozygous genotypes should be absent and any heterozygous genotype call observed is thus likely to arise to to either sequencing errors or from mapping of paralogous reads to a single region in the reference genome. Utilizing haploid tissues to rule out issues arising from paralogy has already been used in conifers to improve, for instance, transcriptome assembly (Ojeda et al. 2018). Another possible way forward would be to utilize longer reads for re-sequencing, thereby reducing the probability that a single read may map to multiple regions in the reference genome. Although the cost and throughput of current long-read technologies make genome re-sequencing of conifers using exclusively long reads infeasible, declining costs and technology development could open up possibilities for re-sequencing using long reads in the future.