Genomic evolutionary analysis in R with geaR

The analysis and interpretation of datasets generated through sequencing large numbers of individual genomes is becoming commonplace in population and evolutionary genetic studies. Here we introduce geaR, a modular R package for evolutionary analysis of genome-wide genotype data. The package leverages the Genomic Data Structure (GDS) format, which enables memory and time efficient querying of genotype datasets compared to standard VCF genotype files. geaR utilizes GRange object classes to partition an analysis based on features from GFF annotation files, select codons based on position or degeneracy, and construct both positional and coordinate genomic windows. Tests of genetic diversity (eg. dXY, π, FST) and admixture along with tree building and sequence output, can be carried out on partitions using a single function regardless of sample ploidy or number of observed alleles. The package and associated documentation are available on GitHub at https://github.com/CMWbio/geaR.


Introduction:
23 Improvements in genome sequencing technologies has led to increased production of data at lower 24 relative cost per base (Schwarze et al. 2020). Genome-wide sequencing datasets with hundreds of 25 samples can be produced for population genomic analysis, allowing researchers to investigate population and evolutionary history at an unprecedented scale. However, due to file size and data 27 complexity downstream problems during data storage and analysis can arise. The most common 28 format for handling genome-wide SNP data is the Variant Call Format (VCF), which has historically 29 had a large memory overhead when being read into an R environment. To resolve this, the Genomic 30 Data Structure (GDS) format has allowed all genotype and metadata to be compressed into a 31 queriable, on-disk file that substantially reduce memory requirements and decrease analysis time 32 (Zheng et al. 2017). The GDS format provides an efficient format for filtering SNP data in order to 33 perform Principal Component Analysis, estimate genetic relatedness and tests for genetic 34 association (Zheng et al. 2012). 35 GDS files use GRange objects from the GenomicRanges package (Lawrence et al. 2013) to define loci 36 to query from file and import into R. In their most basic form, GRange class objects define genomic 37 loci based on reference position. Although widely used throughout Bioconductor, GRange objects, to 38 our knowledge, have not been utilized in the same manner to define loci for evolutionary analyses. 39 Few R packages attempt to carry out genome-wide investigation of genotype data. Most packages 40 focus on the analysis of single or multi-locus data, with the notable exception of PopGenome (Pfeifer 41 et al. 2014). However, one limitation of PopGenome is customizability of how the target genome is 42 partitioned, and which sites are selected for analysis. Most tools, including PopGenome, allow 43 datasets to be partitioned into sliding or tiled windows based on reference or SNP position. 44 PopGenome also provides methods to split data into GFF attributes, however selection of bespoke 45 partitions not possible. This makes calculating population metrics on specific codon positions (eg. 46 four-fold or zero-fold degenerate sites) or analysing many non-contiguous loci difficult and time 47 consuming. 48 To overcome these issues, here we present the R package geaR, which leverages the GDS format, to 49 efficiently construct GRanges containing genome-wide or local loci of interest and to carry out 50 common tasks for evolutionary analysis on genome-wide genotype data using a single function. Furthermore, we provide methods to partition the genome based on annotation, codon position or 52 degeneracy through utilizing data in GFF files, or using reference genome coordinates or genotype 53 position. 54

Features:
55 Input data 56 Genotype input files are required to be in GDS format, enabling high compressibility compared to 57 gzipped VCFs (>5X smaller on disk), efficient querying and the capability to work on large datasets 58 with a reduced memory footprint (Zheng et al. 2017). Conversion of sample genotypes in the VCF 59 format to a Genomic Data Structure (GDS) format can be performed using the SeqArray package 60 (Zheng et al. 2017) before analysis with geaR. Genotypes called at any level of ploidy can be utilized 61 in geaR, which includes whole genome sequence data generated from pools of two or more 62 individuals. 63

Partitioning the genome using GRanges 64
The geaR package utilizes GRange objects to define partitions for the analysis, for example, 65 segmenting a genome into 10-kb windows. This allows users to define their own GRanges for the 66 analysis or build them with provided functions. Currently, users are able to generate both coordinate 67 (based on reference coordinate) and positional (based on genotype number) windows using 68 makeWindows() or makeSnpWindows() functions. Sequence features, such as protein coding 69 regions, can be extracted from a GFF with getFeatures(). 70 Many evolutionary analyses seek to calculate population metrics over different codon positions. To 71 make this as simple as possible, geaR provides methods to index a reference genome according to 72 codon position with buildCodonDB(), which can either be stored in memory as a GRangesList object 73 or an SQLite database (DB) on disk to limit static memory usage. Users may then filter codons based 74 on degeneracy (0-fold or 2-fold) and position using the function filterCodonDB(). A codon DB can also be passed to the function validate4fold() to select 4-fold degenerate sites across the genome that 76 are empirically supported in the GDS file. This is done by querying the GDS to i) remove codons with 77 missing data, ii) select 4-fold degenerate codons, iii) remove all those where codon positions one or 78 two have variation and iv) select third positions. 79 GRange objects generated using geaR can then be combined using mergeLoci() to further customize 80 partitions. For example, genome-wide tiled windows can be combined with four-fold degenerate 81 sites to output either genomic windows that contain only 4-fold sites or all sites excluding 4-fold 82 degenerate sites. 83 Figure 1: Structure of the of the gear S4 object: i) Genomic loci (GRanges) to carry out analyses 84 across, ii) Population metadata encoding sample names to the population/species they belong to, iii) 85 A cog containing general arguments for all analysis, iv) a cog specifying that the Genetic Diversity 86 module should be carried out and v) a cog specifying that the Admixture cog should be carried out 87 on the dataset. 88 89

Setting up an analysis: cogs and gears 90
geaR operates through two S4 classes, the 'cog' and 'gear' (Figure 1). Cogs, built using makeCog(), 91 specify multiple analyses to carry out (see Table 1) setting parameters specific to each analysis. A 92 single gear object can then be constructed, using makeGear(), which contains all of the specified 93 cogs for analysis, along with the genomic loci and population metadata (Figure 2A Functions specific to geaR are within the orange boundary and external functions in purple. After 99 converting the VCF to GDS format using SeqArray, contig metadata (contig length) is used to 100 construct windows across the genome. A dataframe containing population metadata defining 101 population to sample grouping is then constructed. This is used, along with windows for the analysis 102 and cogs, to construct the gear class object. The analysis is then carried out on the gear object and 103 outputs depend on which cogs were specified. B) Workflow used to generate partition schemes for 104 examples in Figure 3. First 10kb windows were generated from contig metadata. This was followed 105 by generation of a codon database that indexes codon position in a reference genome. The codon 106 database was then passed to the function filterCodonDB to output separate loci-sets for four codon 107 partition schemes: 1 st +2 nd ; 3 rd ; 0-fold and 2-fold. validateFourFold() was also used to select 4-fold 108 degenerate sites that are supported by genotypes in the GDS file. Each of these codon loci-sets can 109 then be passed to the function mergeLoci(), along with the 10kb windows, to combine loci into 10kb 110 windows that contain only the selected codon types. 111 112 113

Analysis types 114
Four different cogs can be generated to carry out an analysis: i) genetic diversity, ii) admixture, iii) 115 outputLoci and iv) outputTrees. Genetic diversity allows the calculation of a range of population 116 metrics (Table 1) Outputs of both genetic diversity and admixture cogs can be summarized using summarizeStats() 124 which calculates a mean and median values for each statistic across all loci using a block jack-knife 125 approach. 126 Table 1: Analysis types and functionality available to apply at each locus. 127

Cog type Functionality Genetic diversity
Nucleotide diversity (π), genetic distance (dXY), maximum distance (dmax), minimum distance (dmin), ancestral distance (da), γST, relative node distance (RND), minimum relative node distance (RNDmin), and Gmin Admixture f4 and ̂ Output loci fasta format output to file Output trees newick format distance trees and metadata output to file 128 windows across scaffold_4 of the diamondback moth reference genome. C) The five loci-sets 157 constructed using the workflow in Figure 2B were used to calculate nucleotide diversity (π) at 158 1 st +2 nd , 3 rd , 0-fold, 2-fold and 4-fold codon sites across scaffold_4 of the diamondback moth 159 reference genome C) Admixture metrics f4 and ̂ calculated on 100kb windows across scaffold 160 NW_011876398.1 of the B. dorsalis reference genome. D) Distance trees for each 100kb window 161 across NW_011876398.1 output using the outputTrees cog showing a mixture of discordant and 162 concordant topologies. Plots A), B) and C were generated using ggplot2 (Wickham 2009) and D) 163 using densitree (Bouckaert 2010 Figure 3C).
We also used the outputTrees cog to output distance trees for each of these windows to illustrate 171 both the congruent and incongruent topologies resulting from partial admixture on 172 NW_011876398.1 ( Figure 3D