RAREsim: A simulation method for very rare genetic variants

Identification of rare variant associations is crucial to fully characterize the genetic architecture of complex traits and diseases. Essential in this process is the evaluation of novel methods in simulated data that mirrors the distribution of rare variants and haplotype structure in real data. Additionally, importing real variant annotation enables in silico comparison of methods that focus on putative causal variants, such as rare variant association tests, and polygenic scoring methods. Existing simulation methods are either unable to employ real variant annotation or severely under- or over-estimate the number of singletons and doubletons reducing the ability to generalize simulation results to real studies. We present RAREsim, a flexible and accurate rare variant simulation algorithm. Using parameters and haplotypes derived from real sequencing data, RAREsim efficiently simulates the expected variant distribution and enables real variant annotations. We highlight RAREsim’s utility across various genetic regions, sample sizes, ancestries, and variant classes.


Introduction 1
Studies of rare variants are important to gain a full understanding of the genetics of health 2 and disease, informing targeted drug development and precision medicine. Rare variants (minor 3 allele frequency (MAF) <1%) have been associated with traits across many diseases including 4 cancer, kidney, neurodevelopmental, cardiovascular, and infectious disease. With decreasing 5 sequencing costs, rare variant data are increasingly accessible 1 resulting in large sequencing 6 studies (e.g. >35,000; 45,000; and 70,000 subjects), databases including the UKBiobank, 7 GenomeAsia, and NIH programs such as the Genome Sequencing Program (GSP) and Trans-8 Omics for Precision Medicine (TOPMed). Rare variant methods continue to be developed (e.g. 9 SKAT-O, iECAT, ProxECAT, and ACAT) to take advantage of the ever increasing sequencing 10 data. 11 Simulation studies enable evaluation of methods and study design (e.g. power and sample 12 size estimates) in known and controlled settings. Simulations that do not adequately mirror 13 essential properties of real data may have issues generalizing to real data, potentially resulting in 14 incorrect conclusions of method efficacy or power. In general, four qualities are necessary to 15 emulate in rare variant simulations of a genetic region: (1) allele frequency spectrum (AFS), (2) 16 total number of variants, (3) haplotype structure, and (4) variant annotation. To our knowledge, 17 no rare variant simulation method currently exists that incorporates all four qualities. 18 (1) AFS is the distribution of variant allele frequencies within a genetic region. Numerous 19 studies have shown that the AFS is skewed towards very rare variants with the vast 20 majority of variants being singletons and doubletons 2-4 . 21 simulation dataset is required, the target dataset is not necessary if default or user defined 1 parameters are used. 2 RAREsim has three main steps: (1) simulating haplotypes, (2) estimating expected 3 number of variants, and (3) pruning rare variants to match expected. A flowchart summarizing 4 the RAREsim algorithm is in Figure 1. 5

(1) Simulating an abundance of rare variants 6
RAREsim uses HAPGEN2 13 to simulate haplotypes for individuals. HAPGEN2 simulates 7 haplotypes by creating mosaics of input and already simulated haplotypes, using recombination 8 information so that regional LD is retained 13,15 . When all sequencing bases, including 9 monomorphic, are included in the input haplotypes more de novo variants are simulated than 10 expected. By sampling from previously simulated haplotypes to create a new haplotype, 11 HAPGEN2 resamples de novo variants resulting in inflated numbers of rare variants. 12

(2) Estimating expected number of variants per MAC bin 13
The number of variants per MAC bin is estimated using two functions. The parameters for 14 these functions are estimated using target data (described below). Alternatively, user-defined or 15 default parameters can be used, eliminating the need for the user to provide and fit target data. 16 The default parameters were derived using the default target data (Methods). 17

(2a) Number of Variants Function 18
The total number of variants in a region depends on the sample size. RAREsim estimates the 19 expected number of variants per kilobase (Kb) for a sample size using the Number of Variants 20 function. Specifically, 21 where ( ) is the number of variants per Kb for individuals. The parameters and 1 are estimated to modify the scale and shape of the function, respectively. When simulating 2 individuals, RAREsim calculates the total number of variants in the region by multiplying the 3 size of the region in Kb, , by the expected number of variants per Kb, ( = ). 4 The target data used for the Number of Variants function provides the observed number 5 of variants per Kb, , in the simulation region observed at sample size . The parameters are 6 optimized by minimizing a least squares loss function summed over all observed sample sizes in 7 the target data: 8 min , (∑( − ) 2 ). 9 Sequential quadratic programming (SQP) via the slsqp function in the nloptr R package 17 is used 10 with constraints 0 < < 1 and > 0 to minimize the loss function. The , 20 The scale parameter ensures the sum of all individual rare MAC proportions equals the total 2 proportion of rare variants observed in the target data, . Parameters and determine the 3 shape of the distribution. 4 Because individual MAC = may have no observed minor alleles in the target data, 5 particularly for higher rare MACs (e.g. MAC=10, 11, 12), MAC bins are used to optimize 6 parameters. MAC bins are mutually exclusive, exhaustive groups of rare MACs. Seven MAC 7 bins are used here: singletons, doubletons, MAC = 3-5, MAC = 6-10, MAC = 11-20, MAC = 21-8 MAF = 0.5%, MAF = 0.5% -1%, denoted as 1 , 2 , … , 7 , respectively. The total 9 number and thresholds to define the bins can be modified by the user. Within each bin j, the 10 estimated proportion of variants, ∑ ( ( + ) ) ∈ , is compared to the observed proportion of 11 variants in the target data, ∑ ∈ . is the observed proportion of variants with MAC = . 12 Parameter estimates for and are found by minimizing the least squares loss over all bins 13 using SQP 17 constraining > 0, 14 The expected total number of rare variants is calculated by summing across all rare MAC bins. 2 The total number of simulated rare variants ( ) is calculated by summing over all MAC 4 bins. 5 = ∑ , 6 (3) Pruning 7 As described in (1), simulations using HAPGEN2 usually result in a larger total number 8 of simulated rare variants than expected from step (2). Simulated variants are pruned by retuning 9 all or a subset of alternate alleles to reference alleles. Within HAPGEN2 and similar to real 10 haplotypes, rare alleles have a high probability of being on the same haplotype background. 11 Pruning alternate alleles preserves the high likelihood that rare alleles are on the same haplotype. 12 Variants are probabilistically pruned, creating variability over the simulation replicates in the 13 number of variants per MAC bin. 14 RAREsim sequentially prunes variants from high to low MAC bins starting with the 15 highest rare MAC bin that has at least 10% more simulated variants than expected 16 . 17 (1) For MAC bins with more simulated variants than expected, a simulated variant is 18 pruned with probability ( ) , where 19 . 20 For each variant within the bin, RAREsim randomly draws from a Uniform(0,1) 1 distribution. If the draw is within [0, ( ) ], the variant is pruned. The location of 2 a pruned variant is stored to allow variants to be added back at lower MAC bins that 3 have fewer simulated variants than expected, as described in the next section. 4 (2) For MAC bins with fewer simulated variants than expected (i.e. European, N=56,885; South Asian, N = 15,308) (Supplemental Table 1). Data from 20 chromosome 19 was divided into 1 cM blocks; six blocks were merged with the preceding 21 adjacent block (Methods), resulting in 101 blocks for simulation (Supplemental Table 2). 22 The ancestry/sample-size specific fitted Number of Variants function closely matches the 1 observed values for all four ancestries (Figure 2, Supplemental Figure 1). Ninety percent of cM 2 blocks had a relative difference within 2.42% for the estimated vs. observed number of variants 3 per Kb. The average relative difference for all ancestries was −1.10% (90% CI = 4 (−1.17%, −1.02%)), with ancestry specific averages of −0.65% (African, 90% CI = 5 Finnish European, 90% CI = (−1.74%, −1.55%)), and −1.00% (South Asian, 90% CI = 7 (−1.13%, −0.87%)) (Supplemental Figure 2). The negative mean relative differences indicate 8 a slight but systematic overestimation of the number of variants per Kb for most blocks. Within a 9 given target dataset, the Number of Variants function appears to slightly overestimate the 10 observations of larger sample sizes and underestimate the observations of smaller sample sizes 11 (Supplemental Figure 1). 12 Observed and estimated variation in the number of variants per Kb across cM blocks 13 increases with sample size (Figure 2A). However, even for the largest available target data 14 sample size (Non-Finnish European, N=56,885), the variability of the number of variants per Kb 15 remains low with 90% of the block-specific estimates within 35 variants of the median estimate. 16 The AFS function matched the observed data well with no apparent systematic bias. The 17 average absolute difference between the observed and estimated proportion of variants in each 18 MAC bin over all ancestries, blocks, and MAC bins was 0.53% (90% CI = (0.51%, 0.55%)) 19 ( Figure 2B). Ninety percent of the estimated proportions were within 1.3% of that observed. The 20 maximum absolute difference in MAC bin proportion was 4.50% (observed in East Asian, MAC 21 3-5). Singleton counts matched particularly well, with a maximum absolute difference of 0.73%. 22 Despite different ancestries and widely different sample sizes (from N = 8,128 to N = 23 56,885 for African and Non-Finnish European respectively) the proportion of variants per MAC 1 bin were similar (Supplemental Figure 3). There is more variation between ancestry/sample-2 size groups for the total proportion of rare variants (i.e. proportion of all variants with MAF 3 <1%) (Supplemental Figure 4). Regardless, within each ancestry, variation of the AFS between 4 cM blocks remains small. For instance, the MAC bin with the most variation (East Asian 5 singleton bin) has 90% of the blocks having estimated proportions within 6.3% of the median. As expected, we observed more functional SNVs than synonymous 2 , with the largest 10 differences observed at MAC ≤ 5 (Supplemental Figure 15). This resulted in substantially 11 different fitted Number of Variants functions for the two types of variants (Supplemental 12  Table 7). Here we present RAREsim, a rare variant simulation algorithm. Unlike HAPGEN2, 22 which either severely under or over simulates the proportion of very rare variants, RAREsim 23 simulates the expected proportion of rare and very rare variants across a variety of genetic 1 regions, ancestries, and sample sizes. RAREsim produces simulations that match the expected 2 AFS, total number of variants, and haplotype structure while enabling variant annotation. To our 3 knowledge, no other existing simulation software is able to emulate real data in all of these areas. 4 We show that RAREsim's ancestry specific default parameters derived from the coding regions 5 of chromosome 19 generalize to other chromosomes, datasets, sample sizes 19 , and non-coding 6 regions, approximating the number of variants per MAC bin with remarkable accuracy. We offer 7 user flexibility by enabling use of RAREsim with default parameters, user defined parameters, or 8 parameters estimated to match user provided target data. 9 For typical uses of simulated genetic data (i.e. evaluating or comparing methods and 10 general power analysis) we recommend simulating with the default parameters. Default 11 parameters were shown to be robust across sample sizes, chromosomes, coding and intergenic 12 regions, and datasets. It is possible, although we believe unlikely, that the default parameters will 13 perform poorly when used in scenarios not evaluated here. If precise matching of a particular 14 empirical data characteristic such as functional variant type, genetic region, sample size, or 15 ancestry is important, we recommend re-estimating the simulation parameters using RAREsim 16 functions. Additionally, parameters are able to be specified without fitting target data. For 17 example, to simulate a specific ancestry, a user could make an educated decision on the total 18 number of variants based on the relationship to the ancestries evaluated here (e.g. total number of 19 variants between that of African and European). 20 RAREsim simulates haplotypes in the same form as HAPGEN2: hap/leg/sample files 20 . 21 Haplotype files can be converted to vcf files using the SHAPEIT convert command 21 or bcftools 22 haplegendsample2vcf command 22 . Genetic association with disease can be simulated from a 23 sample of generated haplotypes using an existing software such as PhenotypeSimulator 23 . 1 Simulation of families or large pedigrees can be performed with a pedigree simulation software 2 such as ped-sim 24 . 3 It has been shown that sample sizes in the tens to hundreds of thousands are needed to 4 have sufficient power to detect associations with rare variants 25 . Due to the lack of very large 5 (>100,000), ancestry specific, publicly available target data at the time of publication, we could 6 not assess the accuracy of the RAREsim simulations for very large sample sizes. As genetic 7 sequencing resources continue to increase in size, RAREsim will be ideally suited to simulate 8 large sample sizes with estimation of new simulation parameters. For the Number of Variants 9 function, the extrapolated, ancestry specific Number of Variants functions were compared with 10 that of the full sample available in gnomAD v2.1. We believe that the Number of Variants 11 function is able to accurately simulate sample sizes up to what is observed in gnomAD v2.1 12 (~125,000). Alternatively, users can use population-genetics theory or other resources to make 13 an informed decision for the total number of variants expected for very large samples. One such 14 resource is the Capture-Recapture 26 algorithm, which can be used to estimate the number of 15 segregating sites given allele count data. While Capture-Recapture can extrapolate to larger 16 sample sizes, the software cannot be easily used for sample sizes that are smaller than the 17 observed target data. RAREsim does not currently modify the AFS function as sample size 18 increases. Consistent AFS were observed over the gnomAD sample sizes (N = 8,128 -56,88). 19 However, we expect the AFS to deviate with very large sample sizes. A user can update the AFS 20 function parameters if desired, and research is ongoing to estimate the expected AFS for very 21 large sample sizes. We believe that RAREsim can currently accurately simulate sample sizes up 22 to ~125,000. 23 There are several limitations to RAREsim. First, RAREsim is only as good as the data on 1 which the simulations are based. Errors or inconsistencies in the target data or input simulation 2 haplotypes will be propagated through the simulations. Secondly, the default parameters were 3 developed and evaluated on autosomes. A user may fit sex chromosome target data or assume 4 parameters extend to sex chromosomes. Finally, RAREsim requires additional memory (e.g. >32 5 GB RAM) when simulating large regions and sample sizes. For efficient simulation of large 6 regions and sample sizes, highmem computing or breaking up the simulation region into smaller 7 portions and combining after simulation is needed. We are actively working on extensions for 8 these limitations. 9 One of the primary benefits of RAREsim is its ability to match real target data, either 10 provided by the user or as done here with gnomAD v2.1. Matching observed data allows 11 RAREsim to adapt as sequencing data evolves due to technological advances or improved 12 genetic resources from additional ancestral populations and increased sample sizes. For example, 13 RAREsim will be able to approximate TopMED 27 and ALFA, aggregated allele frequencies from 14 dbGaP 28 once these resources are released. RAREsim can also simulate unique characteristics of 15 a specific genetic region such as haploinsufficiency, contribution to a polygenic risk score, or 16 selection. The flexibility of RAREsim to emulate real data allows users to assess methods and 17 complete power analyses for in relevant and realistic genetic regions and samples. 18 19 Acknowledgements 20 The UK Biobank data was gathered using the UK Biobank Resource under Application 21 Number 42614. We would like to thank Achilleas Pitsillides for obtaining the UK Biobank allele 22 counts. We would also like to thank Robert Goedman for providing programming insight. We 23 thank Dr. Ferdinand Baer for support of this project. This work was supported by the National 1 Human Genome Research Institute (R35HG011293 and U01HG009080 to AEH and CGR; 2 U01HG009080-05S1 to CGR). 3 4 Methods 5

Input Simulation Datasets 6
For the input simulation dataset, we used haplotypes, legend files (an accompanying 7 variant list), and a recombination map from 1000 Genomes Phase 3 (hg19) 3 . The files were 8 modified to include information at each sequencing base. The recombination map was derived 9 from the combined sample of all ancestries (OMNI) 3 . Links to these resources are in 10 Supplemental Table 8. 11 For the simulation haplotypes, African, East Asian, Non-Finnish European, and South 12 Asian global ancestries were used with admixed African samples (African Caribbeans in 13 Barbados (ACB) and Americans of African ancestry in Southwest Utah (ASW)) excluded. 14 HAPGEN2 simulates biallelic SNVs. Hence, we omitted indels present in the 1000G haplotypes. 15 For multiallelic SNVs, the first alternate allele in the legend file with at least one observed 16 alternate allele was kept.  Table 8). 13 We observed slight discrepancies for some regions in the total number of variants between the 14 gnomAD v2.1 data used for the AFS target data and the gnomAD downsamplings data used for 15 the Number of Variants target data (Supplemental Figure 19). Differences in the number of 16 variants per gene likely arise due to inconsistencies between the two datasets with respect to 17 classification of variant function, removal of variants in overlapping genes, as well as other 18 differences. The discrepancies did not substantially affect the simulation results, as shown when 19 the simulated haplotypes are compared to the gnomAD data (Results). 20 21

Centimorgan Blocks 22
Chromosome 19 was divided into 1 cM blocks for simulation. The cM blocks were 1 defined using the 1000 Genomes Project recombination map estimated from the combined set of 2 all ancestries 3 (Input Simulation Datasets). Blocks were restricted to the coding region of the 3 canonical transcript for each gene. Genes that overlapped multiple blocks or were between cM 4 blocks (the recombination map did not contain information at all bp) were included with the 5 previous cM block. Of the 107 cM blocks, two blocks did not meet the requirement of containing 6 at least two genes (blocks 17 and 23). Additionally, there were four blocks (blocks 8, 50, 57, and 7 92) with fewer than 100 SNVs in at least one ancestry in the gnomAD target data. These six 8 blocks were merged with the preceding adjacent block, resulting in 101 blocks for simulation 9 (Supplemental Table 2). Blocks ranged from 3,183 bp to 81,253 bp (median = 19,029; Q1 = 10 11,037; Q3 = 27,204). 11 12

Implementation of RAREsim 13
RAREsim was implemented for a genetic region of interest using the computing 14 flowchart shown in Supplemental Figure 20. First, the input haplotype and legend files were 15 modified to include all bp, including monomorphic bases. Then, haplotypes were simulated for 16 each cM block using HAPGEN2 13 with default parameters. The relative risk was set to 1.0, and 17 hence no disease loci were simulated. The random seed in HAPGEN is set by time. Therefore, 18 simulation replicates cannot be run in parallel across multiple cores starting at the same time. To 19 avoid this, we simulated replicates for the same simulation scenario on the same computing core 20 in series. Alternatively, the pause Bash command could be used when simulating haplotypes in 21 parallel. Then, the expected number of variants per MAC bin was calculated using the 22 expected_variants RAREsim function with either default simulation parameters or region-23 specific simulation parameters estimated using fit_afs and fit_nvariants functions. Finally, the 1 simulated haplotypes were pruned using the prune_variants function to identify pruning 2 locations and a supplemental Bash script to efficiently prune the identified locations. regions on chromosomes 1, 6, and 9 that were chosen to be representative of the genome 18 17 (Supplemental Table 9). As with the blocks on chromosome 19, the regions were restricted to 18 canonical coding exons. These blocks were each 500 Kb, but when restricted to the coding 19 region were 24,918; 12,519; and 17,051 bp on chromosomes 1, 6, and 9 respectively. 20 Whole genome sequencing data from gnomAD v3 was used to evaluate the performance 21 of default parameters for different sample sizes, and for intergenic regions.

Code Availability 5
RAREsim is an open-source R package, and all code can be found at 6 https://github.com/meganmichelle/RAREsim. A small example simulation with the necessary 7 script is available at https://github.com/meganmichelle/RAREsim_Example. Code to complete 8 the majority of the analyses included here can also found at 9 https://github.com/meganmichelle/RAREsim_Example.