In-situ genomic prediction using low-coverage Nanopore sequencing

Most traits in livestock, crops and humans are polygenic, that is, a large number of loci contribute to genetic variation. Effects at these loci lie along a continuum ranging from common low-effect to rare high-effect variants that cumulatively contribute to the overall phenotype. Statistical methods to calculate the effect of these loci have been developed and can be used to predict phenotypes in new individuals. In agriculture, these methods are used to select superior individuals using genomic breeding values; in humans these methods are used to quantitatively measure an individual’s disease risk, termed polygenic risk scores. Both fields typically use SNP array genotypes for the analysis. Recently, genotyping-by-sequencing has become popular, due to lower cost and greater genome coverage (including structural variants). Oxford Nanopore Technologies’ (ONT) portable sequencers have the potential to combine the benefits genotyping-by-sequencing with portability and decreased turn-around time. This introduces the potential for in-house clinical genetic disease risk screening in humans or calculating genomic breeding values on-farm in agriculture. Here we demonstrate the potential of the later by calculating genomic breeding values for four traits in cattle using low-coverage ONT sequence data and comparing these breeding values to breeding values calculated from SNP arrays. At sequencing coverages between 2X and 4X the correlation between ONT breeding values and SNP array-based breeding values was > 0.92 when imputation was used and > 0.88 when no imputation was used. With an average sequencing coverage of 0.5x the correlation between the two methods was between 0.85 and 0.92 using imputation, depending on the trait. This demonstrates that ONT sequencing has great potential for in clinic or on-farm genomic prediction. Author Summary Genomic prediction is a method that uses a large number of genetic markers to predict complex phenotypes in livestock, crops and humans. Currently the techniques we use to determine genotypes requires complex equipment which can only be used in laboratories. However, Oxford Nanopore Technologies’ have released a portable DNA sequencer, which can genotype a range of organisms in the field. As a result of the device’s higher error rate, it has largely only been considered for specific applications, such as characterising large mutations. Here we demonstrated that despite the devices error rate, accurate genomic prediction is also possible using this portable device. The ability to accurately predict complex phenotypes such as the predisposition to schizophrenia in humans or lifetime fertility in livestock in-situ would decrease the turnaround time and ultimately increase the utility of this method in the human clinical and on-farm settings.

In livestock and crops, genomic predictions are termed genomic estimated breeding values (GEBVs) and are used to increase the accuracy and intensity of selection in a process referred to as genomic selection [6]. The benefits of genomic prediction include more accurate selection of complex traits and a decrease in generation interval, which ultimately leads to accelerated genetic gain, evidenced in the poultry [8] and dairy industries [9]. The method allows for the polygenic nature of complex traits by genotyping a number of markers spread evenly throughout the genome and assigning an effect to each marker.
Today, almost all livestock species have SNP arrays designed to enable genomic selection. In Australia, the turnaround time between sample collection and GEBV results is between 6-8 weeks. This turnaround time has prevented the adoption of genomic selection in Australia's northern beef industry, where cattle are generally only handled once a year for a handful of days. This means management decisions based on the GEBV results from SNP array genotyping cannot be implemented until the following year, when the cattle are handled again. There is also significant demand for faster GEBV turnaround time in the sheep and southern beef industries to allow point of management decisions, for example, allocation of feeding regimes based on genetic potential.
In human genetics, complex disease traits have been the focus of this type of quantitative genetics, although pharmacogenetics also holds some potential. For complex diseases such as bipolar disorder, Type II Diabetes, and Crohn's disease [10] polygenic risk scores are used to evaluate an individual's genetic predisposition to the disease. Poly-genic risk scores are based on the genome wide association study (GWAS) correlation between significant SNP markers and the disease [11]. To-date the largescale utility of polygenic risk scores for individuals has been limited [5]. This has been attributed in part to the current turnaround time, and cost. Instead, family history is used to estimate an individual's predisposition to a particular disease.
However, family history information has limited utility for determining relative risk for complex diseases in some cases. For example, in schizophrenia a family history of the disease is reported in less than a third of cases [12]. A Swedish national study [13] reported family history of the disease in only 3.81% of cases, taking into account first-, second-and third-degree relatives. Moreover, despite half the genetic variance for schizophrenia occurring within family, siblings with a family history of the disease are given the same risk using family history alone. By genotyping the individual, polygenic risk scores far more accurately report an individual's genetic risk than family history alone. In the case of schizophrenia, this information could be used to differentiate between individuals vulnerable to environmental risk factors [14].
Studies exploring the polygenic nature of pharmacological response to chemotherapeutics [15] and asthma treatments [16] have also yielded promising results. Decreasing the turnaround time of genotyping could increase the utility of polygenic risk scores for individuals. For example, a clinician may like to know the genetic risk of a particular disease to assess the priority of diagnostic tests, predict the efficacy of a potential drug treatment, or predict the patient's likelihood of having a severe adverse reaction to a particular treatment, as quickly as possible before the treatment is due to be administered. Decreased turnaround time for poly genic risk scores would also help clinicians diagnose complex diseases in patients while in their prodromal phase, which is vital in early intervention [14].
Genomic prediction in both medicine and agriculture has previously relied heavily on SNP array technology. SNP arrays are a low-cost method to genotype thousands to hundreds-of-thousands of genetic markers. Once costing $400 USD for 10,000 SNP genotypes, SNP arrays now cost less than $50 USD for 50,000-800,000 SNP genotypes. However, the technology requires large, expensive laboratory equipment. Therefore, the turnaround time is, at a very optimistic minimum, the time for a sample to reach the laboratory.
Recently, there has been a rise in the popularity of genotyping-by-sequencing. A widely applied approach is to use restriction enzymes to reduce the complexity of genomes for low-coverage sequencing on short-read sequencing platforms [17]. This method is not only becoming cheaper than SNP array genotyping, but it also allows for simultaneous genome wide marker discovery. These advantages have allowed an increased number of traits and associated variants to be studied [4], which has resulted in the ability to accurately predict many new complex traits from genotypes. Still, genotyping-by-sequencing is laboratory based and requires expensive equipment. A potential solution to reduce the turnaround time of genomic prediction while incorporating the benefits of genotypingby-sequencing, is to use Oxford Nanopore Technologies' (ONT) portable nucleotide sequencer, the The MinION could be used to sequence samples in-situ for genomic prediction in livestock and human settings. Despite initial reports of poor sequencing accuracy and yield, steady developments in flow cell chemistry, library preparation and base calling algorithms has seen reported sequencing yields increase from less than 3 GB to greater than 40 GB and sequencing accuracy increase from 68.4% [19] to over 98% [20,21]. A major advantage of ONT sequencing technology is the ability for ONT reads to map more accurately to complex genomic regions [22]. This is largely a result of the read length produced by the technology, which has no theoretical limit. This reduces read mapping bias [23] and eliminates the need for restriction enzyme digestion used for genotyping-by-sequencing with short read sequencing technology.
The aim of this study was to evaluate the accuracy of genomic predictions calculated from lowcoverage ONT sequence data and compare the results to SNP array-based predictions. We calculated correlations and prediction bias for ONT genomic predictions against the SNP array predictions for various sequencing depths (4x, 2x, 1x, 0.5x) and tested three different methods of imputing missing genotypes. Our results suggest ONT's MinION sequencer could be a useful tool for in-situ genomic prediction, with human clinical or on-farm applications. On average 86.4% ± 0.6 of reads were effective (i.e., mapping quality Phred score > 0). For each animal, a random subset of sequence data representing 4x, 2x, 1x, and 0.5x sequencing coverage of the 2.7 Gb bovine genome was used for genomic predictions.

Sequencing
A minimum allele observation based genotyping approach was used to genotype 641,163 SNP markers. This method accounted for the random probability of sampling alleles in a diploid species by grouping loci with similar coverage together and calling genotypes using a minimum allele count specific to each coverage group. Three methods (non-imputed, imputed-AF & imputed-Beagle) were tested to genotype loci with less than one overlapping ONT read (where only one read was observed 8 all heterozygous loci were called as homozygous regardless). Although the first two methods are obviously not accurate, they are very fast which could be useful on farm or in a clinical setting. Before imputation, at 0.5x sequencing coverage 64% of ONT genotypes were correctly called, 34% were incorrect in one allele (i.e., called homozygous rather than heterozygous) and 0.02% were incorrect in both alleles (opposing homozygous; Figure 1). At 4x sequencing coverage 74% of calls were correctly called in both alleles, 24% were called incorrectly in one allele and 2% were called incorrectly in both alleles.
Figure 1: Proportion of genotype calls with both alleles correct (2), one allele incorrect (1) and two alleles incorrect (0) at different sequencing coverages. Two correctly called alleles (green) indicates no difference between the ONT genotype and SNP array genotype. One correctly called allele (blue) indicates one allele of the ONT genotype is correct and the other is incorrect. If both alleles are incorrect the (i.e., the alternate homozygous has been called) the number of correctly called alleles is zero (red).

Marker Effects
Marker effects (BLUP solutions) for predicting GEBV were derived from 26,145 female cattle The non-imputed genotyping method, where uncalled loci were assigned a homozygous reference 3 genotype, was as expected the least accurate for genomic prediction across all traits ( Figure 3). The 4 ONT non-imputed genomic predictions had reasonable correlations with the SNP array predictions 5 from 4x down to 1x sequencing coverage (0.98 -0.95 at 4x and 0.95 -0.85 at 1x). These correlations 6 dropped significantly at 0.5x sequencing coverage, 0.65 for body weight, 0.81 for CL600, 0.72 for body 7 condition score and 0.69 for hip height (Figure 3). For sequencing coverages below 4x the prediction 8 bias (regression coefficient of SNP array GEBV on ONT GEBV) for the non-imputed method was 9 consistently greater than 1, indicating the ONT GEBV are under-estimating compared to the SNP array 10 GEBV. At 2x sequencing coverage the bias ranged between 1.1 for body weight and 1.54 for body 11 condition score (Figure 4; Table 2). At 0.5x sequencing coverage the bias for these two traits increased 12 to 1.45 for body weight and 1.79 for body condition score. For the CL600 trait the bias at 0.5x was 2.7. 13 14 Genomic prediction accuracies based on imputed-AF genotypes 15 The correlation coefficients for the imputed-AF method, which used the population allele frequency 16 to genotype missing loci, were the highest of the three methods. At 4x sequencing coverage the 17 correlations were 0.99 for body condition score & CL600, 0.96 for body weight and 0.97 for hip height. 18 This dropped slightly at 2x sequencing coverage, down to 0.95 for body weight, 0.98 for CL600, 0.97 19 for body condition score and 0.95 for hip height. The correlations remained above 0.9 for all traits at 20 1x sequencing coverage except for hip height, which was 0.88. At 0.5x sequencing coverage the 21 correlations for this method remained above 0.8 with the lowest being 0.84 for body condition score. 22 The prediction bias for this method was consistently greater than 1, ranging between 1.65 for hip 23 height and 1.96 for body condition score at 1x sequencing coverage. This increased to between 2.2 24 and 3.1 for hip height and CL600 respectively at the lowest sequencing coverage. 25

Genomic prediction accuracies based on imputed-Beagle genotypes 27
The final genotyping method, which used the imputation package Beagle v5.1 [25], had correlations 28 greater than 0.85 for all traits at sequencing coverages as low as 0.5x. At 4x sequencing coverage the 29 correlations for body weight, CL600, body condition score and hip height were 0.97, 0.99, 0.98 and 30 0.98 respectively. The correlations decreased slightly with the decrease in sequencing coverage to 31 0.85 for body weight, 0.91 for CL600, 0.89 for body condition score and 0.85 for hip height at 0.5x 32 sequencing coverage. Although the correlations for this genotyping approach were, on average, not 33 as high as the imputed-AF approach, the prediction bias was significantly reduced. The prediction bias 34 at 4x sequencing coverage was around 1 for all traits and decreased slightly as the sequencing 35 coverage decreased. The decrease in prediction bias was most notable in the body weight and hip 36 height traits, which had a bias of 0.88 and 0.74 respectively at 0.5x sequencing coverage. For the same 37 sequencing coverage CL600 and body condition score has a bias of 1.02 and 1.06. Nonimputed Imputed-AF Imputed-Beagle  to compare genomic predictions from ONT data to SNP array genotyping, the current standard for 50 genomic prediction. We investigated the accuracy of the ONT genomic predictions at various 51 sequencing coverages (4x, 2x, 1x, and 0.5x) as well as using three different methods to genotype loci 52 with no sequencing coverage: non-imputed, imputed-AF and imputed-Beagle. Beagle The non-imputed genotyping method performed significantly worse than either of the two imputation 90 methods. This is likely because the missing genotypes were assigned as homozygous reference and 91 therefore received no marker effect, because a homozygous reference is coded as a 0 in the genotype 92 matrix, thereby giving 0 × the alternative allele effect. As the average sequencing coverage decreased 93 the number of missing loci increased and therefore the breeding values were under-estimated at low 94 GEBV for the non-imputed method as the sequencing coverage decreased. While not unexpected, this 96 phenomenon has important implications: individuals with deeper sequencing coverage will have 97 higher genomic prediction values, as more of their loci will be represented as non-zero in the genotype 98 matrix. Therefor this method is unsuitable for any circumstance where the sequencing coverage is not 99 tightly controlled. [36] who investigated the effect of imputation error on genomic predictions in 3,494 dairy cows 106 genotyped on a low-density SNP array. They concluded this type of bias was an artifact of imputation 107 algorithms suggesting the most frequent haplotype within a population whenever a haplotype cannot 108 be observed unambiguously. A similar bias was likely seen here because assigning the missing 109 genotypes using allele frequencies (i.e., the average genotype within the population) has regressed 110 the breeding value of each animal toward the population mean. At high coverages (4x and 2x) this is 111 less prominent because there are fewer missing genotypes. However, at the low coverages (1x and 112 0.5x) a significant number of markers are missing ONT genotypes and therefore the regression toward 113 the mean for animals is more prominent. Therefore, this method is likely only useful when within-114 group rankings are of interest, and not absolute predictions: for example, if a producer wanted to 115 identify the top 30% of animals for a certain trait within their herd. 116

117
The Beagle [25] imputation method for genotyping missing loci was the overall most accurate method 118 when considering both correlation and prediction bias. Although the correlations for some traits were 119 not as high as using the allele frequencies, the prediction bias was consistently close to 1 across all likely because imputation with Beagle [25] uses linkage disequilibrium between genotyped markers 122 to establish haplotypes. This means that the missing genotypes are imputed using the full extent of 123 available information from the genotyped data (i.e., nearby markers genotyped with ONT reads). This 124 approach is very different to imputing the missing genotypes using the allele frequency without 125 considering nearby ONT genotyped markers. Imputation methods such as this are likely the most 126 useful approach, despite the added time required to run the imputation program, which is currently 127 substantially less time than required to produce the sequence data. have reported here and may allow for the calculation of accurate genomic predictions from ultra-low 146 (less than 0.5X) sequence coverage, making the method more cost effective. 147 148 Targeted sequencing methods also hold enormous potential to increase the accuracy of genomic 149 prediction using ONT sequencing. A promising method for in-situ applications is ONT's adaptive 150 sequencing, which can target loci within a genome for sequencing without molecular intervention [41, 151 42]. This method only requires the location of genetic markers and no additional laboratory steps, 152 therefore would be ideal for on-farm or clinical applications of genomic prediction with the MinION 153 [41,42]. Three methods were used to assign genotypes at loci with less than 2x coverage: non-imputed, 241 imputed-AF and imputed-Beagle. The non-imputed approach assigned missing loci as homozygous 242 reference, coded as 0 in the genotype matrix. The imputed-AF method used the allele frequency 243 within the population, calculated from SNP array data, to genotype the missing loci. The sum of the 244 probabilities of each genotype given Hardy-Weinberg equilibrium multiplied by the genotype codes, 245 0 for homozygous reference, 1 for heterozygous and 2 or homozygous alternate were used to fill 246 missing genotypes. Using continuous genotype values between 0 and 2 for this method allowed some 247 accounting for the uncertainty of genotyping from low coverage sequence data. Finally, the imputed- Linear models were used to evaluate the correlation between â and b, being the estimated genetic 279 value derived from ONT and SNP array, respectively. The correlation between â and b was reported 280 as well as the prediction bias (regression coefficient of the SNP array-GEBV on the ONT-GEBV). 281 282 Linear models were also used to evaluate the effect of read length, base quality, effective mapping 283 percentage and number of reads mapping with Phred mapping quality 0 on ONT genomic prediction 284 accuracy. To evaluate each covariate, y was calculated as the residuals from the model b ̂~ â + e, 285 where â was an array of SNP-array based genomic predictions (from above), and b ̂ was an array of 286 ONT-based genomics predictions (from above). The significance of each covariate (read length, base 287 quality, effective mapping percentage or number of reads mapping with Phred mapping quality 0) on 288 the genomic prediction accuracy from ONT sequence data was modelled using: 289 ~ + e 290 Where was an array of the covariate values (read length, base quality, effective mapping percentage 291 or number of reads mapping with Phred mapping quality 0) as a random effect; and y was the 292 difference between the SNP-based genomic prediction and the ONT-based genomic prediction. 293 294 295