Genome-wide association, prediction and heritability in bacteria

Advances in whole-genome genotyping and sequencing have allowed genome-wide analyses of association, prediction and heritability in many organisms. However, the application of such analyses to bacteria is still in its infancy, being limited by difficulties including the plasticity of bacterial genomes and their strong population structure. Here we propose a suite of genome-wide analyses for bacteria that combines methods from human genetics and previous bacterial studies, including linear mixed models, elastic net and LD-score regression. We introduce innovations such as frequency-based allele coding, testing for both insertion/deletion and nucleotide effects and partitioning heritability by genome region. Using a previously-published large cohort study, we analyse three phenotypes of a major human pathogen Streptococcus pneumoniae, including the first analyses of minimum inhibitory concentrations (MIC) for each of two antibiotics, penicillin and ceftriaxone. We show that these are very highly heritable leading to high prediction accuracy, which is explained by many genetic associations identified under good control of population structure effects. In the case of ceftriaxone MIC, these results are surprising because none of the isolates was resistant according to the inhibition zone diameter threshold. We estimate that just over half of the heritability of penicillin MIC is explained by a known drug-resistance region, which also contributes around a quarter of the heritability of ceftriaxone MIC. For the within-host survival phenotype carriage duration, no reliable associations were found but we observed moderate heritability and prediction accuracy, indicating a polygenic trait. While generating important new results for S. pneumoniae, we have critically assessed existing methods and introduced innovations that will be useful for future large-scale population genomics studies to help decipher the genetic architecture of bacterial traits. Author summary Genome-wide association, prediction and heritability analyses in bacteria are beginning to help unravel the genetic underpinnings of traits such as antimicrobial resistance, virulence, within-host survival and transmissibility. Progress to date is limited by challenges including the effects of strong population structure and variable recombination, and the many gaps in sequence alignments including the absence of entire genes in many isolates. More work is required to critically asses and develop methods for bacterial genomics. We address this task here, using a range of existing methods from bacterial and human genetics, such as linear mixed models, elastic net and LD-score regression. We adapt these methods to introduce new analyses, including separate assessment of gap and nucleotide effects, a new allele coding for association analyses and a method to partition heritability into genome regions. We analyse within-host survival and two antimicrobial response traits of Streptococcus pneumoniae, identifying many novel associations while demonstrating good control of population structure and accurate prediction. We present both new results for an important pathogen and methodological advances that will be useful in guiding future studies in bacterial population genomics.

The ability to perform genome-wide analyses of DNA variations has enabled detailed virulent serotypes. Several population genomic studies have characterized central epidemiological traits of the pneumococcus, including duration of carriage and 20 resistance to commonly used antibiotics. 21 In a pioneering study, Lees et al. [3], found high heritability of the duration of carriage of S. pneumoniae in human hosts. Furthermore, the strong genetic control of 23 the binary trait antimicrobial resistance (AMR) is well established from genome-wide 24 association studies (GWAS) [5][6][7][8]. The quantitative trait minimum inhibitory 25 concentration (MIC) has previously been studied in Mycobacterium tuberculosis [9], but 26 not in S. pneumoniae. 27 We critically assess available methods for association, prediction and heritability 28 analyses, and propose novel developments, which we use to investigate carriage duration 29 (CD), ceftriaxone MIC and penicillin MIC in S. pneumoniae, finding many new 30 associations and high predictive accuracy for the two MIC traits. Given the increasing 31 availability of large-scale bacterial GWAS, the developments presented here will provide 32 a useful guide to future studies. 33 Materials and methods 34 Source of data 35 The present study is based on nasopharyngeal swab data collected monthly from infants 36 and their mothers in the Maela refugee camp in Thailand between 2007 and 2010 [10].
Following [3], we implemented a hidden Markov model, using the R package msm [14], to obtain maximum-likelihood estimates of CD values. Due to differences in immune 48 response to bacterial infections between adults and infants [15], only data from infants 49 were used for CD analyses, but we analysed all MIC values regardless of the host. To 50 obtain approximate normal distributions, we log-transformed all three phenotypes (see 51 S1 Fig for histograms). 52 Preparation of genetic data 53 We used a published dataset [5] of high quality genome sequences from 2 663 isolates, 54 manually selected and aligned to the ATCC700669 reference genome using the snippy 55 pipeline version 4.4.0 [16], with minimum coverage set at the default 10 reads. Of these, 56 1 612 isolates were sampled during S. pneumoniae positive episodes, on average 1.5 (SD 57 1.0) isolates per episode. For the 337 episodes represented by > 1 genome sequence, we 58 used the sequence from the last isolate sampled. This resulted in 1 047 sequenced CD 59 episodes in 370 host infants (mean 2.8, SD 1.9 episodes per host). The median CD was 60 64 days, with mean 110 and SD 102. MIC data for both penicillin (mean 0.57, SD 0.48 61 µg ml −1 ) and ceftriaxone (mean 0.36, SD 0.28 µg ml −1 ) were available for 1 332 isolates, 62 of which 554 also have a CD episode. SNP-sites version 2.5.1 [17] and VCFtools version 63 0.1.16 [18] were used to identify 239 176 variant sites in the CD dataset, and 215 892 in 64 the MIC dataset. Mapping of association hits to the ATCC700669 reference genome Working inwards from the outer circle showing basepair positions along the genome, the subsequent circles show the distributions of gap and minor allele frequencies in the MIC dataset, annotated core genes (in black), and SNPs associated with ceftriaxone MIC and penicillin MIC according to the gap test (blue) and SNP test (red). Figure prepared using circos [21]. leads to strong population structure, which is challenging for association 117 analyses [22,23]. Population structure refers to groups of individuals (sub-populations) 118 with greater genetic similarity among them than with other individuals, which causes 119 genome-wide genetic correlations that can confound association signals. Sub-populations 120 may also differ in environmental exposures, which can compound the problem.

121
There is no complete solution to the problems caused by population structure, and 122 attempts to address them risk discarding true as well as spurious signal. Most 123 approaches introduce either covariates or a genetic random effect into association 124 models to absorb signals that can be explained by population structure, which then do 125 not contribute to association statistics. The variance-covariance matrix G of a genetic 126 random effect is assumed known a priori based on measures of similarity between pairs 127 of sequences.

128
Sequence clusters can be used to define either G, via cluster distances, or population 129 structure covariates via indicators of cluster membership. Clustering can proceed by 130 constructing a phylogenetic tree that models the evolutionary history of the 131 sequences [24], with nodes of the tree used as cluster identifiers and branch lengths used 132 to define cluster distances. We inferred maximum-likelihood phylogenies of both CD 133 and MIC datasets using IQTree version 2.0.6 [25] under the general time reversible 134 model, with discrete Gamma (+G option) base substitution rates across sites (Fig. 2). 135 The model assumes no recombination, which is false for S. pneumoniae, and 136 consequently the usefulness of the resulting phylogeny has been questioned [26].

137
FastBAPS, which extends hierBAPS, [30][31][32] was also used to cluster the isolates, 138 without reference to a phylogeny. This approach generates an initial clustering using 139 between-variant pairwise distances based on Ward's method [33], then an optimal set of 140 clusters is identified using Bayesian hierarchical clustering [34].

141
In human studies, G was in the past computed from known pedigrees [35] and now 142 usually as a genome-wide average allelic correlation [36]. For bacteria, G can be defined 143 using allelic correlations under any 1 df allele coding. Despite the success of this . Plots generated after midpoint rooting using R packages ape [27], phytools [28] and ggtree [29].
approach in human studies, our preliminary analyses could not identify an allele coding 145 that led to good control of population structure effects, although using the gap 146 presence/absence binary indicator gave the best results among those we tried.

147
Conversely, despite the questionable validity of the phylogeny due to it ignoring 148 recombination, defining G in terms of lengths of shared phylogenetic branches [37] led 149 to good control of population structure, as evidenced by QQ plots.

150
Linear mixed model (LMM) analyses 151 We wish to test b j = 0 within the LMM [38]: where y is a length-n phenotype vector, x j is the vector encoding alleles at the jth 153 variant, and u and are random vectors of genetic and environmental effects, with I the 154 n × n identity matrix.

155
Pyseer [39] has recently been widely used in bacterial GWAS, and an extensive 156 summary of its models with performance benchmarking is available [40]. The Pyseer the gap sequences at each SNP test. To overcome this problem, we used a two-stage 161 LMM/GLS pipeline for the SNP test, similar to EMMAX [42], in which the phenotype 162 for association testing was the residual from fitting (1) with b j = 0. This first 'LMM' 163 stage was performed using lme4qtl [35]. The b j were then estimated in a second stage 164 using generalised least squares regression (GLS). In the CD analyses for the SNP test, 165 we were able to incorporate an extra random effect to model shared host in the 166 LMM/GLS pipeline, but for the gap and major-allele tests performed using 167 Pyseer-LMM, this was replaced by a binary covariate indicating previous carriage. 168 Accessory genome genes were tested using the LMM/GLS pipeline, with a single test 169 based on standardised gene counts.  For each test, a significance threshold is estimated from null simulations of genetic data 185 at 10 times as many sites as the observed dataset.
The Pyseer wg-enet prediction model is based on glmnet [45]. to have mean zero, the ith phenotype value is predicted byb T x i , where x i is the vector 196 of allele indicators for the ith sequence, and We use cross-validation (CV) to optimise λ, which controls the penalty on large b values. 198 When λ = 0 we have weighted least-squares regression, while increasing λ introduces 199 bias to reduce overfitting. By default, both Pyseer and our pipeline set α = 0.01.

200
Although this value is close to that for ridge regression (α = 0), which retains all 201 predictors in the model, it is large enough that only about 10% ofb entries are non-zero. 202 Ten-fold (10F) and leave-one-strain-out (LOSO) [46] CV were used to assess 203 prediction accuracy. Whereas 10F selects the training sets randomly, which can lead to 204 instances of high similarity between test and training sequences, LOSO is a more the proportion of phenotype variance explained by the model [46].
We also estimate h 2 using a modification of the LDSC model [47]: Here, S j is the association test statistic at variant j, and r jk is the sample correlation of 214 frequency-based allele codes at variants j and k (or gene counts for the accessory 215 genome). Following [48], prior to computing pairwise Pearson correlation coefficients we 216 further transformed the allele codes using Gaussian quantile normalisation.

217
The score l j involves a sum over the whole genome. In human genetics applications 218 only a neighbourhood of j is included, but the presence of genome-wide LD in S.

219
pneumoniae makes it difficult to define a suitable neighbourhood. The definition of l j 220 also incorporates a bias adjustment [47] that can lead to l j < 0, but typically l j 1. To 221 account for heteroskedasticity and correlations among the S j , the least-squares 222 estimation of A and h 2 g in (3) used weights 1/max(1, l j ).

223
When choosing the testing method to generate the S j for LDSC, we found that the 224 very strong population structure effects distort the LDSC regression relationship in the 225 absence of any adjustment, yet a fully effective adjustment for population structure was 226 also unsatisfactory because it removed informative signal. The best compromise that we 227 could identify between inadequate control for population structure effects and loss of 228 association signal with effective control, was to compute the major-allele test statistic 229 S j in the fixed effect model (FEM): where v is the first principal component (PC) of the sequence distances (explaining 231 > 90% of genetic variation) and a is the corresponding effect size. For the CD analyses, 232 we also included the previous carriage covariate in (4). We note again that v does not 233 remove all population structure effects and the S j tend to be inflated, but this is not Code is available at https://github.com/Sudaraka88/bacterial-heritability and 251 access details for the genetic data are provided in S1 File.

253
Carriage duration (CD)  Despite the lack of associations for CD, prediction accuracy ( the previously-reported phage hit based on k-mer analysis [46]. However, our 4 hits are 273 due to the same 15 isolates, of which 6 are from the same long (517 day) episode (see 274 detailed results in S1 Appendix). Furthermore, when the all-isolates dataset was 275 analysed using treeWAS, 9 associations were identified (see S3 Appendix), but these did 276 not include purF and polA (reported above) nor the region identified in our LMM 277 analyses. We conclude that we are unable to reliably identify individual associations for 278 CD, but there is good evidence for it being a moderately-heritable polygenic trait.  with these highly-heritable, polygenic traits. For ceftriaxone MIC, the largest statistics are of similar magnitude for gap and SNP 292 tests (Fig. 6), but for low allele frequencies there are few large gap statistics and many 293 large SNP statistics, suggesting that there are few rare deletions, but many rare

307
As expected from the large number of associations, prediction accuracy for both 308 MIC phenotypes is very high under 10F CV (Table 1), but less so for LOSO CV, with 309 high SE values for penicillin MIC indicating hard-to-predict clusters (S14 Fig). 310 The LMM and wg-enetĥ 2 agree closely across the two MIC phenotypes. The 311 wg-enet values are about 5% lower ( true phenotype under 10F and LOSO CV, respectively).

341
The innovations in our association analysis pipeline include separate testing of gap 342 and SNP effects, with a permutation approach to control FWER and frequency-based 343 allele coding. This approach performed better than the alternatives of major-allele 344 LMM and treeWAS tests, detecting more associations under good control of population 345 structure effects.

346
Our phenotype prediction analysis used frequency-coded variants within a 347 glmnet-based whole genome elastic net model.

348
The previous analysis on CD using data from the same study [3], provided a 349 lower-bound h 2 estimate of 0.45 using warped-lmm [51], concluding that CD is a highly 350 heritable trait. Our estimates are lower, which may be due to our use of only one isolate 351 per CD episode (S1 Appendix). where the largest reduction in h 2 (measured using GEMMA [52]) was only 27% [9], 361 which is close to our result for ceftriaxone MIC.

362
Our results support the use of linear mixed models for association analysis, with 363 separate testing of gap and SNP effects, the latter using frequency-based allele coding. 364 We also support the use of wg-enet for prediction of quantitative traits and we find that 365 LDSC performs well for heritability analyses but further work is required to assess 366 optimal strategies for dealing with strong population structure in bacterial genomes.  Table 2.   Table 1 and S2 Appendix are computed using all 438 values shown here.

439
(PDF) 440 S1 File. Metadata for S. penumoniae isolate reads used in this study.