Geographic migration and vaccine-induced fitness changes of Streptococcus pneumoniae

Streptococcus pneumoniae is a leading cause of pneumonia and meningitis worldwide. Many different serotypes co-circulate endemically in any one location. The extent and mechanisms of spread, and vaccine-driven changes in fitness and antimicrobial resistance (AMR), remain largely unquantified. Using geolocated genome sequences from South Africa (N=6910, 2000–2014) we developed models to reconstruct spread, pairing detailed human mobility data and genomic data. Separately we estimated the population level changes in fitness of strains that are (vaccine type, VT) and are not (non-vaccine type, NVT) included in the vaccine, first implemented in 2009, as well as differences in strain fitness between those that are and are not resistant to penicillin. We estimated that pneumococci only become homogenously mixed across South Africa after about 50 years of transmission, with the slow spread driven by the focal nature of human mobility. Further, in the years following vaccine implementation the relative fitness of NVT compared to VT strains increased (RR: 1.29 [95% CI 1.20–1.37]) – with an increasing proportion of these NVT strains becoming penicillin resistant. Our findings point to highly entrenched, slow transmission and indicate that initial vaccine-linked decreases in AMR may be transient.


by the Wits Vaccines and Infectious Diseases
Analytics Research Unit (Wits-VIDA). A random sample of 400 carriage isolates from each year were chosen for genome sequencing (Table S2). We included both carriage and invasive disease isolates and conducted sensitivity analyses throughout to confirm the methods were robust to both carriage and invasive disease individually. Invasive disease is defined as the bacterium being isolated from a typically sterile site. The majority of total isolates were from children aged <5 years (80.5%), while 13.8% were from children aged 5-18 years and only 5.7% were from adults (>18 years). These were distributed across the pre-and post-PCV periods. We additionally included similar GPSCs from the GPS database (for context within the global population) for the relative risk analysis. We utilized metadata including collection year and month, residence province of the patient, age of the patient, sampling site, and clinical manifestation. The range of sampling sites include nasopharyngeal swabs (for carriage); blood, pleural fluid, cerebrospinal fluid, peritoneal fluid, pus, and other joint fluid (for invasive disease).

Population Data
We estimated the population for each municipality (N=234) across South Africa using the populations-size estimates from LandScan 2017 (30,31).

Mobility Data
The mobility data used in this study was collected using Meta Data for Good Disaster maps from South Africa. These are initiated at the onset of a disaster -in this case the SARS-CoV2 pandemic and track the geographic movement of Meta users(32). We used the baseline (adjusting to 2 weeks prior) human mobility for each month from January 2020 to July 2021(32, 33) to attain a mobility probability from and to each municipality. For each origin (home) municipality (N=234), we determined the mean monthly number of Meta users that were in each destination municipality in South Africa. Location pairs with a value of zero were given a value of 10. We divided each cell by the total number of users across all destinations for that origin municipality. Each cell in the resultant original-destination matrix therefore represented the probability of being in each destination municipality, given your home municipality. This is the probability of mobility from each municipality to each other municipality after 2019. Because we don't have mobility data from the exact years the genomes were sampled we adjust the diagonal of the mobility matrix using an estimated parameter. This allows people to stay more or less at home and mitigates the impact of non-year matched mobility data.

Generation Time Distribution
We used a simulation framework to estimate the overall generation time using the separate contributions of the carriage durations and the incubation period. This approach has previously been employed for other pathogens(34).
We sample 1000 carriage durations from an exponential distribution with means which are inverse to the clearance rates estimated in Chaguza, C., et al. 2021(20)  To sample the day of transmission, we then randomly sample a time point between 0 and the time of clearance for each individual. We then separately sampled an incubation period using a uniform distribution of between 1 and 5 days. To account for longer carriages resulting in more opportunities for transmission, we sampled from the distribution of generation, with the probability of sampling each individual weighted by the total carriage duration. The total generation time is then the sum of the duration to transmission and the incubation period.
We repeat these steps 10,000 times and estimate the mean and standard deviation of this final distribution assuming that the generation time follows a gamma distribution with mean 35 and standard deviation 35 (exponential distribution with a rate of 1/0.096). By comparing the histogram of the distribution with the gamma distribution, this appears a reasonable assumption (See Figure   S6).

Sample culture and genome sequencing
These pneumococcal isolates were selectively cultured on BD Trypticase Soy Agar II with 5% sheep blood (Beckton Dickinson, Heidelberg, Germany) and incubated overnight at 37°C in 5% CO2. Genomic DNA was then extracted manually using a modified QIAamp DNA Mini Kit (QIAGEN, Inc., Valencia, CA) protocol. As part of GPS, pneumococcal isolates were wholegenome sequenced on the Illumina HiSeq platform to produce paired-end reads with an average of 100-125 bases in length and data were deposited in the European Nucleotide Database. Whole genome sequence data was processed as previously described (25).

Constructing time resolved phylogenetic trees
We selected the GPSCs for which we had genomes from each of South Africa's nine provinces and for which we had a minimum of 50 sequences in total to build phylogenies, henceforth referred to as "Dominant GPSCs". There were nine dominant GPSCs. We created reference genomes for each GPSC using ABACAS to order the contigs from a representative of each GPSC mapped to Streptococcus pneumoniae (strain ATCC 700669/Spain 23F-1) [EMBL accession: FM211187].
Any contigs which did not align were concatenated to the end. We multiply mapped all genomes from each Dominant GPSC against these references respectively using a custom mapping, variant calling, and local realignment around indels pipeline using bwa-MEM(39) and samtools mpileup(40). We built trees masking recombination regions using Gubbins(41) with RAxML (42) and a GTR model. We converted branch length to time using BactDating with a mixed gamma, relaxed clock model (43). We compared concordance between BEAST(44) with both strict and relaxed clocks, and a Bayesian skyline prior. As the results were concordant, we used BactDating due to its shorter runtime ( Figure S17).

Risk Ratio Framework
We used a risk ratio framework to investigate the risk of genetic similarity across geographic distance(45). We compare the location (loc) and label (G) (i.e., GPSC or genetic similarity) of pairs of sequences that were collected around the same time (t). This approach has been shown to be robust to substantial biases in timing and location of isolate collection(45). We first constructed pairwise matrices comparing every isolate to every other isolate (N Pairs=6910). In the numerator was the ratio of pairs which were the same GPSC, collected within a year of each other, from the same province, over the total number of pairs collected within a year of each other from the same province. The denominator was the ratio of pairs which were the same GPSC, collected within a year of each other, from distant provinces (>1000km apart)(Lref), over the total number of pairs collected within a year of each other from distant provinces.
Geographic distances were calculated based on the centroid coordinates of each province. To demonstrate the suitability of using centroid distances, we simulated a spatial transmission process for 1000 separate chains where at each generation a daughter point is placed at a randomly located location 350m in each of the x and y direction. This is repeated over 20 generations. We then identify the 'centroid' of each case based on the closest coordinate rounded to the nearest kilometre. We then calculate the total distance covered for both the true distance and the centroid distances and found that the resulting distances are very similar ( Figure S18).
To quantify uncertainty, we used a bootstrapping approach where in each bootstrap iteration we randomly sampled with replacement the isolates before recalculating the statistic. We report the 2.5 and 97.5 percentiles from the resulting distribution.
We also repeated the same analysis but used the time-resolved phylogenetic trees to interrogate pairs across increasing divergence times (breaking the GPSCs into higher resolution). For this, rather than matrices designating whether pairs were the same GPSC or different GPSCs, the divergence time between each pair was included. We only included the divergence times between like-GPSCs ( Figure 2C-E). We utilized the framework to compare a range of geographic distances keeping the reference distance to pairs which were >1000km apart ( Figure 2).
To identify the divergence time at which pairs had an equal risk of being in the same province as distant provinces (time to homogenization across South Africa) we investigated the divergence time at which there was no increased risk of similarity within a province compared against distant provinces (Relative Risk = 1). We stratified distances in South Africa into groups of distances including the 9 within province, 14 pairs which were <500km apart, 16 pairs 500-1000 km apart, and 6 pairs provinces >1000km apart. We repeated the framework across rolling 20-year time windows at 10-year intervals from 0-100 years (g1,g2)( Figure S5A). We repeated this for pairs where one was from South Africa and the other was from another country in Africa ( Figure S5B).
To quantify uncertainty, we used a bootstrapping approach where in each bootstrap iteration we randomly sampled a tree, and then sampled with replacement the sequences before recalculating the statistic. We report the 2.5 and 97.5 percentiles from the resulting distribution.
We repeated all analysis with sub-sampling to 300 to compensate for biased sampling, and only including pairs isolated from patients with pneumococcal disease ( Figure S4).

Overall Strategy
We extend a mechanistic phylogeographic model previously employed by Salje et. al, 2021(16) to estimate the mobility of the pneumococcus between pairs of municipalities at each transmission step. To infer the probable path of transmission between sequence pairs we used the divergence time and the generation time distribution to estimate the number of transmission generations between pairs of sequences. Each generation is a possible transmission event and provides an opportunity for a mobility event.
The approach ultimately aims to estimate an origin destination matrix for a transmission step, where each cell represents the probability that the pneumococcus is now in location j after one transmission step given it was previously in location i. As the phylogenetic trees combined with the generation time provide an estimate of how many transmission steps separate pairs of samples, we can use repeated matrix multiplication to integrate over all possible pathways linking two locations (see below for more details). We incorporate the probability of sampling at each geographic location, within each collection year, by GPSC to account for our observation process.

Notation
We follow notation implied by Salje et. al, 2021(16). A pair of isolates, Ca and Cb, with sequences, Seqa and Seqb included in a phylogeny are found in locations, La and Lb, and the samples were taken in the years, Ta and Tb. The inferred MRCA between Ca and Cb is time Tm and located in Lm. The number of transmission generation from the MRCA to Ca is Ga and to Cb is Gb.

Single Transmission Generation
Considering a single transmission generation, the probability that persons i and j come into contact with each other given i lives in location a and j lives in location b can be written as: Where C(M 0 = N |= 0 = 8) is the probability that individual i, whose home location is in a, visits location k and CSM 1 = N T= 1 = L) is the probability that individual j whose home location is in b visits location k, and Bk is the location-specific probability of transmission for k.
At time U, one infector in i in location a is expected to transmit to this number of persons in location b: Equation 3.
The total number of people infected by the infector is: Conditional on transmission occurring, the probability that the infectee's isolate is taken in location b is:

Human Mobility Characterisation
We use Meta mobility data (MetMob), as described in the mobility data section above, to characterise mobility between the 234 municipalities of South Africa. We aggregate these to province level (N=9) to fit the model. Initially we extract a 234X234 matrix which sets out the probability that an individual from municipality a visits location k; again where k=any municipality.
MetMob comes from individuals using Meta however this may not be representative of the amount of time spent at home by those involved in pneumococcus transmission. The mean of the diagonal of the Meta Mobility matrix (234X234) is 0.9893 implying, on average, 98.9% of Meta Users stay in their home municipality, Hi. To allow for individuals to spend more or less time at home than represented in the MetMob data we incorporated a parameter to adjust the probability of being at home (t). We adjust the probability of staying home using a standard logistic function and restrict it with bounds of -0.04 and 0.6 to facilitate exploration of a sensible space. The adjustment allowed by the bounds limits the range of movement Hi-0.6 and Hi+0.04.
The probability of a person remaining in location a therefore becomes: Equation 6.
) ⋅ (0.6 + 0.04)) Where a value of greater than 1.0 is obtained for a specific municipality, this is replaced by a value of 0.999.
We rescale the probabilities so that the sum of all mobility is equal to 1: Equation 7.
Where MetMob [a,k] considers all mobility probabilities from the Meta Mobility data.
The sum of the movements to South African municipalities is equal to 1 which assumes that the analysis contains all possible movements of both the pneumococcus and people implying no external introductions. The outcome of this is that some mobility may be missed especially around the country borders.

Probability of the pneumococcus being in each location after G transmission generations
To determine the probability that location k contains the home location after G transmission generations, we use matrix multiplication which integrates across all possible pathways connecting two locations. Where tr is the time of generation Gr.

Probability of observing a pair of cases in two specific locations.
The probability that CA has home location LA and CB has home location LB is conditional on the sequences being observed in locations LA and LB at times TA and TB. We assume that the .

Probability of G generations between MRCA and a sequenced isolate
Under the previous equation C(: è , : g |X7í è , X7í g , ì è , ì g ), represents the generation time distribution We can extract the joint probability of CA and CB being separated from the MRCA by GA and GB transmission generations respectively using the above derived generation time distribution and the time resolved phylogenetic trees.
Assuming the generation time is gamma distributed and all transmission events are independent the sum of the gamma distribution is also gamma distributed. Additional we can extract the evolutionary times EA and EB, separating CA and CB from the MRCA. As described in Salje et al. 2021, Equation 14 (16) we can estimate the probability of g transmission events over many trees allowing us to incorporate phylogenetic including evolutionary parameters from the tree structure.
We determined the probability for the number of generations from MRCA for each isolate for 1 to 1000 generations, using the generation time derived above.

Location of the MRCA (P(Lm))
We estimate the probability that on average, an MRCA is in each of the nine provinces in South Africa. We estimate parameters for each of 8 provinces setting Western Cape aside as a reference and dividing by the total across all 9 to ensure that the sum of the probabilities is 1. This again assumes no external introductions.

Calculation of Likelihood
We can calculate the likelihood using all pairs of available sequenced S. pneumoniae as described in Salje et al. 2021(16). We accounted for the observation process by incorporating the probability of sampling in each location for isolates belonging to each GPSC annually.

Likelihood Equation
We calculate the likelihood using all pairs of sequenced pneumococci as follows: Where ngpsc are the number of sequences available from GPSC gpsc.

Hamiltonian Markov Chain Monte Carlo
We used an MCMC approach to estimate our parameters using the package fmcmc implemented in R (47). We estimated 9 parameters: -a parameter that adjusts the probability of staying in the home municipality as compared to Meta mobility data.
-8 parameters capturing the relative probability that the MRCA of a pair of individuals was in each of the other eight provinces as compared to the province of Western Cape (the reference).
We only used pairs of sequences that were separated by under 10 years of evolutionary time between them. After 10 years there are limited spatial signals remaining as the bacterium has had many opportunities to move. This approach is to make the model computationally tractable.
In a sensitivity analysis we showed that increasing the model to 15 years resulted in similar estimates ( Figure S19).

Model performance
In order to assess the performance of our model, we used a simulation framework. We simulated 50000 pairs of events, locations, and the location of the MRCA between pairs, where the probability of mobility between each pair of locations was determined by the Meta mobility matrix adjusted by a parameter of known value of -2. We tested our ability to recapture this input parameter. To incorporate the biased observation process, we down-sampled the simulated data based on the by-province sampling probabilities from our true data.
We used the down-sampled data to fit our model with 20000 steps of an MCMC with a jump-step of 0.08 in 3 chains. We were able to recapture the down-sampled data utilising the human mobility framework and the estimated parameter for the probability of staying at home, accounting for the sampling probability per province. We then utilised the same human mobility framework and the estimated parameter, excluding the sampling probability and were able to recapture the complete simulated data including the input parameter ( Figure S9).

Model sensitivity analyses
Further we estimated our parameters only to include isolates from patients with invasive pneumococcal disease ( Figure S2B).
To test the impact of a range of generation times on the parameter estimates we also re-ran our MCMC with generation times of 15, 35 (as included in the model), and 55 days ( Figure S20).
We confirmed the chains converged for each of the nine parameters estimated ( Figure S21).

Person-to-person transmission chains
We simulated person-to-person transmission chains seeded in a starting municipality weighted by the population size. We determined the relative risk of being in a specific municipality after 10 transmission generations (approximately one year) across 500000 simulations.
We fixed the starting municipality to be rural (population density <50/km 2 ) or urban (population density >500/km 2 ) and repeated the above simulation. For 10000 sequential simulations we counted the number of unique municipalities affected and distance travelled at each transmission interval weighting by population size. We then determined the number of municipalities travelled to across all transmission chains ( Figure S7).

Branching epidemic
We simulated a branching epidemic in which we drew the number of transmission events seeded by each event from a Poisson distribution around an effective reproductive number (Reff) of 1 and amplitude of 0.15 over 100 iterations. We determined the mean distance from the starting municipality after 60 generations (5.8 years with a 35-day generation time) and the number of municipalities visited over that time. We calculated the uncertainty at the 95% confidence intervals of a normal distribution.

Overall Strategy
We developed logistic growth models to fit the changing prevalence of different serotypes utilizing a method which has previously been implemented for the endemic bacterium, Bordetella pertussis(48).

Vaccine status
We first binned serotypes into three groups those not included in the vaccine (NVTs), serotypes included in PCV7 (4,6B,9V,14,18C,19F,23F), and additional serotypes included in PCV13 (1,3,5,6A,7F,19A). We computed the relative abundance > 0,åsò of each group of serotypes, i., compared to a reference type ref.  where ri,ref is the growth rate of that abundance, shared across all provinces and fi,ref,0 is the initial relative abundance of the serotype group G with respect to a chosen 97>.
To control for the varying presence of all circulating serotypes through time, we present fitness as the average relative growth rate, 9 û ü, for each group with respect to a randomly selected group in the population: Equation 12.

1@0
where n is the number of groups, and > † ü is the average absolute frequency of the group in the period of time considered.
This average relative growth rate 9 û ü, can be identified as the selection rate coefficient of the group in the population considered(48). The selection rate coefficient is a direct measure of the fitness advantage of emerging variants and is one of the best indicators as to whether a strain will increase in frequency during an outbreak(49, 50).
We can further multiply the selection rate by the mean generation time (35 days) to obtain the selection coefficient per generation, which is the relative fitness advantage per transmission generation.
We also estimated the growth rate per serotype to capture whether the individual dynamics were concordant with, or deviated from, what was expected according to their respective group (NVT, PCV7, or PCV13)( Figure 4A-C).
To investigate whether the serotypes fitness changed across upon implementation of PCV7 and PCV13, we tested a range of models. We considered a model without any  (51). Further, we also tested for a potential delay between implementation of PCV7 and the change in fitness. Model comparison is presented in Figure S22.
The model performed best when the fitness switch was in 2009 (the initial year of vaccine implementation). We used this model in the main text.
To fit the model, we used Hamiltonian Monte Carlo as implemented in R-Stan(52). We used a Poisson likelihood to fit the observed proportion of sequences that were of each category and the total number of isolates in that year as an offset. The model estimated the proportion of isolates that were of each type at the start of the dataset and the fitness parameters.

Antimicrobial Resistance
We then used the model to capture the decreasing proportion of penicillin resistance in the population. To do this we incorporated the dynamics of vaccine type (VT; PCV7 and PCV13) and NVTs in the population, and the respective proportion of penicillin resistance within them.
We used the same approach described above to model the proportion of VTs, fvt, and NVTs, fnvt.
We then model the proportion of strains which were and were not penicillin resistant within each group, either VT, famr|vt, or NVT, famr|nvt. We estimate the fitness of resistance in each.
We then fit the model to all four groups, fvx,amr (Resistant NVT, Susceptible NVT, Resistant VT, and Susceptible VT).

>°¢ ,`^å (3) = >°¢(3) • >`^å |°¢ (3)
Where fvx is the proportion of either VT or NVT in the population and famr|vx is the proportion of penicillin resistance within each of those groups. We then derive the proportion of penicillin resistance overall in the population by summing the proportion of VT resistant, fvt,amr, and NVT resistant, fnvt,amr, strains.

Estimating Impact of Fitness on Migration
We included the calculated average relative growth rates in the simulated branching epidemic to calculate the mean number of municipalities visited, distance travelled, and probability of being in the home municipality over 5 years. We report uncertainty at one standard deviation from the mean. We repeated this incorporating the post-PCV selection coefficients for NVTs and VTs and estimated the relative increase in the number of municipalities visited for NVT serotypes compared to PCV serotypes after 2 years of transmission.

Fitness Model Carrying Capacity Sensitivity
Our fitness model assumes constant fitness over time. This specifically means that if a serotype is found to be fitter than the rest of the population, we expect it to replace the whole population after some time. However, the currently best supported model proposes that negative frequency dependent selection drives pneumococcus population dynamics (53), whereby lower frequency genes become more fit.
To test the impact of our assumption on our fitness estimates, we performed a sensitivity analysis.
We introduction minimum and maximum carrying capacities in our logistic model. In equation 14, we let the relative frequency > 0,åsò go to a maximum of £^`¢, and in equation 15, a minimum of We considered a range of values for £^0 ? (0.1, 0.05 and 0) and £^`¢ (0.9, 0.95 and 1). We tested the impact of this carrying capacity has on both the vaccine status model ( Figure S23), and antimicrobial resistance and vaccine status model ( Figure S24). In each case, we compare the fitness estimates obtained. We found the fitness estimates to be robust to variable carrying capacities across those tested implying our assumption does not impact this framework.

Code Availability
All scripts for analysis are available on GitHub at: https://github.com/sophbel/Analysis_GeographicMobility_pneumoPaper Trees are recombination masked, aligned to a reference for each GPSC. Time resolution was performed using BactDating.

Fig. S2. A) Mean geographic distance (km) across rolling 20-year divergence time windows between pairs for disease isolates alone (red) and for all isolates (blue). Subset to 20 years only as an inset. B) Mobility model fit (lines) against data (points) for disease alone (red) and all isolates (blue)
. This is a subset from Figure 1D.          To demonstrate the suitability of using centroid distances, we simulated a spatial transmission process for 1000 separate chains where at each generation a daughter point is placed at a randomly located location 350m in each of the x and y direction. A) This is repeated over 20 generations. B) We then identify the 'centroid' of each case based on the closest coordinate rounded to the nearest kilometre C). We then calculate the total distance covered for both the true distance (black) and the centroid distances (red).