Correction: Cuban history of CRF19 recombinant subtype of HIV-1

[This corrects the article DOI: 10.1371/journal.ppat.1009786.].


Introduction
For HIV-1, more than 100 recombining subtypes have been detected [1]. They have varying epidemiological features, such as pathogenesis and transmission patterns [2]. CRF19 is a recombinant form of HIV-1 subtypes D, A1 and G. It was first detected in Cuba in 1999 when four epidemiologically unrelated individuals (diagnosed in 1997) were sampled [3]. Casado et al. [3] proposed an African origin for CRF19 explained by its homology to an AG intersubtype recombinant from Cameroon and a subtype D virus from Gabon. However, while CRF19 has maintained a very low prevalence in the African continent through all these years, the virus has spread successfully in Cuba, placing it among the most prevalent circulating subtypes (24% of all sampled sequences by 2017). While CRF19 identification and first sequences were obtained from patients diagnosed in 1997 [3], a retrospective analysis performed by Perez et al. [4] exposed that this recombinant was already infecting Cuban patients who were diagnosed between 1986 − 1990, which evidences an early introduction in the outbreak of the epidemic. Potential central African ancestry of CRF19 could be explained by presence of numerous Cuban military and civilian personnel in several sub-Saharan African countries (the Democratic Republic of the Congo (DRC), Angola, Ethiopia) from 1960s to 1980s [5]. However, CRF19 was never reported in sub-Saharan Africa.
In Cuba, CRF19 was found to cause particularly rapid progression to AIDS (< 3 years after seroconversion, versus 5 to 10 years for other subtypes and CRFs [6]) and is associated with drug-resistance [7] in many newly infected patients. Several subsequent introductions related to Cuban samples were then reported in Spain (with one limited outbreak in southern Spain, with mild pathogenicity) [8][9][10] and Tunisia [11]. Understanding the origin, transmission and drug resistance acquisition of such aggressive subtype is therefore key to impede its further spread in other regions of the world.
Delatorre and Bello [12] performed a maximum likelihood (ML) phylogeographic study of CRF19 and D sequences (as CRF19 is subtype D in the pol fragment analysed) from Los Alamos HIV Sequence database. Their data set contained 160 Cuban CRF19 sequences sampled between 1999 and 2011, 3 European CRF19 sequences and 2 000 subtype D sequences from Africa (1996)(1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011). They discovered that all Cuban CRF19 sequences branched in a highly supported monophyletic sub-cluster that was nested within subtype D pol sequences of central African origin. This indicates a common recombined ancestor from which CRF19 spread further. The 3 CRF19 isolates in Europe branched within the Cuban clade, indicating a transmission from Cuba to Europe. They also performed a Bayesian analysis to estimate the time of the most recent common ancestor (MRCA) of the Cuban clade: The median of the estimated date was 1987 (confidence interval (CI) 1983 − 1991).
The goal of this study is to better understand the geographic origin and transmission patterns of CRF19, especially regarding transmission mode and drug resistance, using the new data from Cuba and international databases. New CRF19 and D-infected patients were detected and sequenced in Cuba and worldwide since the study of Delatorre and Bello [12]. Moreover, additional information can be extracted by performing similar studies based on the parts of CRF19 that correspond to subtypes A1 and G. Importantly, with more fine-grained geographic data available for some of the Cuban sequences (e.g. provinces of diagnostics) the CRF19 spread within the country can now be elucidated. Similarly, as for many newly sampled Cuban sequences the information on the risk group is available, it can be used to reconstruct the transmission mode scenarios to learn how CRF19 spread between and within individual communities. Lastly, as transmitted drug resistance (TDR) is becoming predominant in several countries (e.g. UK [13]) and HIV transmission might be further impacted by drug resistance, we also investigated drug resistance mutation spread of CRF19.
We performed maximum likelihood phylogenetic analyses using fast and easy-to-use software, to investigate the spread of the virus, its mode of transmission and drug resistance patterns. Our new Cuban data set contains rich metadata (diagnostics dates, sampling dates, risk groups, province of diagnostics, etc.), from which we obtained spatio-temporal predictions for worldwide and local CRF19 spread as well as its mode of transmission among risk groups in Cuba. The article is organized as follows: we first describe the data set and the insights on CRF19 origin and transmission patterns that we acquired in this study. We then detail the phylogenetic methods used for the temporal, phylogeographic, transmission mode and drug resistance analyses, which could be used for other HIV-1 CRFs or subtypes, and other viruses.

Results
Data sets We combined the data from two sources: a Cuban data set (from now on referred as CU) and the 2018 HIV-1 compedium genome alignment from the Los Alamos HIV database (LA) [14], sampled between 1983 and 2018. LA contained 3 884 sequences of all M-group HIV subtypes and recombinants that were curated. We chose this smaller but curated data set in order to avoid noisy and erroneous data (partial sequences, erroneous date, subtype or country annotations for non-curated sequences, etc.) that could perturb our analyses. CU contained 1 674 partial pol and 164 partial env sequences (of various subtypes) recently sampled in Cuba, which we subtyped and aligned against LA as described in Materials and methods. We created three data sets out of the combined LA+CU alignment for CRF19 analyses (described in more detail in Table 1): 1. D+CRF19 -the D-part of CRF19 (C-part of gag, PR, RT and nef, breakpoints from [3]) for 386 sequences of subtypes D and CRF19 (276 sequences), from CU and LA.
2. A1+CRF19 -the A1-part of CRF19 (N-part of gag, Integrase, env ) for 218 sequences of subtypes A1 and CRF19 (31), from CU and LA. Phylogenetic trees For each data set, we reconstructed phylogenetic trees as described in Materials and methods and found that all the CRF19 sequences formed a unique cluster (see colourstrips in Fig 1). For A1+CRF19 and D+CRF19 data sets the CRF19 cluster was placed within the A1 and D trees, while for G+CRF19 data set it formed a sister-group to the G subtree (the G+CRF19 data set contained few sequences and did not include G sequences branching before the split with the CRF19 clade). There were several mis-annotated sequences in CU data set. For example CU1437-17 was placed within the D sequences in the D+CRF19 tree, while annotated in CU as CRF19: As it only contained the D-part of the sequence (and therefore could not be distinguished between CRF19 and D during the non-phylogeny-based subtyping) the annotation was most probably erroneous. In a similar manner, the CRF19 clusters of A1+CRF19 and D+CRF19 trees included a few A1 and D sequences respectively: Those sequences only contained the A1 or D fragments and therefore were likely mistyped as well.
Dating The dates obtained for the MRCA of the CRF19 cluster and for its parent (i.e. the most recent D/A1/G-subtype ancestor of the CRF19 cluster) estimated on the three data sets are shown in Fig 1 and in Table 2. For all the data sets the CRF19 MRCA date is estimated in the 1970s, and the most recent A1/D/G-subtype ancestor dates are 1969/1959/1958 respectively. Note that as the three data sets include different CRF19 sequences, the CRF19 cluster MRCA dates do not have to agree, but they all correspond to the time after CRF19 recombination. Similarly, the A1/D/G-subtype ancestor dates correspond to the time before recombination and do not have to be the same for the three data sets, but have to be more ancient than all of the CRF19 MRCA dates. Based on our estimations, the dates of origin inferred by using different parts of the CRF19 are all consistent and go in the same direction: we obtain a confident estimation of the date of recombination, occurring between 1966 (lower CI of the date when in A1+CRF19 data set it was still A1) and 1977 (upper CI of CRF19 MRCA in . Nodes and branches are coloured by country (ACR performed by PastML, MPPA+F81), as explained in the legend in bottom left: Cuba is coloured red, African countries are in shades of violet apart from the the DRC (black), European countries are in shades of green, Asian countries are in shades of pink, Australia is yellow. Colourstrips around the trees show sample subtypes: A1 (light-blue), D (yellow), G (light-orange), CRF19 (red), and A1-A3 recombinant (blue, note that this sequence was used in the A1+CRF19 data set as its part that aligns with the A1-part of CRF19 is fully A1). CRF19 sequences form a cluster for all data sets. Subtype misannotations lead to several D-and A1-annotated sequences within the corresponding CRF19 clusters, and one CRF19-annotated sequence (CU1437-17 ) outside of the CRF19 cluster in the D+CRF19 tree. Dates (with CIs, reconstructed by LSD2) and countries (with marginal probabilities, reconstructed by PastML) of the CRF19 cluster MRCAs and their non-CRF19 (i.e. A1, D or G) parent nodes are shown. The dates of non-recombinant Cuban clusters are also shown. the D+CRF19 data set). We also dated the non-recombinant A1/D/G-subtype Cuban sequences in our data sets. They were estimated to be introduced to Cuba later: 1982 for G, 1996 for A1 and in the 2010s for D (see Fig 1), supporting the hypothesis that the recombination occurred in Africa, as co-infection and thus high-level circulation of the 3 subtypes was needed for recombination. Phylogeography The country predictions (and their marginal probabilities) for the MRCA of the CRF19 cluster and of its parent (i.e. the last A1/D/G-subtype ancestor) estimated on the three data sets are shown in Fig 1. The results suggest that CRF19 recombined in the Cuban community, as the CRF19 MRCAs in all data sets and the A1 ancestor have Cuba as the country reconstruction. The reconstructions also suggest that the D-strain ancestor has a DRC origin, while the G-strain reconstruction hesitates between Cuba and the DRC (with a slight preference for the DRC). Overall, these results fit well with the hypothesis of a recombination within the Cuban community in the DRC around 1970s [1966 − 1977]. We further performed a province-level analysis of the CRF19 subtree from the D+CRF19 data set. The root of CRF19 subtree (dated in 1974) was unresolved with two most probable provinces being Villa Clara (0.64 as marginal probability) and La Habana (0.24). This prediction is not surprising and is consistent with respect to the previous dating and country predictions: the CRF19 subtree most likely originated in the Cuban community in sub-Saharan Africa. From there the epidemic spread to Villa Clara very early (marginal probability of 0.92, date 1977, CI: 1973 − 1979), before HIV discovery in the 1980s and likely before its introduction in a number of developed countries by the very end of 1970s (e.g. USA, UK [15]). Then CRF19 gave rise to multiple introductions from Villa Clara to La Habana in the 1980s and 1990s and then further spread to other provinces (Fig 2).   [16] (MPPA+F81). Nodes and branches are coloured by provinces (colour code is explained below), gray thinner branches correspond to a province change, multi-coloured nodes correspond to multiple possible states (e.g. the root, unresolved between Villa Clara and La Habana). The oldest resolved node and its date are indicated and correspond to Villa Clara (marginal probability 0.92), the dates and marginal probabilities of the main introductions from Villa Clara to La Habana are also marked. RT:M41L some TDR was present (see Fig 4): RT:M41L is associated with resistance to AZT [18], the first ARV accepted in Cuba (used since 1987).

Discussion
CRF19 is a predominant circulating recombinant form of HIV-1 in Cuba, with 24.1% of new HIV-positive patients being infected with it [19]. We investigated its origin and transmission patterns using new sequence data from Cuba. Based on our date estimations, in 1966 CRF19 still did not exist as in the A1+CRF19 data set it was still A1 (lower CI), but by 1977 it already recombined as in the D+CRF19 data set this date is the upper CI of the CRF19 MRCA. This estimate is older than the one obtained by Delatorre and Bello [12] (CI: 1983 − 1991) but is not contradictory as Delatorre and Bello estimated the date of the CRF19 MRCA on a data set that contained less CRF19 sequences and hence should be regarded as the upper bound of the recombination date. Our result leaves both a recombination in Cuba and a recombination prior to introduction to Cuba possible (as Cubans were present in sub-Saharan Africa and especially DR Congo since the 1960s [5]). Recombination in Cuba was proposed by Delatorre and Bello for other prevalent HIV-1 recombinant forms (CRF20 BG, CRF23 BG, CRF24 BG), which probably emerged between 1996 − 1998 from a single CRF BG ancestor dated in 1991. However, for CRF19 it seems unlikely that the recombination event took place in Cuba considering that (1) we traced MRCA for CRF19 back to 1977 or before, (2) the first Cuban HIV-1 patients were not diagnosed until 1986, (3) the non-recombinant A1/D/G-subtype Cuban sequences in our  [16] (MPPA+F81), the visualisation is performed with iTOL [17]. Nodes and branches are coloured by transmission mode (green for HET, beige for MSM), gray thinner branches correspond to a state change, gray nodes have unresolved states (MSM or HET). The colourstrip around the tree shows the gender of the sampled individuals (red for female, blue for male), and we see several transmissions from the MSM community to female HET individuals. data sets were estimated to be introduced to Cuba after 1977 (1982 for G, 1996 for A1 and in the 2010s for D), while (4) recombination had to occur in a scenario of co-circulation of subtypes A, D, and G. Moreover, countries from sub-Saharan Africa were proposed to be sites of origin for other AG and AD recombinants within the same period of time [20,21]. All together, we can conclude that CRF19 was among the very first CRFs/subtypes to be introduced in Cuba, by the mid of 1970s. Other subtypes have undergone the same journey, from Central Africa to the Caribbean, for example subtype B, which was present in Haiti since the end of the 1960s [22]. However, while subtype B rapidly reached the USA and then Europe and the rest of the world, CRF19 stayed at Cuba, where it represents a major issue due to its prevalence and short time to AIDS. To our knowledge, this is the first study which attempts to predict the origin of HIV-1 epidemic (and CRF19) by Cuban province employing sequence data. The phylogeography analysis detected that La Habana and Villa Clara played a major role in epidemic spread. We estimated that CRF19 was initially introduced to Villa Clara in the late 1970s (marginal probability of 0.92), had multiple introductions to La Habana in 1980s and 1990s and spread from there to other provinces. In agreement with our results, Dhanjal et al. [23] employed a series of graph algorithms to predict the evolution of the HIV/AIDS network in Cuba between the years 1986 and 2004. These authors suggested that at the end of 1986 epidemic was mostly located in La Habana, with 31 detections, and pointed out that Villa Clara and Sancti Spiritus were among the first provinces showing an increase in the detection rate. A sudden growth in detection in La Habana was expected given the high population density of this city and the difficulties to control the epidemic. In addition, an analysis performed by Miranda et al. [24] showed that, until 2010, Villa Clara remained among the provinces with a high incidence rate.
Our analysis on the transmission mode recreates some of the Cuban HIV-1 epidemic features such as a rapid spread among MSM community and multiple introductions to HET population (16% of all transmissions with MSM source, 50% of which were to females), indicating potential importance of bisexual individuals in the epidemic spread. According to epidemiological data, Cuban epidemic started as a mainly HET epidemic, after 1989 detection prevailed in homo/bi-sexuals, however, by 2002, a sudden rise in HET group occurred, ultimately showing a tendency to parity [24,25]. In our reconstruction, CRF19 epidemic in Cuba followed a similar pattern and started in HET group, which agrees with the Pérez et al. [25] finding that among 28 individuals diagnosed with HIV between 1986 and 1990, 25 were HET and were infected with a high diversity of genetic forms.
Lastly, we did not detect almost any transmitted drug resistance, apart from several small TDR clusters for RT:M41L, a DRM which is associated with resistance to AZT, the first ARV accepted in Cuba (in 1987). One of the reasons why we could not trace almost any TDR might be because the sequences employed in this study were obtained as a result of Drug Resistance Surveillance in Cuba, and not the result of contact tracing studies. Several studies on the prevalence of TDR and a national survey from 2017 [19,26] showed that, despite a high drug resistance incidence (30% pre-treatment drug resistance), no clusters of TDR have been identified in Cuba to date for any HIV-1 subtype. However, epidemiological data combined with a small genetic distance among sequences from different patients has evidenced that, indeed, transmission of resistant HIV-1 genetic variants occurs in Cuba. This was almost invisible in our data, most likely due to their scarcity (< 10% of infected individuals).
Considering all this, we suggest that CRF19 recombined in the Cuban community present in sub-Saharan Africa since the 1960s, most probably in the DRC, as a consequence of the co-circulation of subtypes A, D and G. It was introduced in Cuba very early, in the 1970s, and started to spread from a single transmission event, first in Villa Clara and then in the 1980s in La Habana. This early introduction to Cuba, which is earlier than the introduction of A, D and G subtypes, for example, could explain the local success of CRF19, which had no particular success in other parts of the world. This local success would thus result from a founder effect rather than some transmission advantage of CRF19, as supported by the relatively stable incidence of CRF19 through time: 18 . However, importantly, the fact that the time to AIDS with CRF19 is substantially lower than with other subtypes and CRFs (3 years versus 5 − 10 years [6]), makes it crucial to survey CRF19 sub-epidemics not only in Cuba but also in other parts of the world having regular exchanges with Cuba.

Materials and methods
Alignment and subtyping LA dataset contained 3 884 aligned and subtyped sequences of all M-group HIV-1 subtypes and recombinants. CU contained 1 674 partial pol and 164 partial env sequences (of various subtypes) recently sampled in Cuba. We subtyped (pure subtypes and recombination positions) and aligned them against LA using jpHMM [27] (for detailed options see Analysis pipelines). When several sequences were present for the same patient in CU, we kept only the first one (in terms of sampling date). Most of the sequences in the CU data set were pre-annotated with a subtype; we replaced these annotations with jpHMM predictions in cases when they were contradictory (e.g. not matching jpHMM breakpoints for CRF19). All together, we obtained a large MSA of 5 707 sequences, including both LA and CU, from which we extracted our D+CRF19, A1+CRF19 and G+CRF19 data sets (see Table 1 for details).
Metadata LA alignment contained country and sampling date annotations. The sequences from CU were also annotated with the province and date of diagnostics (which sometimes preceded sampling by several years), patient gender, risk groups (HET or MSM) and treatment status (treated or treatment-naive). For the DRM analysis we extracted the presence/absence of the surveillance DRMs with Sierra web service [28] (for detailed options see Analysis pipelines)).
Time-scaled trees For each data set we reconstructed a phylogenetic tree using RAxML-NG (v0.9.0, evolutionary model GTR+G6+FO+IO, for detailed options see Analysis pipelines) [29] and rooted it with an outgroup (5 LA sequences of a different subtype, B for D+CRF19, A6 for A1+CRF19, A for G+CRF19, which were then removed).
We then dated each tree with LSD2 [30] (v1.4.2: github.com/tothuhien/lsd2/releases/tag/v.1.4.2, under strict molecular clock with outlier removal, for detailed options see Analysis pipelines) using tip sampling dates, and constraints based on the diagnostics dates. As a phylogenetic tree is a proxy for a transmission tree with incomplete sampling, its branches correspond to patients and its internal nodes correspond to virus transmissions from a donor patient (parent branch and one of the child branches) to a recipient patient (the other child branch). As the recipient patient must have been infected (internal node) before being diagnosed, we put constraints for the dates of the internal nodes that had two tip children, at the dates of diagnostics of their recipients. As one cannot distinguish between the recipient and the donor branches in the tree (see Fig 5 for more details), we picked the more recent diagnostics date between the two and imposed the transmission node date to be before this diagnostics date. The temporal constraints on internal nodes are incorporated in LSD2 using the active-set method in the same way as it was used in LSD [30] to ensure the temporal precedence on ancestral nodes. (right) the donor was another individual that was not sampled and does not appear in the tree, therefore that transmission must have happened before both A and B got diagnosed. In practice, however, we cannot know which of the three scenarios is the correct one and therefore have to pick the more recent diagnostics date (B, Sep 2014) as the upper-bound constraint for the internal node. Ancestral character reconstruction (ACR) We analysed geographic origin and transmission patterns of the CRF19 epidemic worldwide and in Cuba using ACR with PastML tool (v1.20, for detailed options see Analysis pipelines) [16]. PastML infers an ancestral scenario for a given character (e.g. location) on a rooted phylogenetic tree, by assigning the likely ancestral character states to every node in the tree, based on the character annotations provided for some of the tree nodes (typically tips). PastML implements a very fast ACR in maximum likelihood framework, under an s-state (where s is the number of possible character states, e.g. countries) F81-like [31] model, and can handle very large phylogenies (dozens of thousands of tips), while being robust to mild sampling bias and model violations [16]. With marginal posterior probabilities approximation (MPPA) method a unique state is predicted in tree nodes with low uncertainty, whereas several states are predicted if they have similar marginal probabilities.
For phylogeographic analysis we reconstructed ancestral scenarios for country annotation on the full trees reconstructed from the three data sets. Additionally, we reconstructed ancestral characters for Cuban provinces on the CRF19 subtree of the D+CRF19 tree (which included most of the CRF19 sequences in our data sets). As for the province-level analysis the annotations were specified at the date of diagnostics, which often preceded sampling, we created additional internal nodes in the tree, corresponding to diagnostics, named them, and associated the input province annotations for PastML with them instead of tree tips (see Fig 6). To add the diagnostics nodes, we processed each tip T whose patient's diagnostics date diag(T ) preceded the sampling date d(T ) > diag(T ). If the diagnostics happened after the last transmission (i.e. its parent node P , where d(P ) < diag(T )), we split the tip branch into two parts with a one-child internal node D at the time of diagnostics (d(D) = diag(T )): The branch length of D became dist(D) = diag(T ) − d(P ), while the branch length of T became dist(T ) = d(T ) − diag(T ). Otherwise (when the diagnostics happened before the last transmission diag(T ) < d(P ), which meant that the parent branch corresponded to the same (donor) patient as the tip), we processed the parent node (or recursively the grandparent node, etc.) instead, in a similar manner.
PastML allows for input annotations to be associated with any node in the tree (not only tips), and infers the states of the other nodes (including unannotated tips) during ACR. The computation of the Up and Down-likelihoods for a one-child node D being in a state i is a corner case of the formulas described in the "Materials and Methods" section of the original PastML publication [16]: Down(D, i) = j P C(i → j/dist(T ))Down(T, j), U p(D, i) = j P C(i → j/dist(D))U p(P, j). If an internal node's state is provided in the input annotation file, then the Down-and Up-likelihoods of it being in any other state are reset to 0. The rest of the formulas from [16] are directly applicable.
To further understand the Cuban CRF19 epidemic, we also reconstructed ancestral scenarios for transmission mode (i.e. MSM or HET). The patients in the CU data set were annotated with their risk group (MSM or HET), and we used these data as a proxy for their potential HIV transmission mode at the moment of diagnostic and sampling: We associated the same annotations to both the corresponding single-child internal nodes (representing diagnostics) and to the tips (representing sampling). Consistently, the transmission mode is the same for the donor and the recipient at the moment of transmission (internal nodes of the tree), but can change with time for the same person. Note, moreover, that the MSM/HET status is uncertain, as some MSM might be bisexual or declared as HET [32].
Lastly, we analysed DRM presence/absence on the CRF19 subtree of the D+CRF19 tree as the D-part of CRF19 includes the parts of the virus genome (PR, RT) that may contain DRMs. We analysed the DRMs that had a prevalence > 10%: RT:M41L, Fig 6. ACR for provinces of Cuba. To reconstruct the ancestral states for characters annotated at diagnostics dates (such as provinces of Cuba), we added additional nodes to the time-scaled trees at times of diagnostics, named those nodes (on the middle panel, the diagnostics nodes, e.g. a, correspond to the tips with the same names but capitalised, e.g. A), and associated the input annotations for PastML with them. We then reconstructed ancestral and tip characters with PastML based on those annotations (see text for calculation details). We checked the antiretroviral drugs (ARVs) that can provoke the DRM and the dates of their introduction to Cuba (e.g. RT:M41L is associated with resistance to AZT, which was the first ARV for HIV and was used in Cuba since 1987, to D4T and potentially to DDI, both of which were used in Cuba since 2001). We then cut the time-scaled tree (with named internal nodes) at the earliest of these dates (e.g. 1987), therefore obtaining the pre-treatment-introduction tree and a forest of post-treatment-introduction subtrees. For the trees in the forest we added additional one-child root nodes (as parents of the corresponding tree roots, at distances that corresponded to the differences between the root dates and the ARV introduction date), which we marked as sensitive in the PastML input annotation file. We performed the ACR with PastML on the forest (functionality that we added to PastML since version 1.9.19): the forest likelihood is calculated by multiplying the likelihoods of all its trees. For the new roots their Up-likelihood was set to 1 for the sensitive state and to 0 for the resistant state, and the Down-likelihood calculated as described above. Once the forest ACR was performed we combined it with the all-sensitive annotations for the pre-treatment-introduction tree nodes.