The origin, evolution and molecular diversity of the chemokine system

Chemokine signalling performs key functions in cell migration via chemoattraction, such as attracting leukocytes to the site of infection during host defence. The system consists of a ligand, the chemokine, usually secreted outside the cell, and a chemokine receptor on the surface of a target cell that recognises the ligand. Several non-canonical components interact with the system. These include a variety of molecules that usually share some degree of sequence similarity with canonical components and, in some cases, are known to bind to canonical components and/or to modulate cell migration (1, 2). While canonical components have been described in vertebrate lineages, the distribution of the non-canonical components is less clear. Uncertainty over the relationships between canonical and non-canonical components hampers our understanding of the evolution of the system. We used phylogenetic methods, including gene-tree to species-tree reconciliation, to untangle the relationships between canonical and non-canonical components, identify gene duplication events and clarify the origin of the system. We found that unrelated ligand groups independently evolved chemokine-like functions. We found non-canonical ligands outside vertebrates, such as TAFA “chemokines” found in urochordates. In contrast, all receptor groups are vertebrate-specific and all - except ACKR1 - originated from a common ancestor in early vertebrates. Both ligand and receptor copy numbers expanded through gene duplication events at the base of jawed vertebrates, with subsequent waves of innovation occurring in bony fish and mammals.


INTRODUCTION
The chemokine system is responsible for regulating many biological processes, including host defence, neuronal communication and homeostasis (3)(4)(5).The system has two components, a ligand, usually a small cytokine called a chemokine, and a receptor.It typically operates through chemoattraction, wherein one cell type produces and secretes chemokines, creating a chemical gradient as these molecules disperse.Cells equipped with the corresponding chemokine receptors on their membranes can recognise and bind to specific chemokines, promoting their migration along the gradient (4).This mechanism allows cells to reach target locations, such as infection sites during inflammation or tissues important for homeostatic functions, e.g., leukocyte maturation and trafficking (3,6).Chemokines involved in the latter homeostatic functions are usually constitutively expressed, while those involved in inflammatory responses have an inducible expression (7).
Chemokine ligands are categorised into four groups, XC, CC, CXC, and CX3C, according to the pattern of cysteine residues in the N-terminal portion of the protein (8).Likewise, the receptors are classified based on the ligands they bind to into four groups, the XCR, CCR, CXCR, and CX3CR, and all of them belong to the GPCR class A superfamily (9).In addition to canonical components, other molecules have been discovered to function similarly to chemokine ligands (1) or receptors (2) (see Table 1).
Despite the extensive research on the chemokine system, with over 320,000 papers available on PubMed, many aspects of its evolution remain unclear.For instance, the homology between canonical and non-canonical ligands is uncertain and supported by circumstantial evidence, such as shared specific motifs (12,13,19,29).Furthermore, the relationships between canonical, atypical, and viral receptors and the outgroup of the canonical chemokine receptors remain uncertain.Finally, the evolutionary history of the canonical and non-canonical components remains poorly understood outside a few key model systems (9,30,31).These outstanding questions share common underlying causes, including the use of inadequate inference methods (such as relying solely on sequence similarities) and limited sampling of species (e.g., focusing mainly on humans, mice, and zebrafish (7,32)).Additionally, solving the phylogenetic relationships for short molecules such as chemokine receptors and ligands is particularly challenging due to the lack of strong phylogenetic signals (33).
Here, to clarify these outstanding questions, we use state-of-the-art phylogenetic methods, including those designed for single-gene phylogenies, a large taxonomical sampling comprising both vertebrate and invertebrate genomes and the entire complement of canonical and non-canonical components of both receptors and ligands.Our findings substantially clarify the phylogenetic relationship between canonical and non-canonical ligands and receptors and suggest that unrelated proteins evolved "chemokine-like" ligand function multiple times independently.Additionally, we discovered that all the canonical and non-canonical chemokine receptors (except ACKR1) originated from a single duplication in the vertebrate stem group, which also gave rise to many GPCRs.Lastly, we characterized the complement of canonical and non-canonical components in the common ancestor of vertebrates and identified several other ligands and receptors with potential chemokinerelated properties that could be explored in future functional work.

RESULTS
There are five unrelated groups of ligands.
Initially, we focused on the ligands, including all the canonical chemokines, the CYTL, the TAFAs and the CKLF Super Family (CKLFSF) proteins (Table 1).The presence of a four transmembrane MARVEL domain in the latter proteins (12,34,35) distinguishes them from canonical chemokines, the CYTL and the TAFAs.Therefore, we separated these two groups for further analysis.Using BLASTP or PSI-BLAST (36)(37)(38) (see Materials and Methods for more details) against 64 species from 19 animal phyla (Table S1), we identified 891 putative homologs for chemokines, TAFA and CYTL and 602 putative homologs of the CKLF Super Family.
We utilised CLANS (39,40), a clustering tool based on sequence similarity and local alignment, to identify homology within these two groups.Unlike traditional phylogenetic methods, CLANS assigns homology between sequences based on BLAST and customisable stringency levels defined according to p-values (39).When two (or more) sequences are connected at a lower p-value (closer to 0), this indicates a high level of homology.Conversely, if two or more sequences only connect at a higher p-value, this suggests a relatively low level of sequence homology.Our analysis shows that canonical chemokines form a distinct group with a clear distinction between C-X-C-type and C-C-type (Figure 1A).Whereas CXCL17, TAFA and CYTL remain separate from canonical chemokines and from each other even at the loosest p-values tested (Figure 1A).The distinction between CXCL17 and all other canonical chemokines is consistent with our receptor results, showing that the potential receptor for CXCL17, GPR35 (41), is also not within the canonical chemokine receptor group (see below).However, it is important to note that recent studies fail to demonstrate CXCL17 activity at GPR35 (42,43).Within the CKLFSF, two large clusters were identified, named CKLF I and CKLF II, although these ultimately connect to form one large superfamily (Figure 1B).These clusters are robust to the different stringency thresholds used (Figures S1 and S2 and Materials and Methods for further details).Our results indicate that even when the stringency level to detect homology is relaxed, canonical chemokines, TAFA, CYTL, and CXCL17 remain in distinct clusters.This suggests that, similarly to CKLFs, these proteins are not homologous and convergently evolved chemokine-like properties.We have thus identified five distinct groups of ligands: i) the canonical chemokines, ii) TAFA "chemokines", iii) CYTL, iv) CXCL17, and v) CKLF Super Family (Figure 1A and 1B).
The evolution of chemokine and chemokine-like ligands in animals.
To better understand the evolution of both canonical and non-canonical chemokine ligands, we performed a separate phylogenetic reconstruction for each group (Figure 1C, D and Figures S8-17) (see methods for details).To evaluate the nodal support, in addition to the UltraFast bootstrap (UFB) (44,45), we used Transfer Bootstrap Expectation (TBE), a method that has been developed for single-gene phylogeny (46).To evaluate ortholog/paralog relationships and overall dynamics of the ligand complement, we used GeneRax (47).This new method uses maximum likelihood to reconcile the gene tree with the species tree (47).In brief, given a gene and species tree, GeneRax uses a maximum likelihood approach to optimise the duplication and loss events (47,48).
Our analysis initially identified a few invertebrate putative chemokine ligands (Figure 1A), however, these sequences lacked protein signatures associated with the canonical ligands (Figures S3-5 and File S3), and they were therefore excluded from further analysis (see Supplementary Results for further information).The phylogenetic tree for the canonical ligands identifies two major groups, the CC-type, which also includes the XC-and X3C-types, and the CXC-type (TBE = 0.95, UFB = 92%) (Figure 1C and Figures S8,9), confirming the previous finding obtained using synteny data (49,50).Next, to clarify the distribution of canonical chemokines, we first reconciled their gene tree with the species tree and then used the reconciled tree to trace the presence/absence of each chemokine group throughout all the species (Figure 2A and Figure S18).Our results confirm previous findings that canonical chemokines are uniquely present in vertebrates (30,49).Additionally, they indicate that chemokines are not evenly distributed across vertebrates and can be different even between closely related species (51).Some are very ancient, e.g., CXCL12 is present in lamprey; CXCL14 and CCL20 are present in all jawed vertebrates; and CXCL8 is present throughout bony fishes and tetrapods, with few exceptions, notably mice and rats.However, a large part of the chemokine diversity evolved within mammals (e.g., CXCL1/2/3, CXCL16, and CCL25), particularly placentals (e.g., CXCL5/6 and CCL3/18).The phylogenetic relationships we uncovered in our reconciled tree were mostly compatible with known syntenic relationships as described in human (7).For example, the large cluster of CXC-type chemokine genes present in human chromosome 4 contains CXCL1-11 plus CXCL13 (7), all of which coalesce within a monophyletic group in our tree (Figure 2A).The microsynteny within this cluster is also, to some extent, reflected in the phylogenetic relationships.
Similarly, the other large syntenic cluster of chemokines, located on human chromosome 17, containing most of the CC-type chemokines (7), corresponds, with few exceptions, to a large monophyletic clade in our tree (Figure 2A).CXCL16 which is on a nearby locus of chromosome 17, is also phylogenetically related to this CC-type clade (Figure 2A).The complement of the canonical chemokines undergoes the largest expansion at the base of jawed vertebrates, where there is an expansion from 4 to 18 genes (Figure 2B).A second expansion occurred at the base of bony fishes (i.e., Osteichthyes) followed by relative stability until placental mammals, where the total number of canonical chemokine ligands jumped to 45 genes.Finally, unlike previous works (52) our results support the presence of orthologs of both CC-type and CXC-type in the common ancestor of all vertebrates (Figure 2A).
Differently from the canonical chemokines, we identified a bona fide TAFA, i.e., with specific protein motifs, in the urochordates, the sister group to vertebrates (see Supplementary Results and Figures S6-7).The phylogenetic trees (Supplementary Figures S10,11) identified monophyletic groups for TAFA5 (TBE=0.98,UFB=98%), TAFA1 (TBE=0.94,UFB=98%), TAFA4 (TBE=0.77,UFB=75%) and TAFA2/3 (TBE=0.65,UFB=84%).The reconciled tree from GeneRax places the root at the urochordate sequence (Figure S19), therefore clarifying that the TAFA5 clade is the sister group to TAFA1-4 (Figure 2A).The family originated in the ancestor of urochordates and vertebrates, and the first duplications occurred at the base of vertebrates giving rise to the TAFA5 split followed by the TAFA1 split.Subsequently, at the base of jawed vertebrates, additional duplications bring the complements from 3 to 10 (Figure 2B), giving rise to the remaining groups so that all jawed vertebrates possess the full diversity of TAFAs.
The phylogenetic trees for CYTL and CXCL17 mainly reflect the species trees (Figures S12-15), and the reconciliations revealed very simple complement dynamics (Figure 2B and Figures S20,21).
However, these molecules show a remarkable difference in their distribution.CYTLs are present throughout gnathostomes, while CXCL17 is found only in placental mammals (Figure 2A).
The phylogenetic analysis for the CKLF super family (Figure 1D and Figures S16,17) recovered a monophyletic clade for the CKLF I group (TBE=0.96,UFB=80%) that we had already identified through CLANS.This group contains CKLF, which is known to interact with C-C chemokine receptor 4 (10,11), as well as CMTM1, 2, 3, 5, and proteolipid protein 2 (PLP2).Other monophyletic clades that are consistent with the CLANS are CMTM4/6 (TBE=0.90,UFB=61%), CMTM7 (TBE=0.92,UFB=83%) and a clade containing CMTM8 plus other related molecules such as plasmolipin (PLLP) and myelin and lymphocyte proteins (MAL) (TBE=0.89,UFB=60%).The latter were all part of a large cluster that we called CKLF II in the CLANS (Figure 1B).However, the placement of the root of the tree in Figure 1D can affect the interpretation of the relationships among CKLF II subgroups.To address this problem and clarify the patterns of duplications and the presence/absence of each group throughout animals, we used GeneRax to reconcile the gene with the species tree (see above and Material and Methods for details).Our results suggest (Figure 2 and Figure S22) that most CKLFSF groups, such as CMTM4,6 and 8, originate in the vertebrate stem group from pre-existing CMTM genes and are widely distributed in animals.The CKLF I subgroups originate from duplications at the base of jawed vertebrates, except for the split between CKLF and CMTM1 that occurs only within mammals (Figure 2A).We observe the major two expansions of the CKLFSF genes in the stem group of vertebrates (from 6 to 10 complements), and then in jawed vertebrates (from 10 to 16 complements).Interestingly, the extent of these expansions are less drastic than those we see for canonical chemokines (Figure 2B).In total, we have identified that the five distinct ligand groups have a different origin in the animal tree of life and underwent divergent evolutionary histories.
Canonical and non-canonical chemokine receptors are divided into four groups.
Next, we investigated the origin and pattern of duplication for the chemokine receptors and chemokine-like receptors (Table 1).Using BLASTP against the 64 species, we identified 7,157 putative chemokine receptors (see Materials and Methods for more details) and investigated their relationships using CLANS (see above for justification).The result (Figure S23C) identifies four main groups of chemokine receptors and chemokine-like receptors.The first comprises canonical receptors (i.e., CCR, CXCR, CX3CR1, CX3C, and XCR1), and the second includes atypical receptor 3 and GPR182, which has been recently shown to have chemokine receptor activity (53).The third group, which we named Chemokine-like plus (CML-plus), contains the chemokine-like receptors (CML1 also known as chemerin receptor 1), formyl peptide receptors (FPR) that bind the TAFA ligands (15,16) and other GPCRs such as GPR1 (chemerin receptor 2), GPR33, PTGDR2.Furthermore, the CLANS analysis identifies an intermediate group containing angiotensin, apelin and other receptors and shows sequence similarity to canonical and chemokine-like receptors (Figure S23B).
Finally, our analysis identifies a small cluster composed of only ACKR1 that do not connect to other GPCRs or other atypical receptors even at loose p-value thresholds.This indicates that their sequence is either non-homologous or highly divergent from other chemokine receptors and atypical receptors.Overall, these groups are robust to the stringency threshold used (i.e., different p-values) (Figure S23).Interestingly, no specific cluster of viral or viral-like receptors was identified, but 6 of the reference viral receptor sequences clustered with the canonical chemokine receptors.
Altogether, these results confirm the homology between the canonical receptors and atypical receptor 3/GPR182.However, these findings indicate that the other GPCRs, such as the chemokine-like receptors, formyl peptide receptors, GPR1, and GPR33, are also closely related to the canonical receptors.Remarkably, these results also indicate that ACKR1 is not homologous to the canonical chemokine receptors.Furthermore, all clusters of chemokine receptors contained only vertebrate sequences, except for the receptors of viral origin.
Canonical and chemokine-like receptors derive from single gene duplication in the ancestor of vertebrates.
Previous studies suggested that the chemokine receptors evolved from a duplication of angiotensin receptors (54) or adrenomedullin receptors (30,55).However, these works were based on error-prone phylogenetic methods such as Neighbour Joining (55).Our CLANS results indicate that chemokine receptors and chemokine-like receptors have only been observed in vertebrates.
Therefore, we need to focus on invertebrate genomes to clarify the chemokine receptor's outgroup.
To clarify this, we lowered the p-value thresholds of CLANS (to p-value < 1e-

50
) and collected a combined dataset including all chemokine receptor sequences and outgroups (i.e., sequences that connect to the chemokine receptor cluster), resulting in 3,026 sequences.We then performed a phylogenetic tree on this dataset using maximum likelihood methods with UFB and TBE for evaluating nodal support (see above and Materials and Methods for details).
Our analysis collected GPR35 and placed it within this large vertebrate specific clade indicating that is it also a vertebrate specific gene but not phylogenetically a 'canonical' chemokine receptor.The closest outgroup to this clade is composed of a few sequences from urochordates, the sister group of vertebrates (UFB=49 TBE=0.91)(Figure 3, S24 and S25).Interestingly, as the sister group of this clade, we identify a group composed of Relaxin receptors, which contain sequences from both urochordates and vertebrates (UBF=53 TBE=0.95).Finally, as the sister group of these large clades, we identified a clade of cephalochordate-specific sequences (UBF=44).
To clarify the duplication pattern and origin of the chemokine receptors, we used GeneRax (47) (see Materials and Methods).Our results indicate (Figure 4A, S26) that all chemokine receptors (except ACKR1) originated from a duplication in the stem lineage of vertebrates.This duplication of an unknown GPCR gave rise to the CML-plus, the canonical chemokine receptors and atypical 3/GPR182 groups as well as the intermediate group and other GPCRs (Figure 4A, S26).This result is consistent with the distribution of the paralogous Relaxin receptors which are present both in urochordates and vertebrates and the position of the orphan urochordate sequences as sister group of canonical chemokine receptors, CML-plus group and atypical3/GPR182 and other GPCRs (see above).Furthermore, the phylogenetic relationships amongst canonical chemokine receptors are overall consistent with the syntenic gene patterns known in human (7).The largest cluster of chemokine receptor genes spans 3 closely located loci on human chromosome 3 (7).It includes most CCRs, XCR, and CX3CR and corresponds to one of the two major monophyletic clades in our tree (Figure 4A).Another example is the mini cluster of CXCR1 and CXCR2, located on human chromosome 2 (7), which we also found to form a monophyletic clade (Figure 4A).We used the reconciliation to better understand the repertoires of receptors present at key nodes during vertebrate evolution.Our results (Figure 4B) show a substantial difference in the duplication pattern of different receptor families.For example, the complement of the atypical3/GPR182 remains constant throughout vertebrate evolution while the canonical and chemokine-like receptor groups expanded dramatically.The canonical chemokine receptors expanded from 5 to 20 genes and the CML-plus from 1 to 11 in the ancestor of the jawed vertebrates (Figure 4B).The expansion of the canonical CKRs is also not evenly distributed across its subgroups, with the ancestral CC type receptors undergoing a series of duplications in jawed vertebrates while the CXCR paralogs did not, specifically one (CXCR4) remains in single copy across all vertebrates.We inferred that in the stem lineage of vertebrates, five canonical chemokine receptor paralogs had already diverged, representing the two major types of receptors (2 CCR and 3 CXCR paralogs).Also, present in the stem lineage of vertebrates were ACKR3 and GPR182 as well as a single copy gene which would later diverge to produce all the CML-plus clade.

DISCUSSION
This work substantially clarifies the evolutionary assembly of the chemokine system.Our analysis shows that contrary to the receptors which evolved from a single duplication event in the vertebrate stem group, several unrelated molecules acquired the ability to interact with chemokine receptors over the course of evolutionary history.Furthermore, our results (summarized in Figure 5) suggest that the key components of the chemokine system, including the chemokine receptors themselves, evolved in the stem group of vertebrates in the Cambrian around 500 million years ago and then underwent substantial diversification in the stem group of jawed vertebrates.These findings shed new light on the complex evolutionary history of the chemokine system.
Unrelated molecules converged to chemokine function.
Based on the presence of shared protein motifs, TAFA "chemokines" (13,14), CXCL17 (59,60) and CYTL (19) have been proposed to be homologous to chemokine ligands.However, our findings strongly suggest that these molecules are not homologous (Figure 1) and likely acquired the ability to activate a chemokine-like response through convergent evolution.Our conclusions differ from those previous studies (13,19,59,60) due to the differences in data completeness and methodological approach.Specifically, we used a complete set of canonical and non-canonical ligands and assessed the homology using overall sequence similarity rather than single motifs.Our results support and expand upon the findings of ( 29), which suggested that the presence of a CXC or CC motif is necessary but not sufficient for a protein to be defined as a chemokine ligand.Similarly, CKLF has been considered a "new member" of the chemokine family based on its function (12) we argue that classification based solely on function is insufficient and can be misleading.Instead, we recommend considering the evolutionary relationships among these molecules as the primary criterion for classification.
Most of the canonical and non-canonical ligands are vertebrate innovations.
Our results clarify the distribution of canonical chemokine ligands in animals (Figure 2) and confirm that they are present only in vertebrates (30).We identify orthologs of CXCL and CCL ligands in both extant lineages of cyclostomes (Figure 2A).While chemokines have already been described in lamprey (52,61,62), it is the first time, to our knowledge, that they are described also in hagfish.
Our findings also indicate that both CC and CXC types were present in the common ancestor of all vertebrates and that few ancestral genes gave rise to the entire diversity of ligands that we know in current animals.Furthermore, our results indicate that many chemokines, such as CXCL1-7, CXCL16, as well as CCL25, CCL11/13, and CCL2/7, are uniquely present in mammals, suggesting that the mammal ligand repertoire is substantially more complex than the one observed in other vertebrates.
Regarding non-canonical chemokine-like families, our findings indicate that the TAFA family originated in the ancestor of vertebrates and urochordates; CYTL is a novelty of jawed vertebrates; and CXCL17 is mammal-specific and likely unrelated to canonical chemokines (similar to its controversial putative receptor, GPR35 (41)(42)(43), that is not a canonical chemokine receptor).The CKLF superfamily has a more complex pattern with the presence of few groups in invertebrates and then great expansions occurring at the base of vertebrates.The CKLFSF includes a monophyletic clade (CKLF group I) comprising the original CKLF that binds CCR4, as well as CMTM1,2,3,5, derived from duplications at the jawed vertebrates stem group.Interestingly, our analysis also revealed that additional molecules not previously considered part of the CKLF super family are closely related to classic members and should be included in it.For example, proteolipid protein 2 (PLP2) belongs to the CKLF I group and is, therefore more closely related to the CKLF with chemokine function than several other CKLFSF members.Similarly, CMTM8 is more closely related to plasmolipin (PLLP) and myelin and lymphocyte protein (MAL) than to any of the classic CKLFSF members.Although this relationship had been proposed based only on sequence similarity (34), our phylogenetic analysis provides additional evidence for it.Therefore, the potential chemokine function of all these additional members should be explored in vitro and in vivo in both vertebrates and invertebrates.
Most receptors derive from a single gene duplication.
Our results clarify the distribution of canonical chemokine receptors in vertebrates (Figure 4), and their evolutionary relationships and identify the pattern of duplication that leads to their origin (Figure 4A, S26).Unlike previous works (63), we identify that atypical receptors do not form a monophyletic group.Specifically, atypical 2 and 4 are part of the canonical clade specifically related to CC-type receptor subclades.Furthermore, we find that the atypical 3 receptors are related to GPR182, supporting previous functional data suggesting that the latter are atypical chemokine receptors binding CXCL10, 12, and 13 (53).We attribute these differences to our use of wider GPCR sampling and improved methods for phylogenetic inference.
Remarkably, our results do not identify ACKR1 as related to the main chemokine receptors but rather as a divergent clade (Figure S23).To our knowledge, this is the first time this observation has been made.Our current results do not allow us to clarify the evolutionary origin of ACKR1.
However, the presence of 7TMD domains suggests that they are GPCRs that independently acquired the ability to bind chemokines.Alternatively, similarly to other genes evolved in the immune system, ACKR1 may have been subjected to strong selective pressures that substantially changed their sequence, obscuring their phylogenetic relationships.The case of ACKR1 being the most distantly related receptor is intriguing as it is one of the most promiscuous chemokine receptors (2,64) and it has been shown to bind both CC and CXC chemokines (65,66).
Viral chemokine receptors represent a cryptic group that can bind multiple chemokines (22,23).Despite their functional similarity to canonical chemokine receptors, viral chemokine receptors' evolutionary origin and distribution remain poorly understood.Our results indicate that viral GPCRs do not form a monophyletic group, suggesting that the ability to encode chemokine-like receptors has evolved independently in multiple viruses, including cytomegaloviruses and poxviruses.The placement of viral sourced sequences within an otherwise vertebrate-specific clade supports the hypothesis that viruses acquired these genes through non-vertical inheritance.Given the paraphyly of viral receptors, this appears to have occurred multiple times.However, there are significant uncertainties and further work is needed to untangle details of viral chemokine receptors' evolution.
Our analysis reveals that the clade comprising apelin receptors, angiotensin receptors, bradykinin receptors, and orphan GPCRs (shown in Figure 3, 4 and S24-26) is closely related to chemokine receptors.This finding partially supports previous studies (54) that suggested a gene duplication event gave rise to both chemokine receptors and angiotensin receptors.Interestingly, we found that single gene duplication in the vertebrate stem group led to the emergence of canonical receptors and atypical 2,3,4, GPR182, chemokine-like receptors, formyl peptide receptors, the intermediate group, and many other known and orphan GPCRs including the controversial putative CXCL17 receptor GPR35.These findings suggest that two rounds of genome duplication (67, 68) played a role in the expansion of GPCR gene families.Future research will focus on investigating the functions of the orphan genes and many-to-one orthologs discovered in urochordates.This will provide further insight into the evolution and diversification of GPCR families in vertebrates.

The molecular assembly of the chemokine system
In this work, we explored the evolution of both ligand and receptor components of the chemokine signaling system, including non-canonical molecules with either chemokine-like function or sequence similarity.Our analysis suggests that the canonical chemokine signaling evolved in the vertebrate stem group (about 500 Mya) likely due to the two rounds of genome duplication that gave rise to many vertebrate novelties (67,68).We found that the ancestral vertebrate repertoire included orthologs of both major ligand groups (CXCL and CCL) and both CCR and CXCR receptors and non-canonical components such as TAFA and CKLFSF ligands, and the receptors Atypical 3 and GPR182 (Figure 5).The distribution of ligands and receptors in the ancestor of all vertebrates seems to confirm the hypothesis that the ancestral function of chemokines was homeostatic (e.g., CXCL12, CXCL14) with inflammatory functions arising from recent duplications (e.g., CXCL5, CXCL6), potentially reflecting a rapid evolution induced by the selective pressure of new pathogens (7).
Chemokine ligand and receptor genes are known to cluster on specific chromosomes (7), consistent with the hypothesis that they may be the result of the combination of en bloc duplication followed by tandem duplications (30,49,50).Due to limited high-quality genomes, syntenic patterns of chemokine genes described so far are based primarily on humans and a handful of other species (30,49,50), hampering our understanding of the level of conservation of these syntenic patterns.
Conversely, our large-scale phylogenetic analyses encompassed many species.We uncovered several phylogenetic relationships that are consistent with known syntenic patterns in human, providing stronger evidence for their evolutionary relationship.
The evolutionary history of canonical components includes several examples of known ligand-receptor pairs following a corresponding pattern of origin and temporal dynamics of duplications.This is true for example, for the ancient homeostatic CXCL12 ligand and its corresponding receptors CXCR4 and ACKR3, that all originated in early vertebrates (7).The early origin and conservation of CXCR4 and CXCL12 in the ancestor of vertebrates is interesting as this pair plays a key role in the migration of neural crest cells (69) -a key vertebrate innovation (70).This combined with the fact that homeostatic chemokine ligands/receptors tend to be restricted to monogamous pairing (2, 51) suggests that homeostatic chemokine pairings are more ancient and conserved being in single copy throughout much of the vertebrates.Contrastingly, inflammatory chemokine pairings are more promiscuous, and this could be linked to the more recent duplications in the genes, such as for CCL2/7/8/11/13 (Figure 2A) and their receptors CCR1/2/3/4/5 (Figure 4A).
For many of the non-canonical components, however, the ligand-receptor interactions are largely unclear, and their pattern throughout vertebrate evolution remains to be explored.Overall, our results indicate that three waves of molecular innovation in the vertebrates, jawed vertebrates, bony fishes and mammal stem group increased the chemokine system's molecular complexity (Figure 5), allowing for the fine-tuning present in modern-day animals.

MATERIALS AND METHODS
Data Mining and Dataset Assembly.We collected 64 proteomes from 25 vertebrates, six chordates and 33 other animals covering the whole animal tree (Table S1).BUSCO v4.0.6 (71,72) and the metazoa_odb10 set of 954 genes were used to evaluate their completeness (Table S1).
To identify potential homologs of canonical chemokines, TAFA chemokines and CYTL1, we used 207 curated sequences that we obtained from SwissProt (73,74) as seeds for an initial BLASTP (36,38) with e-value < e -10 . To identify putative chemokines in cyclostomes, the lamprey Petromyzon marinus (75) and the hagfish Eptatretus burger (76), we loosened the e-value to 0.05.Where putative chemokine sequences were found for one cyclostome species but not the other, those sequences were used to search again the other species.Furthermore, to investigate the presence of ligands outside vertebrates, we performed an additional BLASTP on invertebrate proteomes with an even looser e-value (0. for details).
To identify homologs for the CKLF superfamily, we used 21 SwissProt-reviewed sequences.
In addition to the BLASTP search, we used a position-specific iterative BLAST (PSI-BLAST) (37) with an e-value threshold of <e -10 . Using this approach, we identified a total of 590 putative homologs, including 186 from invertebrates.
We used BLASTP using 178 manually annotated receptor sequences from SwissProt as query sequences for the chemokine receptors.This includes all human canonical and atypical chemokine receptors (79).We also collected 8 viral sequences with chemokine receptor activity from UniProt (80) and performed a second BLASTP.We extracted all BLAST hits with e-values < e- Additionally, we annotated each cluster using gene annotations for key species Homo, Mus, Gorilla, Gallus, Anolis and Danio.In the case of the receptors, to improve the cluster annotation all human Class-A GPCRs (excluding olfactory receptors) from GPCRdb (84) were added to the dataset as well as the 8 seed viral chemokine receptors from UniProt (80).

Alignment and phylogenetic analysis
Alignment.All ligand and receptor sequences were aligned using MAFFT (85,86) with the --auto setting and using trimAl (87) to remove positions with >70% gaps.
Gene Trees.All gene alignments were analysed using IQTREE2 (88), the model test algorithm (89) was used to select the best substitution model for each analysis.Best models selected by IQTREE2 for each set are listed in Table S2 (for receptors we manually selected GTR20+F+G4 as the model as it was a large dataset).Nodal support was estimated using 1,000 ultrafast bootstrap (44,45) replicates.All analyses were repeated to run 100 non-parametric bootstrap repeats to calculate nodal support with transferable bootstrap expectation: which is specifically designed to account for phylogenetic instability (46).
For the receptors, due to the high computational burden of running TBE analyses on sequence-dense datasets, we first analysed the full set of 3,026 sequences connected in CLANS at a p-value of < 1 e -50 using UFB (Figure S25).Then, we extracted the chordate-specific clade sequences, including all chemokine receptor groups and their immediate outgroups, to analyse using TBE.
Gene tree species tree reconciliation.To understand the pattern of duplication and the evolution of gene complement we used GeneRax (47).GeneRax requires a gene tree that was obtained as described above and a species tree that we constructed manually using publicly available information.In the instances where the genes tree contained polytomies, we used ETE3 (90)       A) Presence of all receptor groups are mapped onto a species tree.Gene trees and duplication events are based on the gene tree to species tree reconciliation analyses.the TAFA cluster in the CLANS analysis.One of these sequences when blasted versus SwissProt returned a TAFA as hit.This sequence was also annotated as TAFA by InterProScan.Alignment of the urochordate sequences with vertebrate TAFA sequences reveals that only the one annotated as TAFA aligns well.While the other 3 align poorly and are also significantly longer than any of the other sequences.Further details about these sequences can be found in supplementary file S3 and in Supplementary Results.           Figure S18.Rooted species tree reconciled gene tree for canonical chemokines.The canonical chemokines gene tree was reconciled with the species tree using GeneRax."S" or "D" at the node indicates a speciation or duplication event respectively.CCL clade is in orange, CXCL clade in blue.
Figure S19.Rooted species tree reconciled gene tree for TAFA.The TAFA gene tree was reconciled with the species tree using GeneRax."S" or "D" at the node indicates a speciation or duplication event respectively.
Figure S20.Rooted species tree reconciled gene tree for CYTL.The CYTL gene tree was reconciled with the species tree using GeneRax."S" or "D" at the node indicates a speciation or duplication event respectively.
Figure S21.Rooted species tree reconciled gene tree for CXCL17.The CXCL17 gene tree was reconciled with the species tree using GeneRax."S" or "D" at the node indicates a speciation or duplication event respectively.
kept.Hit sequences were annotated by their top 5 BLAST hits against SwissProt.All hits from both BLASTs were merged and filtered by cd-hit(82,83) to remove redundant sequences at the 95% similarity threshold.This resulted in 7,157 putative chemokine GPCR sequences.Identification of subgroups with Cluster Analysis of Sequences (CLANS).We utilised CLANS(39,40) with default parameters and different p-values (i.e., stringency values) to visualize the relationships between subgroups of ligands and receptors.We assessed the similarity and interrelationships between different clusters by gradually relaxing the p-value threshold (Figures S1, S2 and S23).

Figure 1 : 29 (
Figure 1: Cluster Analysis and Phylogeny of Ligand groups.A) Similarity-based clustering, using

Figure 2 :
Figure 2: Distribution and duplication patterns of ligand groups.A) Presence of all ligand groups are

Figure 4 :
Figure 4: Distribution and duplication patterns of receptor groups.

Figure 5 :
Figure 5: Summary of the evolution of ligands and receptors.

Figure S7 .
Figure S7.Alignment of best candidate urochordate TAFA sequence with vertebrate TAFAs.Of the

Figure S8 .
Figure S8.Unrooted phylogenetic tree of canonical chemokines with TBE supports.Phylogenetic

Figure S10 .
Figure S10.Unrooted phylogenetic tree of TAFA with TBE supports.Phylogenetic tree under the

Figure S11 .
Figure S11.Unrooted phylogenetic tree of TAFA with UFB supports.Phylogenetic tree under the

Figure S12 .
Figure S12.Unrooted phylogenetic tree of CYTL with TBE supports.Phylogenetic tree under the

Figure S13 .
Figure S13.Unrooted phylogenetic tree of CYTL with UFB supports.Phylogenetic tree under the

Figure S14 .
Figure S14.Unrooted phylogenetic tree of CXCL17 with TBE supports.Phylogenetic tree under the

Figure S15 .
Figure S15.Unrooted phylogenetic tree of CXCL17 with UFB supports.Phylogenetic tree under the

Figure S16 .
Figure S16.Unrooted phylogenetic tree of CKLFSF with TBE supports.Phylogenetic tree under the

Figure S17 .
Figure S17.Unrooted phylogenetic tree of CKLFSF with UFB supports.Phylogenetic tree under the

Figure S22 .
Figure S22.Rooted species tree reconciled gene tree for CKLFSF.The CKLFSF gene tree was

Figure S23 .
Figure S23.CLANS clustering of receptors and related molecules sequences.A CLANS clustering

Figure S24 .
Figure S24.Unrooted phylogenetic tree of receptors with TBE supports.Phylogenetic tree of

Figure S25 .
Figure S25.Unrooted phylogenetic tree of receptors with UFB supports.Phylogenetic tree of all Fig.5

Table 1 :
Summary table of all the canonical and non-canonical chemokine components analyzed in this study.
to solve them.The undated DL mode and the closest approximation of the best-fitting substitution model were used for each alignment.To track the evolution of sub-lineages within each group, we used annotated sequences of key species (e.g., Homo sapiens and Mus musculus) as reference.For the (7) nomenclature for genes is primarily based on human chemokines.The canonical chemokines had 5 paralogs present in the vertebrate common ancestor.These undergo a heterogeneous pattern of duplication throughout vertebrates with different paralogs duplicating different number of times and in different groups of species.Chemokines that have been classically described as having either homeostatic or inflammatory function are indicated with a circle or a star respectively.The classification used here was based on Zlotnik and Yoshie 2012(7).B) Number of complements for each receptor group at key species nodes are mapped onto the species tree.The number of complements in each group reflects the pattern of duplications.The chemokine groups diverged in the vertebrate stem group.The major expansion occurred at the level of jawed vertebrates with canonical chemokine receptors, the chemokine-like receptor plus group and intermediate groups increasing in copy number.Canonical chemokine underwent another small subsequent increase within placentals.Silhouette images are by Andreas Hejnol (Xenopus laevis); Andy Wilson (Anas