Thrombospondin module 1 domain (TSP1) of the matricellular protein CCN3 shows an atypical disulfide pattern and incomplete CWR layers

Members of the CCN (Cyr61/CTGF/Nov) family are a group of matricellular regulatory proteins, essential to a wide range of functional pathways in cell signalling. Through interacting with extracellular matrix components and growth factors via one of its four domains, the CCN proteins are involved in critical biological processes such as angiogenesis, cell proliferation, bone development, fibrogenesis, and tumorigenesis. We present here the crystal structure of the thrombospondin module 1 (TSP1) domain of CCN3 (previously known as Nov), which shares a similar three-stranded fold with the thrombospondin type 1 repeats of thrombospondin-1 and Spondin-1, but with variations in the disulfide connectivity. Moreover, the CCN3 TSP1 lacks the typical pi-stacked ladder of charged and aromatic residues on one side of the domain, as seen in other TSP1 domains. Using conservation analysis among orthologous domains, we show that a charged cluster in the centre of the domain is the most conserved site and predict it to be a potential functional epitope for heparan sulphate binding. This variant TSP1 domain has also been used to revise the sequence determinants of TSP1 domains and derive improved Pfam sequence profiles for identification of novel TSP1 domains in more than 10,000 proteins across diverse phyla. Synopsis The first structure of a thrombospondin module 1 domain (TSP1) from a CCN family matricellular protein has been determined by X-ray crystallography. The structure shows a typical three-stranded fold, but with an incomplete pi-stacked structure that is usually found in these domains. The structure reveals highest conservation in the positively charged central segment, which we predict to be a binding site for heparan sulphates. The atypical features of this domain have been used to revise the definition of the TSP1 domains and identify a number of new domains in sequence databases.


Introduction
The CCN proteins are a family of intriguing matricellular proteins, playing regulatory roles in various cellular signalling processes and a range of critical biological functions. There are six members within this protein family in humans, designated CCN1-6 . The CCN acronym was derived from the three prototypical members: Cyr61 (cysteine-rich protein 61)/CCN1, CTGF (connective tissue growth factor)/CCN2, and Nov (Nephroblastoma overexpressed gene)/CCN3 (Bork, 1993). The Wnt-inducible proteins, Wisp1-3, are later defined as the three remaining members (CCN4-6) of the family (Pennica et al., 1998).
However, the molecular mechanism underlying the functions and regulations of the CCN proteins are poorly understood, mostly due to the large number of ligands that have been reported to interact with the CCN proteins and the variety of signalling pathways they are involved in. While CCN proteins have been shown to activate specific signalling pathways, direct receptors for these proteins have not yet been identified. There is also an increasing amount of evidence that CCN proteins, sometimes referred to as growth factors, affect several signalling pathways via direct interactions with cytokines and extracellular matrix components. For instance, CCN2 downregulates bone morphogenetic protein (BMP)-2 and -4-mediated Smad1/5/8 phosphorylation and the activation of mitogen-activated protein kinase (MAPK) pathways, thereby inhibiting embryogenesis and chondrocyte proliferation (Abreu et al., 2002;Maeda et al., 2009). Signalling by other growth factors, such as the transforming growth factor β (TGFβ), fibroblast growth factor (FGF), vascular endothelial growth factor (VEGF), and platelet-derived growth factor (PDGF) have also been shown to be affected by the CCN proteins (Abreu et al., 2002;Inoki et al., 2002;van Roeyen et al., 2008;Nishida et al., 2011;Aoyama et al., 2012).
Furthermore, the modulation of the signal transduction by the CCN proteins can also be achieved by association with extracellular matrix components. These include sulphated proteoglycans, fibronectin, decorin, low-density lipoprotein (LDL) receptor-related protein (LRP), Notch, and integrins (Chen et al., 2000;Yoshida & Munakata, 2007;Vial et al., 2011;Gao & Brigstock, 2003;Sakamoto et al., 2002;Tan et al., 2009). Despite this wealth of information, the molecular determinants of the interactions of CCN proteins with other molecules are still to be elucidated.

3
Another feature contributing to the complexity of their molecular functions is that, like many other extracellular proteins, the CCN proteins are a mosaic of structurally distinct domains. Four discrete cysteine-rich domains, an insulin-like growth factor binding domain (IB), a von Willebrand factor C domain (vWC), a Thrombospondin type 1 repeat (TSP1), and a C-terminal cystine-knot domain (CTCK), make up the primary structure of the CCN proteins (Bork, 1993). A short, variable hinge region separates the N-terminal IB and vWC domains from the C-terminal TSP1 and CTCK domains.
The CCN family members are highly conserved in their primary structure, with 31-50% pairwise sequence identity between the six paralogs, except for the absence of the CTCK domain in CCN5 (Brigstock, 2003). The 38 cysteine residues are spread out across the four domains and are nearly invariant, with the exception of CCN6 that lacks four cysteines in its vWC domain. Small-angle X-ray scattering (SAXS) analysis has provided us with the first glimpse of the structural arrangements of the four domains, which shows an extended, non-globular fold, with flexibility between the domains predicted to facilitate simultaneous ligand binding (Holbourn et al., 2011). A number of studies have shown that this hinge is prone to proteolysis and the resulting fragments of CCN proteins have been identified in various tissues (Yang et al., 1998;Perbal et al., 1999;Christine P. Burren et al., 1999;Su et al., 2001;Roestenberg et al., 2004). A recent work shows that cleavage of CCN2 in this hinge is required for CCN2-mediated activation of Akt and the ERK pathway, suggesting that the full-length CCN proteins are latent pro-forms (Kaasbøll et al., 2018).
Despite the wealth of data on CCN proteins' role in signalling, very little is known of their structure and interactions at molecular level. We have recently published the structure of the vWC domain of CCN3 (Xu et al., 2017), but no other high-resolution structures are known for CCN proteins or their domains. Several structures of IB domains have been determined (PDB IDs: 1h59, 1wqj, 2dsp, 3tjq, 3zxb) (Zesławski et al., 2001;Siwanowicz et al., 2005;Sitar et al., 2006;Eigenbrot et al., 2012;Trachsel et al., 2012) as well as one structure of CTCK domain (Zhou & Springer, 2014), but given their low sequence similarity with CCN proteins, relatively little functional predictions can be derived from these structures.
Here, we present the first crystal structure of the TSP1 domain from CCN3. TSP1 domains, also previously known as thrombospondin type 1 repeats (TSRs), were initially identified in the human endothelial cell thrombospondin-1 (Lawler & Hynes, 1986) and have turned out to be one of the most common motifs in extracellular proteins with close to 402 domains in 97 different human proteins residues, and is characterised by a well conserved pattern of residues containing six cysteines (two of which are variablefurther details below in Results), two arginines, and two tryptophans.
It has been reported that TSP1 domains inhibit angiogenesis through interactions with α3β1 and αvβ3 integrins, CD36 on the endothelial cell surface, as well as sequestering VEGF away from its receptors 4 (Dawson et al., 1997;Bornstein & Sage, 2002;Inoki et al., 2002). TSP1 domains also typically bear the glycosaminoglycan (GAG) binding sites, used for mediation of GAG-dependent cell adhesion (Clezardin et al., 1997). Neural guidance receptor Unc5 controls the Latrophilin GPCR-FLRT mediated cell adhesion, where its TSP1 domain is responsible for the octameric complex formation (Jackson et al., 2016). In the CCN proteins, the TSP1 domain of CCN2 has been found to display the most promising regenerative effect on chondrocytes and osteoarthritis, compared to the other individual domains and the full-length CCN2 (Abd El Kader et al., 2014). By solving the first structure of a TSP1 domain in the CCN family, we provide the first insights into the possible molecular functions of the CCN proteins.

Cloning and expression of CCN3 TSP1
The expression construct of rat CCN3 TSP1 domain (residues 195-249, Uniprot: Q9QZQ5) was amplified by PCR using overlapping oligonucleotides (forward -TATATCCATGGATTCTAGTA TCAACTGCATTGAGCAG, reverse -TATATAAGCTTATTCCCCAGGCTCTTGCTCACAAGG) from cDNA (kind gift from Dr. Paul Kemp) and cloned into pHAT4 vector (Peränen et al., 1996), which contains an N-terminal His6-tag followed by a TEV protease cleavage site. For protein expression, the construct was transformed into BL21(DE3) E. coli competent cells and grown on LB-agar plates containing 100 µg/ml of ampicillin overnight. Resulting colonies were cultured in 2-YT medium with 100 µg/ml ampicillin at 37°C under agitation, until cells reached OD600nm of 0.8-1.0. Protein expression was induced by 400 µM IPTG for 4h at 37°C. Cells were pelleted by centrifugation, resuspended in ddH2O, and stored at -20°C.

Protein refolding and purification
The CCN3 TSP1 domain was expressed insolubly in inclusion bodies and subsequently subjected to refolding to regain its native conformation. Harvested cells were first lysed using the Emulsiflex C5 homogeniser in lysis buffer (50 mM Tris-HCl pH 8.0, 2 mM EDTA, 10 mM DTT) with addition of 0.5% (v/v) Ralufon DM detergent. The lysate was incubated with 10 µg/ml DNase I and 4 mM MgCl2 for 20 min at room temperature. Inclusion bodies were separated upon centrifugation and washed twice by homogenisation in the lysis buffer containing either 0.5 % Ralufon DM or 1 M NaCl, and finally once with lysis buffer only. Denaturation was achieved by resuspension in 6 M guanidine hydrochloride, 50 mM Tris-HCl pH 8.0, 5 mM EDTA, and 25 mM Tris(2-carboxyethyl) phosphine.
The denatured protein was clarified by centrifugation, buffer exchanged to 6 M urea, 20 mM HCl, and adjusted to 1 mg/ml. Refolding was performed by 1:10 rapid dilution into 100 mM Tris-HCl pH 8.5, 100 mM ethanolamine pH 8.5, 1 M pyridinium propyl sulfobetaine, 2 mM cysteine, 0.2 mM cystine, and left for 7 days at 4°C. Refolded protein was purified first by Ni-NTA affinity chromatography, 5 followed by cleavage of the His-tag by TEV protease, and finally by reversed phase chromatography (ACE ® 5 C8-300). Purified protein was lyophilised and resuspended in ddH2O. MALDI mass spectrometry analysis was used to confirm the molecular weight and the formation of disulfide bonds.

Crystallisation
Purified CCN3 TSP1 domain at a concentration of 17.6 mg/ml was subjected to crystallisation experiments in 96-well plates in sitting drops consisting of 100 nl of protein and 100 nl of crystallisation solution using a number of commercial crystallisation screens. Initial crystal hits were improved by streak seeding using a rabbit whisker, and larger crystals subsequently appeared in 3.0 M NaCl, 0.1 M Tris pH 8.0, in 1µl + 1 µl hanging drops. For experimental phasing, derivative crystals were produced by soaking the native crystals in 6.7 mM K2PtCl4 in mother liquor overnight. The crystals were cryoprotected in 25% (v/v) glycerol and cryo-cooled using liquid nitrogen.

Data collection, structure determination and refinement
X-ray diffraction data of CCN3 TSP1 were collected at Europe Synchrotron Radiation Facility (ESRF), beamline ID14-4, using an ADSC Q315r CCD based X-ray detector (ADSC, CA), via remote control from Cambridge, UK. Multi-wavelength anomalous dispersion (MAD) phasing experiment was performed for the K2PtCl4 soaked CCN3 TSP1 crystals. Two datasets were recorded for these derivative crystals, at a wavelength of 1.0717 Å for peak anomalous signals, and at 1.0721 Å for inflection point.
High resolution diffraction data for the native crystals were obtained at wavelength 0.93 Å. Data were indexed and integrated using iMOSFLM 1.0.7 (Battye et al., 2011), and scaled using Aimless 1.1 (Evans & Murshudov, 2013) in the CCP4 suite 6.3.0 (Winn et al., 2011). AutoSHARP 2.8.2 (Vonrhein et al., 2007) was used for MAD phasing. The resulting structure was used as the search model for molecular replacement by Phaser 2.5.1 (McCoy et al., 2007) for the native dataset. Refinement was performed using Refmac 5.5 (Vagin et al., 2004) and phenix.refine (Adams et al., 2010). Coot 0.7 (Emsley & Cowtan, 2004) was used for model building and validation. Statistics of data collection and refinement are shown in Table 1. The coordinates and structure factors have been deposited in the Protein Data Bank with accession code 6RK1.  (Aken et al., 2016), and aligned and scored for position-specific conservation by the ConSurf Server (Ashkenazy et al., 2016).

Sequence Similarity Network construction
A sequence similarity network was constructed using the domain sequences from the four Pfam families of the TSP1 clan (CL0692): TSP_1 (PF00090), TSP1_spondin (PF19028), TSP1_ADAMTS (PF19030) and TSP1_CCN (PF19???). One thousand sequences from each family were randomly selected and used for an all-against-all BLAST search. All pairwise hits with a score above 35 bits were further loaded into Cytoscape (Shannon et al., 2003) to display the network using the default "Perfuse Force Directed Layout" method. In order to highlight the structural coverage of the families in the network, domain sequences in the ECOD database (Cheng et al., 2014) version 248 (23/08/2019) matching any of the TSP1 Pfam family models were included using the same procedure described here.

The crystal structure of CCN3 TSP1 domain
The structure of CCN3 TSP1 domain was determined by MAD phasing using K2PtCl4 derivative crystals (Table 1 and Fig. 1a). The three main heavy atom sites are from three platinum atoms bound to sulfur atoms in methionine residues. One Pt bound to methionine 231 in chain A out of two in the asymmetric unit with an occupancy of 1, and two Pt bound to two split conformations of the same M231 in the other chain with occupancies of 0.6 and 0.4, respectively. Two additional sites with low occupancies of 0.2-0.3 are from Pt bound to cysteines when dissociated from disulfide bonds as a result of radiation damage. The structure from MAD phasing was further refined to a resolution of 1.63 Å using the native dataset (Table 1 and Fig. 1b and c). The two molecules in the asymmetric unit are nearly identical, with an RMSD of 0.345 Å for 44 C atoms.
The CCN3 TSP1 structure exhibits an elongated fold consisting of three antiparallel strands, placing the N-and C-termini at the opposite ends of the domain. This small domain is stabilised by the three disulfide bonds from its six cysteines that are distributed all along the sequence. The top disulfide (when the domain is viewed with its N-terminus pointing up) is formed between C 1 200-C 4 229 (superscripted number refers to the sequential position of the cysteine in the domain) between strands I and III, C 2 210-C 5 238 links strand I to the end of the β-sheet in strand III in the middle of the domain, and third disulfide between C 3 214-C 6 246 links the turn between strands I and II with the very C-terminus of the domain (Fig. 1d).
Strand I (N199-C214) is more irregular and rippled, while strands II (G218-L223) and III (Q232-E237) form a regular anti-parallel β-sheet (Fig. 1d). In addition to secondary structure-defining interactions between strands II and III, the structure is stabilised by hydrogen bonds formed between the irregular strand I and strand II.

Conservation analysis of TSP1 in CCN proteins
The surface of CCN3 TSP1 domain shows now significant cavities, as potential binding sites for ligands. Projection of electrostatic potential on the surface shows a strongly positively charged zone around the centre of the domain. TSP1 domains are known to bind heparan sulphates (HS) (Guo et al., 2006) and this patch could form a part of a HS binding site. In the absence of mutagenesis or other data on functional sites on TSP1 domain, we turned to the analysis of the evolutionary conservation of the domain. We took all available CCN family proteins from Ensembl genome database and aligned these.
This alignment across higher eukaryotes was mapped on to the CCN3 TSP1 structure using ConSurf server by colouring according to conservation scores (Fig. 2). In addition to the almost invariant cysteines, the highest conservation mapped to the central part of the domain, with residues W207, S219, R221, Q234, and R236 showing 100% conservation in all CCN family TSP1 domains. These residues are localised in the positively charged cluster and point to the "front" of the domain, as shown in Figure   1. As they are part of the charged/aromatic spine of the domain, it is impossible to say whether they are conserved for functional or structural reasons, or both.

Similarity and diversity in TSP1 domains
To analyse the structure further, we collected all other TSP1 domain structures from the PDB. Firstly, we used Dali server (http://ekhidna2.biocenter.helsinki.fi/dali/, (Holm & Laakso, 2016)  domains are the so-called 'CWR layers', consisting of the side chain stacking of cystines, tryptophans, and arginines, as described by Tan et al. (2002). In the thrombospondin-1 repeat 2/3, an array of tryptophan and arginine residues form multiple π-cation interactions between the aromatic rings and the planar cationic guanidinium groups; these arginines are often found paired with large polar residues forming side-by-side hydrogen bondings across the strands. Together with the three "C" layers of disulfide bridges, these stacked residues form a stabilising spine for this small domain and appear to provide structural rigidity to it in the absence of a hydrophobic core. The first W layer is in some cases replaced by another hydrophobic residue (Leu or Tyr in Spondin-1), but otherwise the CWR-stacked structure is conserved in Spondin-1 repeat 1/4 (Fig. 3a) and structures of other TSP1 domain containing proteins, thrombospondin-repeat anonymous protein (TRAP), complement component C6 and C8 (Tossavainen et al., 2006;Lovelace et al., 2011;Song et al., 2012;Aleshin et al., 2012). By contrast, in the CCN3 TSP1 domain the three R layers and three C layers are conserved, but only the third W layer is present (Fig. 3). The W layers in all TSP1 domains are located in strand I, with their aromatic side chains extending towards the centre of the structure. The R layers in thrombospondin-1 and Spondin-1 TSP1 domains are all formed between strands II and III. In CCN3 TSP1, the top R layer is formed between strands I and II instead. The consequence of the missing top W layer (and the absence of another hydrophobic residue that could replace it), is that the CCN3 TSP1 domain is more open, with strand I more separate from the rest of the structure compared with domains that pull the N-terminus towards the core of the domain by the W layer interactions (Fig. 3).
Another variable feature among the TSP1 domains is the disulfide bond pattern. The three C layers comprise of one layer at the very top of the structure (when viewed with N-terminus at the top), and two consecutive layers at the bottom, alternated with W and R layers. The bottom two C layers are conserved among all TSP1 domains, formed between strands I and III whereas the top C layer varies in its position and connectivity. In CCNs and Spondins, C 4 is located at the top of strand III and disulfide bonded to C 1 at the very N-terminus of strand I. In thrombospondin, C 4 forms a disulfide with C 3 (which is missing from CCN-like domains) in the middle of the sequence, at the top of strand II. The differences in the disulfide connectivity in the first C layer at the top of the domain and the lack of the first W layer results in larger differences in the structure of CCN3 TSP1 domain compared with other similar domains. The central W layer in CCN3 ensures that the core of the domain aligns well with other TSP1s.
Typical to a large family of disulfide-rich domains, there are always more subtle variations to the connectivity. For example, circumsporozoite protein TSP1 domain (PDB 3vdl and 6b0s) (Doud et al., 2012;Scally et al., 2018) lack the top disulfide and has a long helices containing insertion in loop II-IIII, whereas Micronemal protein 2 (PDB 4okr) (Song & Springer, 2014) contains also a long insertion in loop II-III with an additional pair of disulfide linked cysteines (supplemental Fig. S2).
Overall, the "canonical" TSP1 domains with C 3 -C 4 connectivity are more structurally conserved with very well defined layered structure, whereas domains with C 1 -C 4 connectivity have more variable structures and are difficult to align unambiguously.
This difference in the top C layer can be used to categorise different TSP1 domains in matricellular proteins. Sequence alignment of selected TSP1 domains with alternative disulfide connectivities shows 11 that while CCN and Spondin proteins share the same C 1 -C 4 disulfide pattern, ADAMTS (an extracellular protease), UNC5C (receptor for Netrin), Properdin (a plasma protein), together with thrombospondin, form the alternative group (Fig. 4). However, the functional implication of this structural division of disulfide connectivity is as yet unclear. domains with either C 1 -C 4 (far left) or C 3 -C 4 (far right) connectivity are also shown.

Redefinition of TSP1 domain
The existing Pfam family sequence profile (release 32.0) was entirely built from sequences that had the C 3 -C 4 connectivity. This meant that matches to C 1 -C 4 connectivity domains were partial and thus missing the first critical cysteine residue. To remedy this, two new Pfam families were constructed to represent the two subtypes of C 1 -C 4 connectivity TSP1 domains related to those found in spondins and CCN proteins. These domain families have been deposited in Pfam with accession names TSP1_spondin (PF19028) and TSP1_CCN (PF19???). A new Pfam clan was also built to represent TSP1 domains with accession CL0692 (new TSP1 families and the new clan will be released with version 33.0 of Pfam database). We were further interested to know how common each of these connectivities were across known TSP1 domains. To investigate this, we constructed a Sequence Similarity Network (SSN) of all domains and highlighted the different families. The SSN showed an additional group of sequences that were matching the original TSP1 family but formed their own separated cluster, so a fourth TSP1 Pfam family was built to represent them. This new family (PF19030) was found to correspond to a set of domains from ADAMTS proteins which lack one of the tryptophans, but replace one of the conserved arginines with a hydrophobic residue.
Examples of this domain, for which no structures have been experimentally determined yet, can be 13 seen as the second and third domains in ADAMTS1 in Figure 4. The complete SSN displaying the relationships among domains in these three families is shown in Figure 5, along with consensus sequence logo for each of the three families.
Updating the Pfam domain definitions has several important consequences. Firstly, the overall detection of TSP1 domains has increased by 17% from 57,847 to 69,393 (Fig. 5)

Discussion
Second X-ray crystallographic structure of a domain from the enigmatic CCN family of matricellular proteins has revealed variant form of TSP1 domain with limited pi-cation ladder typical to these proteins. While functional prediction from this structure of a small domain with limited conservation being difficult, the structure has helped with definition of the larger TSP1 domain population more accurately in Pfam domain database and better annotation of these domains in sequence databases.
Methods for the production of CCN proteins and their fragments in high quality will allow us to use them for further analysis of their molecular functions, identification of interaction partners and biophysical characterisation of these interactions in vitro. With significant interest in these proteins as therapeutic targets, in fibrotic conditions in particular, correctly folded proteins will facilitate the development of neutralising antibodies against CCN proteins as well.

Figure S1
Sequence conservation analysis of the CCN proteins in a range of species generated by the ConSurf Server (https://consurf.tau.ac.il) (Ashkenazy et al., 2010).