The thrombospondin module 1 domain of the matricellular protein CCN3 shows an atypical disulfide pattern and incomplete CWR layers

The members of the CCN (Cyr61/CTGF/Nov) family are a group of matricellular regulatory proteins that are essential to a wide range of functional pathways in cell signalling. Through interacting with extracellular matrix components and growth factors via one of their four domains, the CCN proteins are involved in critical biological processes such as angiogenesis, cell proliferation, bone development, fibrogenesis and tumorigenesis. Here, the crystal structure of the thrombospondin module 1 (TSP1) domain of CCN3 (previously known as Nov) is presented, which shares a similar three-stranded fold with the thrombospondin type 1 repeats of thrombospondin-1 and spondin-1, but with variations in the disulfide connectivity. Moreover, the CCN3 TSP1 domain lacks the typical π-stacked ladder of charged and aromatic residues on one side of the domain that is seen in other TSP1 domains. Using conservation analysis among orthologous domains, it is shown that a charged cluster in the centre of the domain is the most conserved site and this cluster is predicted to be a potential functional epitope for heparan sulfate binding. This variant TSP1 domain has also been used to revise the sequence determinants of TSP1 domains and to derive improved Pfam sequence profiles for the identification of novel TSP1 domains in more than 10 000 proteins across diverse phyla.


Introduction
The CCN proteins are a family of intriguing matricellular proteins that play regulatory roles in various cellular signalling processes and a range of critical biological functions. There are six members within this protein family in humans, designated CCN1-CCN6. The acronym CCN was derived from the three prototypical members: Cyr61 (cysteine-rich protein 61)/CCN1, CTGF (connective tissue growth factor)/CCN2 and Nov (nephroblastoma-overexpressed gene)/CCN3 (Bork, 1993). The Wnt-inducible proteins Wisp1-Wisp3 were later defined as the three remaining members (CCN4-CCN6) of the family (Pennica et al., 1998).
ISSN 2059-7983 © 2020 International Union of Crystallography
The CCN proteins have been shown to be involved in developmental processes such as angiogenesis, osteogenesis, proliferation and differentiation (Kubota & Takigawa, 2007; Katsube et al., 2009; Kawaki et al., 2011; Hara et al., 2016), and they are suggested, typically through overexpression, to contribute to diseased states in inflammation, fibrosis and various types of cancer (Kular et al., 2011; Riser et al., 2015; Li et al., 2015; Kim et al., 2018). However, the molecular mechanisms underlying the functions and regulation of the CCN proteins are poorly understood, mostly owing to the large number of ligands that have been reported to interact with the CCN proteins and the variety of signalling pathways that they are involved in. While the CCN proteins have been shown to activate specific signalling pathways, direct receptors for these proteins have not yet been identified. There is also an increasing amount of evidence that the CCN proteins, which are sometimes referred to as growth factors, affect several signalling pathways via direct interactions with cytokines and extracellular matrix components. For instance, CCN2 downregulates bone morphogenetic protein-2 (BMP-2) and BMP-4-mediated Smad1/5/8 phosphorylation and the activation of mitogen-activated protein kinase (MAPK) pathways, thereby inhibiting embryogenesis and chondrocyte proliferation (Abreu et al., 2002; Maeda et al., 2009). Signalling by other growth factors, such as transforming growth factor (TGF), fibroblast growth factor (FGF), vascular endothelial growth factor (VEGF) and platelet-derived growth factor (PDGF), has also been shown to be affected by the CCN proteins (Inoki et al., 2002; van Roeyen et al., 2008; Nishida et al., 2011; Aoyama et al., 2012; Abreu et al., 2002). Furthermore, the modulation of signal transduction by the CCN proteins can also be achieved by association with extracellular matrix components. These include sulfated proteoglycans, fibronectin, decorin, low-density lipoprotein (LDL) receptor-related protein (LRP), Notch and integrins (Chen et al., 2000; Yoshida & Munakata, 2007; Vial et al., 2011; Gao & Brigstock, 2003; Sakamoto et al., 2002; Tan et al., 2009). Despite this wealth of information, the molecular determinants of the interactions of CCN proteins with other molecules still have to be elucidated.
Another feature contributing to the complexity of their molecular functions is that, like many other extracellular proteins, the CCN proteins are a mosaic of structurally distinct domains. Four discrete cysteine-rich domains, an insulin-like growth factor-binding domain (IB), a von Willebrand factor C domain (vWC), a thrombospondin type 1 repeat (TSP1) and a C-terminal cystine-knot domain (CTCK), make up the primary structure of the CCN proteins (Bork, 1993). A short, variable hinge region separates the N-terminal IB and vWC domains from the C-terminal TSP1 and CTCK domains. The CCN family members are highly conserved in their primary structure, with 31-50% pairwise sequence identity between the six paralogues, except for the absence of the CTCK domain in CCN5 (Brigstock, 2003). The 38 cysteine residues are spread out across the four domains and are nearly invariant, with the exception of CCN6, which lacks four cysteines in its vWC domain. Small-angle X-ray scattering (SAXS) analysis has provided us with the first glimpse of the structural arrangements of the four domains, and shows an extended, nonglobular fold, with flexibility between the domains predicted to facilitate simultaneous ligand binding (Holbourn et al., 2011). A number of studies have shown that this hinge is prone to proteolysis and the resulting fragments of CCN proteins have been identified in various tissues (Yang et al., 1998; Perbal et al., 1999; Burren et al., 1999; Su et al., 2001; Roestenberg et al., 2004). Recent work shows that cleavage of CCN2 in this hinge is required for CCN2-mediated activation of Akt and the ERK pathway, suggesting that the full-length CCN proteins are latent pro-forms (Kaasbøll et al., 2018).
Despite the wealth of data on the role of CCN proteins in signalling, very little is known of their structure and interactions at the molecular level. We have recently published the structure of the vWC domain of CCN3 (Xu et al., 2017), but no other high-resolution structures are known of CCN proteins or their domains. Several structures of IB domains have been determined (PDB entries 1h59, 1wqj, 2dsp, 3tjq and 3zxb; Zesławski et al., 2001; Siwanowicz et al., 2005; Sitar et al., 2006; Eigenbrot et al., 2012; Trachsel et al., 2012), as well as one structure of a CTCK domain (Zhou & Springer, 2014), but given their low sequence similarity to CCN proteins relatively few functional predictions can be derived from these structures.
Here, we present the first crystal structure of the TSP1 domain from CCN3. TSP1 domains, also previously known as thrombospondin type 1 repeats (TSRs), were initially identified in human endothelial cell thrombospondin-1 (Lawler & Hynes, 1986) and have turned out to be one of the most common motifs in extracellular proteins, with close to 402 domains in 97 different human proteins (according to release 32.0 of the Pfam database; El-Gebali et al., 2019; https://pfam.xfam.org/family/PF00090). This small domain contains approximately 50 amino-acid residues and is characterized by a well conserved pattern of residues containing six cysteines (two of which are variable; further details are given in Section 3), two arginines and two tryptophans.
It has been reported that the TSP1 domains in thrombospondin-1 inhibit angiogenesis through interactions with α3β1 and αvβ3 integrins and CD36 on the endothelial cell surface, while CCN2 TSP1 sequesters VEGF away from its receptors (Dawson et al., 1997; Bornstein, 2001; Inoki et al., 2002). TSP1 domains also typically bear glycosaminoglycan (GAG)-binding sites, which are used for the mediation of GAG-dependent cell adhesion (Clezardin et al., 1997). The neural guidance receptor Unc5 controls latrophilin GPCR-FLRT-mediated cell adhesion, in which its TSP1 domain is responsible for formation of the octameric complex (Jackson et al., 2016). TSP1 domains have often been found to be modified by atypical glycosylations, in particular mannosylation on tryptophan residues in a WXXW motif, and fucosylation on serines and threonines in a CX2-3(S/T)C motif (Hofsteenge et al., 2001). Recent structural analysis of properdin has found 15 mannosylation sites in its six TSP1 domains, but the exact role of these modifications is not clear (Pedersen et al., 2019). The CCN1 TSP1 domain has been shown to be fucosylated on Thr242, a position that is conserved across CCN proteins (Niwa et al., 2015). Because CCN proteins contain only one conserved tryptophan in the TSP1 domain and the recognition motif for mannosylation is WXXW (Julenius, 2007), they are unlikely to be mannosylated.
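As an illustration of the two recognition motifs described above, the following minimal sketch scans a one-letter amino-acid sequence for WXXW and CX2-3(S/T)C patterns. It is not part of the analysis reported here: the function name `find_glyco_motifs` and the example sequence are hypothetical, and a match only flags a candidate site, not an experimentally confirmed modification.

```python
import re

# W-x-x-W (mannosylation motif) and C-x(2,3)-[ST]-C (fucosylation motif),
# written as zero-width lookaheads so that overlapping matches are all found.
MOTIFS = {
    "WXXW": re.compile(r"(?=(W..W))"),
    "CX2-3(S/T)C": re.compile(r"(?=(C.{2,3}[ST]C))"),
}

def find_glyco_motifs(seq):
    """Return {motif name: [(1-based start position, matched text), ...]}."""
    seq = seq.upper()
    return {
        name: [(m.start() + 1, m.group(1)) for m in pattern.finditer(seq)]
        for name, pattern in MOTIFS.items()
    }

if __name__ == "__main__":
    toy = "AWSPWSECSVTCG"  # hypothetical sequence, for illustration only
    for name, found in find_glyco_motifs(toy).items():
        print(name, found)
```

A sequence with a single tryptophan in the relevant region, as described for the CCN proteins, would return an empty list for the WXXW pattern.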
In the CCN proteins, the TSP1 domain of CCN2 has been found to display the most promising regenerative effect on chondrocytes and osteoarthritis compared with the other individual domains and full-length CCN2 (Abd El Kader et al., 2014). By solving the first structure of a TSP1 domain in the CCN family, we hope to facilitate new experiments that will help to define the still unclear molecular functions of the CCN proteins.

Cloning and expression of the CCN3 TSP1 domain
The expression construct for the rat CCN3 TSP1 domain (residues 195-249; UniProt Q9QZQ5) was amplified by PCR using overlapping oligonucleotides (forward, TATATCCATGGATTCTAGTATCAACTGCATTGAGCAG; reverse, TATATAAGCTTATTCCCCAGGCTCTTGCTCACAAGG) from cDNA (a kind gift from Dr Paul Kemp) and cloned into the pHAT4 vector (Peränen et al., 1996), which contains an N-terminal His₆ tag followed by a TEV protease cleavage site. For protein expression, the construct was transformed into Escherichia coli BL21(DE3) competent cells, which were grown overnight on LB-agar plates containing 100 µg ml⁻¹ ampicillin. The resulting colonies were cultured in 2×YT medium with 100 µg ml⁻¹ ampicillin at 37 °C under agitation until the cells reached an OD600 nm of 0.8-1.0. Protein expression was induced with 400 µM IPTG for 4 h at 37 °C. The cells were pelleted by centrifugation, resuspended in double-distilled H₂O (ddH₂O) and stored at −20 °C.
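The primer sequences above can be checked programmatically. The sketch below, written with the sequences free of internal spaces, looks for two common restriction-enzyme recognition sequences in the primers. That NcoI and HindIII were the cloning sites is an assumption made for this example only; the text does not name the enzymes, and the helper names are hypothetical.

```python
# Primer sequences as given in the text (5'->3'), without internal spaces.
FORWARD = "TATATCCATGGATTCTAGTATCAACTGCATTGAGCAG"
REVERSE = "TATATAAGCTTATTCCCCAGGCTCTTGCTCACAAGG"

# Recognition sequences of two common restriction enzymes. Their use as the
# cloning sites is an assumption; the enzymes are not named in the text.
SITES = {"NcoI": "CCATGG", "HindIII": "AAGCTT"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return dna.upper().translate(_COMPLEMENT)[::-1]

def find_sites(dna, sites=SITES):
    """Return {enzyme: 0-based index of first occurrence} for sites present."""
    dna = dna.upper()
    return {name: dna.find(site) for name, site in sites.items() if site in dna}

if __name__ == "__main__":
    print("forward primer:", find_sites(FORWARD))
    print("reverse primer:", find_sites(REVERSE))
    print("reverse primer, coding-strand sense:", reverse_complement(REVERSE))
```

Both recognition sequences are palindromic, so searching one strand of each primer is sufficient for this check.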

Protein refolding and purification
The CCN3 TSP1 domain was expressed as insoluble inclusion bodies and was subsequently refolded to regain its native conformation. The harvested cells were first lysed using an Emulsiflex C5 homogenizer in lysis buffer (50 mM Tris-HCl pH 8.0, 2 mM EDTA, 10 mM DTT) with the addition of 0.5%(v/v) Ralufon DM detergent. The lysate was incubated with 10 µg ml⁻¹ DNase I and 4 mM MgCl₂ for 20 min at room temperature. Inclusion bodies were separated by centrifugation and were washed twice by homogenization in lysis buffer containing either 0.5% Ralufon DM or 1 M NaCl and finally once with lysis buffer only. Denaturation was achieved by resuspension in 6 M guanidine hydrochloride, 50 mM Tris-HCl pH 8.0, 5 mM EDTA, 25 mM tris(2-carboxyethyl)phosphine. The denatured protein was clarified by centrifugation, buffer-exchanged into 6 M urea, 20 mM HCl and adjusted to 1 mg ml⁻¹. Refolding was performed by a 1:10 rapid dilution into 100 mM Tris-HCl pH 8.5, 100 mM ethanolamine pH 8.5, 1 M pyridinium propyl sulfobetaine, 2 mM cysteine, 0.2 mM cystine and the mixture was left for seven days at 4 °C. Refolded protein was first purified by Ni-NTA affinity chromatography by loading it directly onto a benchtop column packed with Ni-NTA beads (GE Healthcare) equilibrated with 50 mM Tris-HCl pH 8.0, 0.5 M NaCl, 20 mM imidazole and eluting with 50 mM Tris-HCl pH 8.0, 0.5 M NaCl, 300 mM imidazole. The eluate was dialyzed against 20 mM Tris-HCl pH 8.0, 200 mM NaCl (10 h per cycle for three cycles) at 4 °C. The N-terminal His tag was cleaved by adding 0.4 mg Tobacco etch virus (TEV) protease together with cysteine to a final concentration of 2 mM, and the cleaved protein was further purified by passing it through the Ni-NTA column. The cleaved protein was purified to homogeneity (as verified by SDS-PAGE) by reversed-phase chromatography using an ACE C8-300 5 µm column eluted with 10-90% acetonitrile containing 0.1% trifluoroacetic acid. The purified protein was lyophilized and resuspended in ddH₂O. MALDI mass-spectrometric analysis was used to confirm the molecular weight and the formation of disulfide bonds.

Crystallization
Purified CCN3 TSP1 domain at a concentration of 17.6 mg ml⁻¹ was subjected to crystallization experiments in 96-well plates in sitting drops consisting of 100 nl protein solution and 100 nl crystallization solution using a number of commercial crystallization screens. The initial crystal hits were improved by streak-seeding using a rabbit whisker, and larger crystals subsequently appeared in 3.0 M NaCl, 0.1 M Tris pH 8.0 in 1 + 1 µl hanging drops. For experimental phasing, derivative crystals were produced by soaking the native crystals in 6.7 mM K₂PtCl₄ in mother liquor overnight. The crystals were cryoprotected in mother liquor supplemented with 25%(v/v) glycerol and cryocooled using liquid nitrogen.

Data collection, structure determination and refinement
X-ray diffraction data for CCN3 TSP1 were collected on beamline ID14-4 at the European Synchrotron Radiation Facility (ESRF) using an ADSC Q315r CCD-based X-ray detector (ADSC, California, USA) via remote control from Cambridge, England. A multiwavelength anomalous dispersion (MAD) phasing experiment was performed using the K₂PtCl₄-soaked CCN3 TSP1 crystals. Two data sets were recorded from these derivative crystals at wavelengths of 1.0717 Å for the peak anomalous signal and 1.0721 Å for the inflection point. High-resolution diffraction data were obtained from the native crystals at a wavelength of 0.93 Å. The data were indexed and integrated using iMosflm v.1.0.7 (Battye et al., 2011) and were scaled using AIMLESS v.1.1 (Evans & Murshudov, 2013) from the CCP4 suite v.6.3.0 (Winn et al., 2011). The processed data collected from the platinum-derivative crystal were subjected to experimental phasing using the AutoSHARP v.2.8.2 pipeline (Vonrhein et al., 2007). SHELX (Schneider & Sheldrick, 2002) was used to locate the heavy atoms and found three sites with correlation coefficients (CCs) of 0.239. SOLOMON (Abrahams & Leslie, 1996) was used to determine the handedness and for density modification, followed by automatic model building with the ARP/wARP suite (Perrakis et al., 1999), which resulted in a model with R and R free factors of 0.250 and 0.301, respectively, at 2.33 Å resolution. This was used as the search model for molecular replacement by Phaser v.2.5.1 (McCoy et al., 2007) using the high-resolution native data set at 1.65 Å. Refinement was performed using REFMAC v.5.5 (Vagin et al., 2004) and phenix.refine (Adams et al., 2010; Liebschner et al., 2019) without imposing noncrystallographic symmetry, while Coot (Emsley et al., 2010) was used for model building and validation. Data-collection and refinement statistics are shown in Table 1. The coordinates and structure factors have been deposited in the Protein Data Bank with accession code 6rk1.

Conservation analysis
The ConSurf server (https://consurf.tau.ac.il) was used to evaluate the degree of evolutionary conservation of each amino-acid position in the CCN3 TSP1 domain. The protein sequences of TSP1 domains in different CCN-family members across a range of species were extracted from the Ensembl database v.76 (http://www.ensembl.org/index.html; Aken et al., 2016) and were aligned and scored for position-specific conservation using the ConSurf server (Ashkenazy et al., 2010, 2016).
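To make the idea of position-specific conservation concrete, the sketch below scores each column of a multiple sequence alignment with a normalized Shannon entropy. This is a deliberately simple stand-in and not the method implemented by the ConSurf server, which estimates evolutionary rates from a phylogeny; the function name and the toy alignment in the test are hypothetical.

```python
import math
from collections import Counter

def column_conservation(alignment):
    """Score each alignment column from 0 (maximally variable) to 1 (invariant).

    The score is 1 - H/log2(20), where H is the Shannon entropy of the residue
    frequencies in the column. Gap characters ('-') are ignored.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    max_entropy = math.log2(20)  # entropy of 20 equally frequent amino acids
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column carries no information
            continue
        total = len(residues)
        entropy = -sum(
            (n / total) * math.log2(n / total)
            for n in Counter(residues).values()
        )
        scores.append(1.0 - entropy / max_entropy)
    return scores
```

Columns scoring 1 in such a scheme would correspond to invariant positions such as the cysteines discussed later in the text, although the absolute values are not comparable with ConSurf grades.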

Sequence-similarity network construction
A sequence-similarity network was constructed using the domain sequences from the four Pfam families of the TSP1 clan (CL0692): TSP_1 (PF00090), TSP1_spondin (PF19028), TSP1_ADAMTS (PF19030) and TSP1_CCN (PF19035). 1000 sequences from each family were randomly selected and used in an all-against-all BLAST search. All pairwise hits with a score above 35 bits were further loaded into Cytoscape (Shannon et al., 2003) to display the network using the default 'Prefuse Force Directed Layout' method. In order to highlight the structural coverage of the families in the network, domain sequences in the ECOD database v.248 (23 August 2019; Cheng et al., 2014) matching any of the TSP1 Pfam family models were included using the same procedure as described here.
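The thresholding step described above can be sketched as follows. The example assumes that the all-against-all BLAST results are already available as (query, subject, bit score) tuples; it does not run BLAST or Cytoscape, and the function names are hypothetical. Clusters are reported as connected components, which is one simple way to summarize such a network and is not necessarily how the groups in this work were delineated.

```python
def build_ssn(hits, min_bits=35.0):
    """Build an undirected network from (query, subject, bit_score) tuples.

    Every sequence becomes a node. Self-hits are ignored and only pairs
    scoring above min_bits become edges. Returns {node: set(neighbours)}.
    """
    graph = {}
    for query, subject, bits in hits:
        graph.setdefault(query, set())
        graph.setdefault(subject, set())
        if query != subject and bits > min_bits:
            graph[query].add(subject)
            graph[subject].add(query)
    return graph

def clusters(graph):
    """Return the connected components as a list of sorted node lists."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components
```

The resulting edge list could then be exported in any format that a network viewer accepts for layout and colouring by family.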

Results and discussion
The crystal structure of the CCN3 TSP1 domain
The structure of the CCN3 TSP1 domain was determined by MAD phasing using K₂PtCl₄-derivative crystals (Table 1 and Fig. 1a). The three main heavy-atom sites correspond to three Pt atoms bound to S atoms in methionine residues. One Pt atom is bound to Met231 in chain A, one of the two chains in the asymmetric unit, with an occupancy of 1, and two Pt atoms are bound to the two split conformations of the same Met231 in the other chain, with occupancies of 0.6 and 0.4, respectively. The structure from MAD phasing was further refined to a resolution of 1.63 Å using the native data set (Table 1 and Figs. 1b and 1c). The two molecules in the asymmetric unit are nearly identical, with an r.m.s.d. of 0.345 Å for 44 Cα atoms.
The CCN3 TSP1 structure exhibits an elongated fold consisting of three antiparallel strands, placing the N- and C-termini at opposite ends of the domain. This small domain is stabilized by the three disulfide bonds from its six cysteines, which are distributed all along the sequence. The top disulfide (when the domain is viewed with its N-terminus pointing up) is formed between Cys¹200 and Cys⁴229 (the superscript numbers refer to the sequential positions of the cysteines in the domain) and connects strands I and III; the disulfide between Cys²210 and Cys⁵238 links strand I to the end of the β-sheet in strand III in the middle of the domain; and the third disulfide, between Cys³214 and Cys⁶243, links the turn between strands I and II to the very C-terminus of the domain (Fig. 1d).
Strand I (Asn199-Cys214) is more irregular and rippled, while strands II (Gly218-Leu223) and III (Gln232-Glu237) form a regular antiparallel β-sheet (Fig. 1d). In addition to secondary-structure-defining interactions between strands II and III, the structure is stabilized by hydrogen bonds formed between the irregular strand I and strand II. These include the main chain-main chain interactions Gln203 O-Asn226 N, Thr205 N-Val222 O, Ser208 N-Thr220 O and Ser211 N-Leu218 O, a pair of side chain-side chain hydrogen bonds from Glu202 Oε to the Nη and Nε atoms of Arg225, and a few other main chain-side chain hydrogen bonds between strands I and II (Fig. 1e).

Conservation analysis of TSP1 in CCN proteins
The surface of the CCN3 TSP1 domain shows significant cavities as potential binding sites for ligands. Projection of the electrostatic potential onto the surface shows a strongly positively charged zone around the centre of the domain. TSP1 domains are known to bear GAG-binding sites (Clezardin et al., 1997) and this patch could form a part of a heparan sulfate (HS) binding site, while integrins are also known binding partners (Bornstein, 2001). In the absence of mutagenesis or other data on functional sites on the TSP1 domain, we turned to analysis of the evolutionary conservation of the domain. We took all available CCN-family proteins from the Ensembl genome database and aligned them. This alignment across higher eukaryotes was mapped onto the CCN3 TSP1 structure using the ConSurf server by colouring according to conservation score (Fig. 2). In addition to the almost invariant cysteines, the highest conservation mapped to the central part of the domain, with Trp207, Ser219, Arg221, Gln234 and Arg236 showing 100% conservation in all CCN-family TSP1 domains. These residues are localized in the positively charged cluster and point to the 'front' of the domain, as shown in Fig. 1. As they are part of the charged/aromatic spine of the domain, it is impossible to say whether they are conserved for functional reasons or for structural reasons, or both. The only known post-translational modification (apart from disulfide-bond formation) in CCN TSP1 domains is the fucosylation of Thr242 in CCN1 (Ser213 in CCN3; Niwa et al., 2015). This residue is at the end of the first strand, between Cys¹ and Cys⁴, and is found in all CCN proteins as either a serine or a threonine, both of which can be fucosylated. As expected, this residue is fully solvent-exposed and thus possible to modify post-translationally, and is relatively far from the most conserved patch on the other side of the domain. Curiously, another serine nearby (Ser211 in CCN3) is almost 100% conserved across the CCN proteins. As we produced our protein in bacteria through refolding, the crystallized protein was not glycosylated. While the modification could have a functional role (mutation of Thr242 in CCN1 affects secretion), we would not expect it to make a significant structural difference to the domain.

Similarity and diversity in TSP1 domains
To analyze the structure further, we collected all other TSP1-domain structures from the PDB. Firstly, we used the DALI server (http://ekhidna2.biocenter.helsinki.fi/dali/; Holm & Laakso, 2016) to find the closest homologue from the PDB90 set, which contains proteins with a maximum of 90% pairwise identity. This revealed sporozoite surface protein 2 (PDB entry 4hqo; Song et al., 2012) as the closest homologue of the CCN3 TSP1 domain, but thrombospondin, F-spondin and complement C6 and C8 proteins were also identified in this search as related structures. Further TSP1 structures were taken from the Pfam database listing (family PF00090). A list of currently available TSP1-domain structures is shown in Supplementary Table S1.
The elongated three-stranded fold is observed in all TSP1 domains. A distinctive feature of TSP1 domains is the set of so-called 'CWR layers', consisting of the side-chain stacking of cystines, tryptophans and arginines, as described by Tan et al. (2002). In thrombospondin-1 repeat 2/3, an array of tryptophan and arginine residues forms multiple π-cation interactions between the aromatic rings and the planar cationic guanidinium groups; these arginines are often found paired with large polar residues, forming side-by-side hydrogen bonds across the strands. Together with the three C layers of disulfide bridges, these stacked residues form a stabilizing spine for this small domain and appear to provide structural rigidity to it in the absence of a hydrophobic core. The first W layer is in some cases replaced by another hydrophobic residue (Leu or Tyr in spondin-1), but otherwise the CWR-stacked structure is conserved in spondin-1 repeat 1/4 (Fig. 3a) and the structures of other TSP1 domain-containing proteins, thrombospondin-related anonymous protein (TRAP) and complement components C6 and C8 (Lovelace et al., 2011; Song et al., 2012; Aleshin et al., 2012; Tossavainen et al., 2006). In contrast, in the CCN3 TSP1 domain the three R layers and three C layers are conserved in sequence, but only the third W layer is present (Fig. 3). The W layers in all TSP1 domains are located in strand I, with their aromatic side chains extending towards the centre of the structure. The R layers in the thrombospondin-1 and spondin-1 TSP1 domains are all formed between strands II and III. In CCN3 TSP1, the top R layer is formed between strands I and II instead and the side chain of the bottom R layer points away from the domain, stacking against the same residue from the other domain in the asymmetric unit. Whether this is only a consequence of crystal packing or reflects a true lack of the bottom R layer is not clear. The consequence of the missing top W layer (and the absence of another hydrophobic residue that could replace it) is that the CCN3 TSP1 domain is more open, with strand I more separate from the rest of the structure compared with domains that pull the N-terminus towards the core of the domain by W-layer interactions (Fig. 3). The first TSP1 domain of ADAMTS13 also has a noncanonical fold, with the third W layer replaced by a charged arginine residue (Arg393) and the third arginine layer by a hydrophobic residue (Val405) (Fig. 3). The six TSP1 domains of properdin, structures of which have recently become available, show complete CWR stacking, with the exception of repeat 3, which contains a histidine instead of a tryptophan in the bottom layer (van den Bos et al., 2019; Pedersen et al., 2019).
Figure 2. Structure of the CCN3 TSP1 domain reflecting evolutionary conservation. From left to right: ribbon (a), sticks (b) and surface (c) (front and back views) representations, coloured by projecting the conservation scores of the residues onto the structures (the sequence alignment is shown in Supplementary Fig. S1). The residues clustered in the most conserved surface patch are labelled.
Another variable feature among the TSP1 domains is the disulfide-bond pattern. The three C layers comprise one layer at the very top of the structure (when viewed with the N-terminus at the top) and two consecutive layers at the bottom, alternating with W and R layers. The bottom two C layers are conserved among all TSP1 domains and are formed between strands I and III, whereas the top C layer varies in its position and connectivity. In CCNs and spondins, Cys⁴ is located at the top of strand III and is disulfide-bonded to Cys¹ at the very N-terminus of strand I. In thrombospondin, Cys⁴ forms a disulfide with Cys³ (which is missing from CCN-like domains) in the middle of the sequence, at the top of strand II. The differences in the disulfide connectivity in the first C layer at the top of the domain and the lack of the first W layer result in larger differences in the structure of the CCN3 TSP1 domain compared with other similar domains. The central W layer in CCN3 ensures that the core of the domain aligns well with those of other TSP1s. As is typical for a large family of disulfide-rich domains, there are always more subtle variations in the connectivity. For example, the circumsporozoite protein TSP1 domain (PDB entries 3vdl and 6b0s; Doud et al., 2012; Scally et al., 2018) containing insertions in loop II-III, whereas micronemal protein 2 (PDB entry 4okr; Song & Springer, 2014) also contains a long insertion in loop II-III with an additional pair of disulfide-linked cysteines, and properdin repeat 5 contains a long insertion between Cys³ and Cys⁴ (Supplementary Fig. S2c). In addition, repeat 2 in properdin contains an additional cysteine between the canonical Cys³ and Cys⁴ which forms an additional disulfide to the linker to repeat 1 (Supplementary Fig. S2c).
Overall, the 'canonical' TSP1 domains with Cys³-Cys⁴ connectivity are more structurally conserved, with a very well defined layered structure, whereas domains with Cys¹-Cys⁴ connectivity have more variable structures and are difficult to align unambiguously.
This difference in the top C layer can be used to categorize different TSP1 domains in matricellular proteins. Sequence alignment of selected TSP1 domains with alternative disulfide connectivities shows that while CCN and spondin proteins share the same Cys¹-Cys⁴ disulfide pattern, ADAMTS (an extracellular protease), UNC5C (a receptor for netrin) and properdin (a plasma protein), together with thrombospondin, form an alternative group (Fig. 4). To evaluate whether the differential connectivity results in inherent strain and possible lability of the disulfides, we analyzed the disulfide-bond geometry in the high-resolution TSP1 structures (Supplementary Fig. S3). The disulfide angles for the two classes of domains were not significantly different: 96 ± 26° versus 99 ± 25.5° for Cys¹-Cys⁴ and Cys³-Cys⁴ connectivities, respectively. The energy of the disulfide bond is dependent on the dihedral angle across the disulfide, with energy minima around ±90°, suggesting that the two differing connectivities do not impose differential strain on the rest of the structure (Haworth et al., 2007). However, the functional implication of this structural division of disulfide connectivity is as yet unclear, even though Tan et al. (2002) originally suggested that the division of the top disulfide also divides the effect on angiogenesis. It still appears to be true that several proteins from the Cys¹-Cys⁴ disulfide group, including CCN proteins, spondins and HB-GAM, stimulate angiogenesis (Kazanskaya et al., 2008; Papadimitriou et al., 2016; Kubota & Takigawa, 2007), while the Cys³-Cys⁴ group, including thrombospondins, Unc5, properdins, ADAMTS proteins and papillin, inhibit angiogenesis (Lawler & Lawler, 2012; Larrivée et al., 2007; Gaustad et al., 2016; Sun et al., 2015; Karagiannis & Popel, 2007). 
However, knowledge of more proteins with known functions from each group, and ideally complex structures with ligands bound differently owing to the different top disulfide, are needed before we can make a reliable statement regarding the correlation, if any, between the disulfide group and the functionality.
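The dihedral angle across a disulfide referred to above is conventionally measured over the four atoms Cβ-Sγ-Sγ'-Cβ'. The following sketch computes a signed dihedral angle from four sets of Cartesian coordinates; it is an illustration only, not the script used for Supplementary Fig. S3, and the function names are hypothetical. Coordinates would in practice be read from a PDB or mmCIF file.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p1, p2, p3, p4):
    """Signed dihedral angle in degrees, in the range (-180, 180]."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    n2 = _cross(b2, b3)  # normal of the plane through p2, p3, p4
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def disulfide_dihedral(cb_a, sg_a, sg_b, cb_b):
    """Dihedral across a disulfide from the CB, SG, SG' and CB' positions."""
    return dihedral(cb_a, sg_a, sg_b, cb_b)
```

Absolute values near 90° from such a calculation would fall close to the energy minima mentioned in the text.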

Redefinition of the TSP1 domain
The existing Pfam family sequence profile (release 32.0) was entirely built from sequences that had the Cys³-Cys⁴ connectivity. This meant that matches to Cys¹-Cys⁴ connectivity domains were partial and thus missing the first critical cysteine residue. To remedy this, two new Pfam families were constructed to represent the two subtypes of Cys¹-Cys⁴ connectivity TSP1 domains related to those found in spondins and CCN proteins. These domain families have been deposited in Pfam with accession names TSP1_spondin (PF19028) and TSP1_CCN (PF19035). A new Pfam clan was also built to represent TSP1 domains, with accession name CL0692 (new TSP1 families), and the new clan will be released in v.33.0 of the Pfam database. We were further interested in knowing how common each of these connectivities was across known TSP1 domains. To investigate this, we constructed a sequence-similarity network (SSN) of all domains and highlighted the different families. The SSN showed an additional group of sequences that matched the original TSP1 family but formed their own separated cluster, so a fourth TSP1 Pfam family was built to represent them. This new family (PF19030) was found to correspond to a set of domains from ADAMTS proteins that lack one of the tryptophans and replace one of the conserved arginines with a hydrophobic residue. Examples of this domain, for which no structures have yet been experimentally determined, can be seen as the second and third domains of ADAMTS1 in Fig. 4. The complete SSN displaying the relationships among the domains in these three families is shown in Fig. 5, along with the consensus sequence logo for each of the three families.
Figure 4. Multiple sequence alignment of the TSP1 domains from CCN1-CCN6 (UniProt O00622, P29279, P48745, O95388, O76076 and O95389), spondin-1 repeats 1-6 (Q9HCB6), thrombospondin-1 repeats 1-3 (TSP1-1 to TSP1-3; P07996), ADAMTS1 repeats 1-3 (Q9UHI8), UNC5C repeats 1-2 (O95185) and properdin repeats 1-6 (P27918). The dashed line in the middle divides the TSP1 domains into two groups according to their disulfide-bond patterns, with schematic representations on the left and connectivity highlighted at the top and bottom of the sequences.
Updating the Pfam domain definitions has several important consequences. Firstly, the overall detection of TSP1 domains has increased by 17% from 57 847 to 69 393 (Fig. 5). This includes 13 proteins in Swiss-Prot, four of which are human CCN-family members in which TSP1 domains were previously not identified by Pfam. Secondly, the improved definitions allow stronger predictions of the disulfide connectivity of Pfam domain matches across all known proteins.

Discussion
The second X-ray crystallographic structure of a domain from the enigmatic CCN family of matricellular proteins, the first being that of the vWC domain of CCN3, revealed a variant form of the TSP1 domain with only a limited version of the π-cation ladder that is typical of these domains. While functional prediction from this structure of a small domain with limited conservation is difficult, the structure has helped to define the larger TSP1-domain population more accurately in the Pfam domain database and to better annotate these domains in sequence databases.
Methods for the production of CCN proteins and their fragments with high quality will allow us to use them for further analysis of their molecular functions, the identification of interaction partners and the biophysical characterization of these interactions in vitro. With significant interest in these proteins as therapeutic targets, in fibrotic conditions in particular, correctly folded proteins will also facilitate the development of neutralizing antibodies against CCN proteins.
Figure 5. New TSP1-domain families. A sequence-similarity network (SSN) of TSP1 domains coloured according to the updated Pfam family definitions. Nodes represent domain sequences and edges represent BLAST hits with a score above 35 bits. Orange nodes belong to the newly defined CCN-type TSP1 family, maroon nodes correspond to spondin-like TSP1 domains, green nodes to the additional ADAMTS-type TSP1-domain family and blue nodes mark the original Pfam TSP_1 family. Domains for which experimental structures are known, as derived from the ECOD database, are shown as larger grey nodes and are labelled with their PDB codes. In the Venn diagram, segments that do not overlap with the blue circle (original Pfam family) represent newly identified TSP1 domains. The sequence logos show the conservation within each TSP1 family, with the heights of the letters correlating with conservation.