Experimental characterisation of de novo proteins and their unevolved random-sequence counterparts

De novo gene emergence provides a route for new proteins to be formed from previously non-coding DNA. Proteins born in this way are considered random sequences, and typically assumed to lack defined structure. While it remains unclear how likely a de novo protein is to assume a soluble and stable tertiary structure, intersecting evidence from random-sequence and de novo-designed proteins suggests that native-like biophysical properties are abundant in sequence space. Taking putative de novo proteins identified in human and fly, we experimentally characterise a library of these sequences to assess their solubility and structure propensity. We compare this library to a set of synthetic random proteins with no evolutionary history. Bioinformatic prediction suggests that de novo proteins may have remarkably similar distributions of biophysical properties to unevolved random sequences of a given length and amino acid composition. However, upon expression in vitro, de novo proteins exhibit higher solubility, which is further increased by the DnaK chaperone system. We suggest that while synthetic random sequences are a useful proxy for de novo proteins in terms of structure propensity, de novo proteins may be better integrated in the cellular system given their higher solubility.

maintained structural elements. However, both proteins appear to contain segments with high intrinsic disorder (ID). Bungard et al. (9) concluded that Bsc4 is best described as having a molten globule structure, suggesting that it may lack the defined folding funnel typical of many stable native folds.
Despite these examples, the structural properties of de novo proteins remain experimentally understudied. Computational prediction of the ID and aggregation propensity of de novo proteins has sparked hypotheses regarding the evolutionary pressure acting on newly-emerged proteins (11-14). Foremost is the suggestion that avoidance of aggregation is a critical selection pressure acting on novel proteins (15). Selection against aggregation would also explain why many studies identify higher ID in de novo proteins (given the fundamental link between amino acid hydropathy and ID). More complete answers to these questions will come from experimental characterisation: this should reveal the true distribution of aggregation propensity/ID in newly emerged protein sequences. Ultimately, systematic experimental characterisation of novel sequences should reveal if novel proteins have the capacity to form folded structures and how frequently this occurs.
De novo proteins have sometimes been approximated to 'random' sequences based on the lack of selection upon their emergence. However, de novo proteins emerge from existing genomes that are already known to carry different sequence and compositional biases, e.g. in GC-content (16). Diverse areas of research have shown that compositional biases can significantly impact properties such as translation efficiency, aggregation propensity and even specific attributes of ID (15,17,18). The extent to which de novo and random sequences can be regarded as proxies therefore remains unclear. Moreover, random sequences represent true occupants of dark protein space, whose properties themselves are heavily understudied. This region of sequence space has typically been assumed to contain non-functional and disordered proteins, proteins which are likely to be toxic and degraded if expressed in cells. Nevertheless, many recent studies have identified both structure and function in random proteins. Structure itself appears to be abundant in protein sequence space. Secondary structure occurrence has been reported to be remarkably close to that of biological proteins. In addition, 20-40% of random sequence space has been observed to be resistant to proteolysis, likely due to tertiary structure formation (19-23). Furthermore, we were recently able to demonstrate that while structured random proteins are hard to express in vivo due to their higher aggregation propensity, random proteins with greater ID are readily tolerated by E. coli (22). Simultaneously, at least some protein folds appear to be relatively evolvable from random sequences. Hayashi et al. (24) were able to evolve an arbitrary random sequence to replace the D2 domain of an essential bacteriophage protein.
Function through binding may be the most likely role that an unevolved protein could attain. For example, ATP-binders have been selected from pools of random proteins (25). Random and partially randomised peptides have also been shown to have functional effects when expressed in vivo (26-29). Finally, a smaller number of studies have evolved catalytic activity from randomised sequences, including esterase, barnase, and RNA-ligase activity, the presence of which is itself an indicator of structured catalytic centres (30-32). Altogether, while the above-listed studies suggest that both random and de novo proteins have non-zero structural and functional potential, their mutual relevance remains unclear.
Here, we set out to go further than previous studies by analysing the structural potential of de novo proteins. In doing so, we bring two strands of research together and experimentally characterise sets of i) 1800 putative de novo proteins identified in human and fly genomes and ii) 1800 synthetically-generated random sequences. While earlier studies were entirely computational, or experimentally characterised single proteins, we quantify the properties of putative de novo proteins and compare them to 'true' random sequences, i.e. unevolved and synthetically-generated. We investigate two fundamental properties (solubility and structure content) using techniques previously unapplied to bulk analysis of de novo proteins.
We find that de novo proteins appear broadly similar to random sequences when length and amino acid frequencies are held constant. Consistent with computational prediction, the set of 1800 putative de novo proteins we study had similar overall protease resistance to the set of synthetic random sequences. This indicates that, at least given the amino acid composition of the de novo sequences chosen, random sequences have similar structural potential. However, we also find that de novo proteins are (moderately) more soluble at this composition and structure level. This is indicative of some selective pressure having acted over the course of their real (albeit short) evolutionary histories.

Results
A library-based approach enables high throughput investigation of de novo proteins. In this study we combine computational and experimental characterisation of two libraries: i) a set of 1800 putative de novo proteins identified in human or fly, and ii) a set of 1800 synthetic random sequences with no evolutionary history. Libraries were synthesised as an oligonucleotide pool, limiting proteins to 66 residues or less. A lower bound of 44 residues was chosen given the diminishing likelihood of domain-like structures for very short proteins. With these constraints, 1800 sequences were selected from published sets of novel proteins (Fig. 1a). Fly sequences (n=176) are estimated to have emerged from previously non-coding intronic or intergenic regions <50 Mya, and all are annotated as protein-coding genes in Drosophila melanogaster (13). Human sequences (n=1624) are unannotated intronic or intergenic ORFs with Homo sapiens-specific expression (i.e. born <6.7 Mya) (12). We refer to the fly and human subsets of library DN as 'putative de novo proteins'. In both cases, proteins were found to have weak, tissue-specific expression, and low-to-moderate signals of selection.
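The length constraints above can be illustrated with a short sketch. It assumes that "compatible" sequences are simply those within the 44 to 66 residue bounds, and that the longest candidates are taken first (as the Methods describe for the oligo pool); the function name, the identifiers and the toy sequences are hypothetical and are not taken from the study's code.

```python
def select_longest(orfs, n=1800, min_len=44, max_len=66):
    """Return identifiers of the n longest sequences within the length bounds.

    orfs: dict mapping an ORF identifier to its protein sequence.
    Ties in length are broken alphabetically by identifier (an assumption).
    """
    eligible = {k: s for k, s in orfs.items() if min_len <= len(s) <= max_len}
    ranked = sorted(eligible, key=lambda k: (-len(eligible[k]), k))
    return ranked[:n]


# Toy input: only the sequence lengths matter for this illustration.
toy_orfs = {
    "orf_a": "M" * 40,  # too short
    "orf_b": "M" * 50,
    "orf_c": "M" * 66,
    "orf_d": "M" * 70,  # too long
    "orf_e": "M" * 44,
}
picked = select_longest(toy_orfs, n=2)
print(picked)
```

Any additional compatibility criteria used in practice (for example restriction-site constraints) would need to be added as further filters.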
Given the recent acquisition of these proteins and their apparent unevolved nature, it remains unclear how these novel proteins differ from 'true' random sequences, if at all. For both human and fly sequences, various protein properties were predicted. Fly de novo proteins were compared to randomly sampled intergenic sequences without expression evidence, and found to have higher GC-content and ID. Human-specific ORFs identified by Dowling et al. (12), which make up the majority of library DN, were not compared to a 'more random' set of sequences. However, they were found to have lower GC-content than conserved ORFs (i.e. 'conservation level 5', with exon overlap), but similar predicted ID. This discrepancy between GC-content and ID may be explained by the action of selection: either on newly emerged proteins, or over longer evolutionary timescales to shape the properties of highly-conserved ORFs.
To identify such selection towards a given biophysical property, a natural and feasible approach is to compare the set of putative de novo proteins to 'true' random controls and see if they differ. For this reason, a synthetic random library (R) was designed, with amino acid frequency and length distributions matched to library DN. Given that amino acid composition is a major determinant of all biophysical properties, the specification of library R should provide the most appropriate comparison; any differences in protein property between DN and R should be attributable to the specific residue ordering (and not compositional bias; see Fig. S2).
De novo proteins are predicted to have highly similar properties to unnatural random sequences. Having designed libraries of putative de novo (library DN) and synthetic random proteins (library R) in silico, we next made a number of bioinformatic predictions of protein property. Figure 2 shows predictions for four relevant features. To put biophysical properties in context with those of conserved (i.e. 'native-like') proteins, predictions are compared to a length-matched subset of 3600 annotated human proteins. In all cases, predictions for DN and R are highly similar. Predictions of ID distribute similarly for all three classes (Fig. 2a), as does aggregation propensity (Fig. 2b). Comparison to annotated human proteins suggests reduced propensity for α-helices in both libraries (Fig. 2c), but higher propensity for β-sheets (Fig. 2d). Accordingly, from primary sequence alone, libraries DN and R appear to have appropriate levels of hydrophobic and hydrophilic residues to form native-like structural content. For additional property predictions see Figure S1.

Fig. 1. a) A de novo library (DN) was built from putative de novo proteins identified in human and fly: subsequently, a library of unevolved random sequences (R) was designed to mirror the length and amino acid frequencies of library DN. The two libraries were synthesised by OLS ready for experimental study. b) Approaches used to profile solubility and structure content of each library. Following amplification, each library was expressed in a chaperone-assisted cell-free format, and a non-specific Lon protease was used to quantify structural content. In parallel, subcloned libraries were expressed in E. coli to screen for soluble and folded variants that did not disrupt periplasmic export (Tat export assay: plating without ampicillin or with 100 µg/mL ampicillin, followed by next-gen sequencing).
Prediction tools such as IUPred have been trained using the (relatively small) sets of proteins for which disorder or aggregation has been determined experimentally. Given the novel and unevolved nature of our libraries, we looked for a more generalisable predictor of structural content or stability. Learned embeddings have been described recently as a way to encode fundamental protein features learned over much larger regions of sequence space than have been experimentally characterised (33). For example, using UniRep embeddings as input, a linear model was shown to outperform Rosetta total energy predictions when trained on protease sensitivity data (34,35).
Prior to an experimental protease assay (see following sections), we implemented this predictive model to generate protease stability scores for each library. As shown in Figure S3, we find libraries DN and R have highly similar predictions. The control set of annotated human proteins is predicted to be marginally more stable on average. However, scores broadly overlapped with those for DN and R.
Derived from trypsin-based proteolysis data, the stability values predicted here are expected to correlate with total structure content and globularity. Accordingly, together with secondary structure predictions (Fig. 2), both libraries appear to have potential for structural content similar to that of conserved proteins. While de novo proteins may distribute to a particular region of protein sequence space (either due to selection or as a byproduct of their occurrence in a genome), library R is not similarly constrained. Instead, the similarity of all predictions for DN and R with those for conserved proteins appears to result from their similar amino acid compositions.
Aside from illustrating that all random sequences with appropriate amino acid composition may have structure-forming potential, the predictions made here suggest that any differences between this set of putative de novo proteins and their unevolved random counterparts cannot be detected computationally. This hypothesis is entirely plausible, but testing it computationally relies on the accuracy of the predictors used, which may not be sensitive to small differences, especially when compositional biases are removed. For this reason, we next sought to validate these predictions experimentally.
A cell-based protein export assay identifies soluble library members. Following in silico design, the libraries DN and R were synthesised as an oligonucleotide pool (Fig. 1a). De novo and random subpools were PCR amplified from this pool and used as a starting point for subsequent experimental work. We first used a twin-arginine export quality assay, which relies on translocation of β-lactamase via the twin-arginine translocation (Tat) pathway, to screen for soluble members of each library (36). This assay is implemented by sub-cloning each library to a vector encoding an N-terminal secretion-signal and a C-terminal β-lactamase (construct illustrated in Fig. 1b). Upon expression of the resulting fusion constructs in E. coli, successful export of the fused β-lactamase can be detected by colony formation on ampicillin plates. Ampicillin can therefore be used to select for library members that do not interfere with translocation. The twin-arginine export assay was previously shown to select for soluble target protein (37) and remove gene synthesis errors (38). We here use the assay to select for (and subsequently identify by sequencing) the soluble subsets of each library that do not result in aggregation of β-lactamase fusion proteins.
Selection of libraries DN and R on ampicillin, followed by NGS-based quantification of library diversity (i.e. the number of unique sequences represented), allows identification of soluble subsets of each library (and additionally an assessment of library quality: Figs. S4 & S5). Figure 3 shows the results of selection on 100 µg/mL ampicillin (the highest concentration assayed). When plated without ampicillin at 30 °C (Fig. 3a), 80-85% of theoretical library diversity was identified above a threshold of 100 reads-per-million (DN: 81%, 1452/1800; R: 83%, 1501/1800; for read-count distributions see Fig. S4). Post-selection on ampicillin, the fraction of the library identified by sequencing dropped to 43% and 34% for libraries DN and R, respectively. This indicates that the majority of both libraries are insoluble when expressed recombinantly in E. coli. Aggregation of de novo proteins expressed recombinantly has been noted previously and is consistent with this result (6).
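As an illustration of how library diversity can be derived from sequencing counts, the following minimal sketch normalises raw read counts to reads-per-million (RPM) and reports the fraction of the designed library detected. It assumes that a variant counts as detected when its RPM is at or above the threshold; the function name, identifiers and toy counts are hypothetical.

```python
def library_diversity(read_counts, library_size, rpm_threshold=100.0):
    """Count library members detected at or above an RPM threshold.

    read_counts: dict mapping a library member to its raw read count.
    library_size: number of designed sequences (1800 per library here).
    Returns (number detected, fraction of the designed library).
    """
    total_reads = sum(read_counts.values())
    detected = [
        name for name, n in read_counts.items()
        if n * 1e6 / total_reads >= rpm_threshold
    ]
    return len(detected), len(detected) / library_size


# Toy counts summing to one million reads, so RPM equals the raw count.
toy_counts = {"dn_0001": 600_000, "dn_0002": 399_900, "dn_0003": 60, "dn_0004": 40}
n_detected, fraction = library_diversity(toy_counts, library_size=4)
print(n_detected, fraction)

# The percentages quoted in the text follow directly from the counts given:
print(round(100 * 1452 / 1800), round(100 * 1501 / 1800))
```

Applying the same function to pre-selection and post-selection counts gives the change in representation caused by ampicillin selection.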
Repeating the same assay at 37 °C (Fig. 3b), we found overall lower diversity on pre-selection plates (DN: 63%, 1142/1800; R: 61%, 1050/1800). However, a greater relative drop in representation was seen upon ampicillin selection, to 14% and 7% for libraries DN and R, respectively. A greater efficacy of selection for solubility at 37 °C is consistent with greater overexpression at 30 °C, and could also indicate the presence of slow folders which are less able to avoid aggregation at increased temperatures. Interestingly, the trend for putative de novo proteins to have higher solubility on average than synthetic random sequences held at both temperatures, and was also consistent when split by human and fly subsets (see Fig. S6). Furthermore, many of the sequences selected at 37 °C were also selected at 30 °C and may represent the most 'robustly soluble' sequences.

Fig. 3. A cell-based assay identifies subsets of each library with potential for soluble expression. NGS of input (plated without ampicillin) and selected libraries (+100 µg/mL ampicillin) allows quantification of changes in library diversity following twin-arginine export assay. a) Libraries DN and R had 43% and 34% representation following selection at 30 °C. b) At 37 °C, surviving fractions dropped to 14% (DN) and 7% (R), indicative of higher solubility for the putative de novo library. The majority of surviving variants at 37 °C also survived at 30 °C; hatched bars indicate shared representation at each temperature.
Putative de novo proteins may have higher intrinsic solubility than unevolved random controls. To further investigate the properties of our putative de novo and true random sets, libraries were expressed in a cell-free format using a reconstituted E. coli expression system including transcriptional and translational machinery.
Cell-free (in vitro) recombinant expression has two key benefits in this case: first, it allows tight control of expression conditions and control of cofactor concentrations, and second, it separates intrinsic target-protein behaviour (e.g. aggregation propensity) from the complex cellular milieu (39). Libraries were expressed in vitro with a C-terminal FLAG®-tag and target protein detected by Western blot (Fig. 4a). In addition to total yield (T), the subset of soluble library protein is isolated and loaded in adjacent lanes (S). Quantification of the intensity of the soluble:total lanes therefore provides an estimate of the fraction of soluble expression in a given sample.
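The soluble:total quantification amounts to a simple ratio of lane intensities, which can be averaged over replicate blots. The sketch below assumes background-subtracted densitometry values in arbitrary units; the function name and the numbers are illustrative only and are not measurements from this study.

```python
from statistics import mean


def soluble_fraction(soluble_intensity, total_intensity):
    """Fraction of expressed protein that is soluble, from lane S and lane T."""
    if total_intensity <= 0:
        raise ValueError("total lane intensity must be positive")
    return soluble_intensity / total_intensity


# Hypothetical (S, T) band intensities from three replicate blots.
replicates = [(310.0, 1000.0), (270.0, 950.0), (330.0, 1100.0)]
fractions = [soluble_fraction(s, t) for s, t in replicates]
mean_fraction = mean(fractions)
print(f"mean soluble fraction: {mean_fraction:.2f}")
```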
Here, base expression (left panel) was compared to yield achieved in the presence of molecular chaperone systems added to the cell-free reaction (see Methods). GroEL/ES and DnaK systems were added co-translationally, i.e. were present from the start of the reaction. As can be seen in Figure 4a, soluble protein makes up only a fraction of total expression in the absence of additional molecular chaperone. This was true for both the putative de novo proteins (top row) and the set of true random sequences (bottom row). Most notably, the soluble fraction inferred from blot intensity is consistent with the fraction of each library selected by the twin-arginine assay at the same temperature (37 °C). The same trend for DN to be slightly more soluble than R is also seen here.
Upon addition of the GroEL/ES system ('GroEL+'), no major difference in soluble yield was seen for either library. However, upon DnaK addition ('DnaK+') both libraries were highly solubilised (seen by intensity in lane S being close to that in lane T). When both DnaK and GroEL/ES systems were added, the improved solubility was maintained for library DN. However, for library R, addition of GroEL/ES appeared to counteract the effect of DnaK and solubility dropped closer to basal levels.
The difference seen between GroEL/ES and DnaK may be explained by their differing mechanisms; although GroEL has been shown to interact with random proteins (40), its more involved mechanism may require a greater degree of substrate adaptation. Figure 4b shows predicted DnaK binding sites for each library, compared to the subset of length-matched annotated human proteins. Library sequences are predicted to have on average four regions for which DnaK should have high affinity (short hydrophobic regions with positively charged residues) (Fig. 4b). This is comparable to the prediction for conserved proteins, which may help explain why DnaK is effective and acts similarly for libraries DN and R (∼3-fold solubility increase).

Proteolytic assay identifies large amount of undegradable protein for de novo and truly random sequences.
Having seen consistent trends for the solubility of each library when expressed in E. coli as fusion constructs, and in a cell-free format, we next investigated the structural content using a Lon-based proteolytic assay (23,41). Using the same cell-free expression system (see Fig. 4a), Lon protease was added to reaction mixtures. Lon's preference for non-specific cleavage of exposed hydrophobic regions means that it causes the greatest amount of degradation for IDP-like proteins, and in general for proteins with lower structural propensity.
Figures 5a and 5b show representative blots for libraries DN and R, respectively, with addition of DnaK and Lon protease to cell-free reaction mixtures. Quantification of blot intensity over replicate blots allows an estimation of the degradable fractions of each library with respect to solubility (see Methods for more details). This is illustrated in Figure 5c, with soluble fractions (green hues) split by degradability (dark blue; soluble/undegraded, pale blue; soluble/degraded). The degraded and undegraded fractions of insoluble yield can also be inferred in this way (dark yellow; insoluble/undegraded, light yellow; insoluble/degraded).
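One way to picture this partitioning is sketched below. It is not the procedure from the Methods, which is not reproduced here; it assumes that total (T) and soluble (S) lane intensities are available with and without Lon, that insoluble yield can be taken as T minus S, and that Lon does not move protein between the soluble and insoluble pools. The function name and all numbers are hypothetical.

```python
def partition_fractions(total_no_lon, soluble_no_lon, total_lon, soluble_lon):
    """Split expressed protein into four classes, as fractions of untreated yield.

    Inputs are band intensities (arbitrary units) for the total and soluble
    lanes, without and with Lon protease. Small negative differences arising
    from measurement noise are clamped to zero.
    """
    insoluble_no_lon = total_no_lon - soluble_no_lon
    insoluble_lon = total_lon - soluble_lon
    parts = {
        "soluble/undegraded": soluble_lon,
        "soluble/degraded": soluble_no_lon - soluble_lon,
        "insoluble/undegraded": insoluble_lon,
        "insoluble/degraded": insoluble_no_lon - insoluble_lon,
    }
    return {name: max(value, 0.0) / total_no_lon for name, value in parts.items()}


# Hypothetical intensities: T = 100 and S = 30 without Lon; T = 70 and S = 12 with Lon.
example = partition_fractions(100.0, 30.0, 70.0, 12.0)
for name, fraction in example.items():
    print(f"{name}: {fraction:.2f}")
```

In this toy case the four fractions sum to one and most insoluble protein is undegraded, the same qualitative pattern described for Figure 5c.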
As can be seen in Figs. 5a and 5b, addition of Lon protease causes a reduction in both the total yield and that of the soluble subset (where degradation is most visible). The fact that some soluble protein remains undegraded points to a degree of structural content even for the soluble fraction. In other words, a fraction of both the de novo and true random proteins have soluble expression, not all of which consists of IDP-like proteins (i.e. soluble and disordered). Quantifying this in Figure 5c shows that, considering only the soluble fraction, library DN has a greater proportion of these IDP-like proteins than library R, where less of the soluble fraction was degraded than not. In the insoluble fraction, for both libraries the majority of protein is inferred as undegradable.
We suggest that this corresponds to insoluble proteins with above-average structural potential.
With addition of DnaK, the same solubility increase as before (see Fig. 4a) was seen. Comparing library DN to its no-DnaK reference suggests that DnaK has acted to prevent much of the soluble/undegraded fraction from converting to the insoluble/undegraded fraction. Similarly, DnaK appears to have prevented much of library R's soluble/undegradable fraction from aggregating. However, solubilisation of library R does not appear to result in a concurrent increase in the soluble/degraded fraction (IDP-like). This may be best explained by the overall lower degradation seen for library R. Combining soluble and insoluble fractions, library R can be seen to have higher apparent structural propensity compared to library DN (Fig. 5d).

Discussion
Given an emerging picture of abundant structure and function within sequence space, an outstanding question is if de novo proteins differ from other classes of random protein.
In other words: do de novo proteins occupy a privileged area of sequence space with respect to structure or function?
Direct attempts to answer this question have so far not been made. Instead, experimental evidence from unnatural random sequence libraries has formed the basis for many hypotheses regarding de novo emergence. Further, direct investigation of de novo proteins has been limited to either computational prediction or experimental characterisation of individual proteins. Going beyond these studies, we assess a library of putative de novo proteins experimentally and compare their properties to a matched library of unevolved random sequences. In doing so, we show that recently emerged de novo proteins behave similarly to unevolved counterparts, but that the set of de novo proteins harbours a larger fraction of soluble and protease-sensitive sequences.
Recent improvements in DNA synthesis technology have made it feasible to generate large libraries of high-fidelity sequences. Using oligonucleotide library synthesis (OLS), it is possible to investigate proteins in high-throughput by direct specification of their coding sequences. We focus on short de novo proteins (<66 aa) that we previously identified in human and fly, which can be synthesised directly in a single oligonucleotide. However, multiplex gene synthesis also makes this approach applicable to longer proteins specified over multiple oligos (38,42). Libraries generated in this way should ultimately allow coupling of computational identification and high-throughput investigation of diverse protein sequences.

Fig. 5. Non-specific cleavage of hydrophobic regions by Lon protease results in preferential degradation of disordered proteins, with a visible net reduction in yield for Lon+ samples. c) Quantification of degraded fractions with respect to solubility reveals a greater IDP-like (soluble/degraded) fraction for putative de novo proteins vs. 'true' random sequences. DnaK addition, however, results in a greater increase in the soluble/undegraded fraction than the IDP-like fraction (for both DN and R). d) Summary of degraded vs undegraded fractions, regardless of solubility (sum of dark and light bars in (c), respectively). Library R is marginally less degradable than DN, suggesting slightly higher structural propensity.
Having designed a library of 1800 random sequences (R) with amino-acid frequencies and lengths matched to a set of 1800 putative de novo sequences (DN), we ran primary-sequence based predictions for a number of biophysical properties. Given that all computational predictions are highly similar between the two libraries, a possible conclusion is that our library of de novo proteins is generally close to the set of synthetic random sequences, and that their shared biophysical propensities result from their matched amino acid compositions. However, the reliability of predictions for random-type proteins remains ambiguous, given that it is only possible to validate prediction tools on well characterised proteins which are typically well conserved. Furthermore, the predictors rely heavily on sliding-window assessments of sequence composition which could struggle to differentiate DN and R. In light of this, experimental characterisation remains critical to any conclusions regarding this class of proteins; a step that has until now not been attempted for more than a handful of de novo proteins.
We first assessed solubility of our libraries using a twin-arginine export quality assay (36), shown to select for soluble and folded proteins (38). Sequencing of libraries DN and R after selection showed that at least one third of each library (43% and 34%, respectively) has potential for soluble expression at 30 °C. Interestingly, computationally predicted properties did not correlate with those sequences most enriched by selection (i.e. the most soluble variants). Any distinguishing properties of these sequences were therefore not captured by computational tools, further highlighting the need for experimental characterisation.
Next, we expressed each library in cell-free format using reconstituted E. coli expression apparatus. Given that the putative de novo proteins were sourced from human and fly, cell-free expression allows separation of the inherent biophysical properties of each library and the unnatural E. coli cellular environment. In addition, the cell-free format enables systematic changes to expression conditions, including addition of molecular chaperones to aid solubility, or proteases to assess protein stability. In the absence of chaperones, we found putative de novo proteins to have higher solubility than their unevolved random counterparts (∼30% soluble vs ∼15%). This trend is in agreement with the twin-arginine export assay, with a higher fraction of the de novo library having soluble potential. The higher solubility of putative de novo proteins may reflect their exposure to selection; avoidance of aggregation has been suggested as a key selective pressure on novel proteins (15). Despite their recent emergence, and typically low and tissue-specific expression, selection may have shaped the properties of these sequences to some degree.
We next tested the effect of two chaperone systems, GroEL and DnaK, on the expression of each library. While GroEL had no effect on solubility or overall expression, DnaK increased the soluble fraction of both libraries by ∼3-fold. This resulted in soluble fractions of ∼90% (DN) and ∼60% (R), most likely due to DnaK having similar effectiveness on both libraries and preventing approximately equal amounts of protein from forming insoluble aggregates. The effectiveness of DnaK on random proteins was demonstrated recently (23). Confirming this result for putative de novo proteins indicates that DnaK (or its eukaryotic homolog Hsp70) may be essential for avoidance of aggregation in the early stages of protein evolution.
Finally, to probe the structural content of each library, we included Lon protease in the cell-free expression system (41). By preferentially cleaving exposed hydrophobic regions of unstructured proteins, Lon degradation correlates with intrinsic disorder (43). A Lon-based method was recently used to probe random-sequence libraries of different amino acid compositions (23), identifying a significant proportion of the soluble fraction of each library to be resistant to degradation. In addition, increasing solubility with DnaK also had a small effect on the fraction of non-degradable protein. While the precise fractions of degraded protein for each condition should be interpreted with care, in both cases over 50% of soluble protein was not degraded by Lon upon DnaK addition. A subset of each library may therefore harbour structural elements that interfere with cleavage, in agreement with findings that structure is abundant in sequence space (23). However, the low resolution of the Lon-assay prevents differentiation of different forms of structural elements, such as oligomeric or molten globule. Interestingly, we find 10-20% higher degradation for putative de novo proteins compared to synthetic random sequences, in agreement with our earlier report showing that unevolved sequences with less structural content are more soluble upon expression in E. coli (22).
Although putative de novo proteins appear marginally more soluble than synthetic random proteins, both show sensitivity to molecular chaperones. Similarly, while a subset of both libraries may harbour structural content, putative de novo proteins appear to contain more disordered regions, in correlation with their higher solubility. We note that our study is limited to short proteins of a specific composition and GC-content distribution. While the results presented here transcend earlier computational analyses and studies of single de novo proteins, we note that it is not possible to ultimately prove any instance of de novo emergence and there remains a degree of uncertainty about the true origin of the putative de novo proteins studied here. Some of the putative de novo set, in particular those from H. sapiens, may in fact be transient short-lived proto-genes which have not yet assumed critical cellular roles (but are nonetheless evolutionarily highly relevant; see Keeling et al. (44)).
In summary, we suggest that de novo proteins are not especially privileged among random sequences, and that the propensity for structure across sequence space may be key to the feasibility of de novo emergence. However, our findings of higher solubility for putative de novo proteins are consistent with early selection pressure to avoid aggregation. To corroborate this finding, larger numbers of de novo proteins drawn from diverse genomic backgrounds should be characterised in future efforts.

98183 to KH and EBB. KH, FB and VT were additionally funded by Primus grant PRIMUS/20/SCI/012 from Charles University. MA received funding from a DAAD Research Scholarship for doctoral students. Open Access funding provided by Projekt DEAL. Figure 1 was created with BioRender.com.

Methods
Library sequence selection. To study the properties of de novo and random-sequence proteins experimentally, two libraries were first designed in silico. In prior work, we identified large sets of putative de novo proteins which appear to have emerged from previously non-coding DNA. To build a de novo library (DN), 1800 proteins were selected from two studies identifying de novo genes in fly (n=176) (13), and newly-transcribed human ORFs (n=1624) (12) ('conservation level 0' in Dowling et al. (2020), excluding ORFs with exon-overlap). A library of 1800 unevolved random-sequence proteins (R) was then generated synthetically by sampling amino acids using the frequency distribution of library DN. Sequence lengths were also matched to those of library DN, so that library R matched library DN in both length distribution and amino acid composition.
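The construction of library R can be illustrated with a minimal sketch, assuming that residues are drawn independently from the pooled amino acid frequencies of library DN and that each random sequence takes the length of one DN member. The function names, the fixed seed and the absence of any special handling of the initiator methionine are assumptions made for illustration; they are not taken from the original pipeline. Note that independent sampling matches the DN composition in expectation rather than exactly.

```python
import random
from collections import Counter


def pooled_composition(sequences):
    """Count every residue across all sequences of a library."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    return counts


def random_library(dn_sequences, seed=0):
    """Return one random sequence per DN sequence, with the same length.

    Residues are sampled independently, weighted by their pooled
    frequency in the DN library (hypothetical re-implementation).
    """
    rng = random.Random(seed)  # seeded for reproducibility
    counts = pooled_composition(dn_sequences)
    residues = sorted(counts)
    weights = [counts[r] for r in residues]
    return [
        "".join(rng.choices(residues, weights=weights, k=len(seq)))
        for seq in dn_sequences
    ]
```

Calling `random_library(dn_sequences)` on a list of DN protein strings returns a list of the same size with pairwise-matched lengths.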
Oligonucleotide pool design. Libraries DN and R were synthesised as a SurePrint oligonucleotide pool by Agilent (DE). Oligonucleotides were specified to include NdeI and XhoI restriction sites 5' and 3' of the CDS for downstream cloning. Additionally, 15 bp primer sites were added up- and downstream of the restriction sites to allow libraries DN and R to be PCR amplified separately from the oligo pool. The DnaChisel package (45) was used to codon optimise CDSs for protein expression in E. coli, while avoiding introduction of undesired restriction sites and homopolymer repeats of 5 bp or longer. Starting from desired amino acid sequences, we selected the highest frequency codon according to E. coli K12 frequencies (http://www.kazusa.or.jp/codon), and DnaChisel's 'harmonized Relative Codon Adaptiveness' implementation was used to replace rare codons (46). Code to generate optimised oligo pools was run to select and optimise the 1800 longest compatible open reading frames (ORFs) from a list of human and fly de novo ORFs; the command line is reproduced after the figure captions.
Prediction of protein properties. Intrinsic structural disorder and globularity were calculated using IUPred2A (47); secondary structure, Phi and accessible surface area (ASA) were predicted using SPIDER3 (48); aggregation propensity was predicted using TANGO (49); isoelectric point (IEP) was predicted using EMBOSS pepstats (50); and grand average of hydropathy (GRAVY) index was calculated using CodonW (51). To predict stability scores, we used an implementation of UniRep (35,52) to generate sequence embeddings of size 1900, and trained a sparse linear model (Lasso least-angle regression with 10-fold cross-validation) on a dataset of de novo-designed proteins with experimentally determined stability scores (34), as described by Alley et al. (35). As a comparison for predictions, 3600 annotated human proteins (Ensembl 97 H. sapiens proteome) were selected by random sampling of an equal-length protein for each member of library DN. DnaK binding sites were predicted using the ChaperISM suite (v1) in quantitative mode with default settings (53).
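The sequence constraints stated under oligonucleotide pool design (no undesired restriction sites, no homopolymer repeats of 5 bp or longer) can be verified independently of the optimiser. The sketch below is a plain-Python check and does not use DnaChisel. It assumes that the undesired sites are the NdeI (CATATG) and XhoI (CTCGAG) recognition sequences used for cloning; the text does not list which sites were excluded, so this choice and the function name are illustrative only. Both sites are palindromic, so scanning one strand is sufficient.

```python
import re

# Assumed set of sites to exclude from the CDS (illustrative choice).
FORBIDDEN_SITES = {"NdeI": "CATATG", "XhoI": "CTCGAG"}

# Any single base repeated five or more times in a row.
HOMOPOLYMER = re.compile(r"([ACGT])\1{4,}")


def check_cds(cds):
    """Return a list of human-readable constraint violations for a CDS."""
    cds = cds.upper()
    problems = []
    for enzyme, site in FORBIDDEN_SITES.items():
        position = cds.find(site)
        if position != -1:
            problems.append(f"{enzyme} site at position {position}")
    match = HOMOPOLYMER.search(cds)
    if match:
        problems.append(
            f"homopolymer {match.group(0)} at position {match.start()}"
        )
    return problems
```

An empty list indicates that the coding sequence satisfies both constraints under these assumptions.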
Twin-arginine export quality assay. To screen for soluble proteins, libraries were expressed as fusions with an N-terminal Tat secretion signal (ssTorA) and a C-terminal β-lactamase. Misfolding or aggregation of the target ORF should prevent secretion of the construct to the E. coli periplasm, allowing selection by plating on increasing concentrations of ampicillin. Libraries DN and R were PCR-amplified separately from the oligonucleotide pool, with primers introducing EcoRI and BamHI restriction sites. After restriction cloning to pSALECT-EcoBam (Addgene plasmid 59705), libraries were transformed by electroporation to E. cloni 10G SOLO cells (Lucigen). Whole transformations were plated on LB-agar + 25 µg/mL chloramphenicol and grown overnight. Libraries were then scraped from the plates into LB medium to make glycerol stocks adjusted to have the same OD600. Stocks were kept at -80 °C and used for all subsequent plating assays. The assay involved plating equal volumes of glycerol stock on LB-agar supplemented with either: 25 µg/mL chloramphenicol, 25 µg/mL chloramphenicol and 4 µg/mL ampicillin, or 25 µg/mL chloramphenicol and 100 µg/mL ampicillin. After incubation overnight at 30 °C, plates were scraped into PBS and plasmid isolated (GeneJET Plasmid Miniprep Kit, Thermo Scientific). Primers encoding 8-bp 5' and 3' barcodes were used to amplify samples from each condition.
Next-generation sequencing. Amplicons from Twin-arginine export assay conditions were purified, combined in equimolar amounts, and amplicon size distribution (270-350 bp) verified by capillary electrophoresis. Amplicons were subsequently sequenced using an Illumina MiSeq platform. Reads were merged, trimmed and filtered to remove low quality reads using the fastp suite (54). The cutadapt suite (55) was used for read demultiplexing, and reads were then mapped to CDS sequences of libraries DN and R using the Burrows-Wheeler Alignment (BWA) MEM algorithm (56). SAMtools was used for conversion to SAM file format, sorting and indexing (57). Finally, reads mapped to each variant were counted using HTSeq (58). Read counts were converted to reads-per-million-reads (RPM) values (per plating condition) to control for sequencing depth, and sequences were subsequently filtered using a threshold of 100 RPM to remove those with very low abundance (i.e. <0.01% of reads in a given sample).
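The RPM normalisation and abundance filter can be written out as a short sketch. It assumes that per-variant read counts for one plating condition are held in a dictionary, and that variants at exactly 100 RPM are retained; the function names and this boundary choice are illustrative assumptions rather than details of the original analysis.

```python
def to_rpm(read_counts):
    """Convert raw per-variant read counts to reads per million (RPM)."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("sample contains no mapped reads")
    return {variant: n * 1_000_000 / total for variant, n in read_counts.items()}


def filter_low_abundance(rpm_values, threshold=100.0):
    """Keep variants at or above the RPM threshold (100 RPM = 0.01% of reads)."""
    return {v: rpm for v, rpm in rpm_values.items() if rpm >= threshold}
```

Because RPM is computed per plating condition, the same variant can pass the filter in one condition and be removed in another.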
Cell-free expression and Lon proteolytic assay. Both protein libraries were produced in a cell-free expression system to evaluate their solubility, response to chaperones and structural content (using proteolysis resistance) in a cell-like environment. Expression from mRNA templates was carried out in a reconstituted E. coli cell-free system, and solubility was assessed by centrifugation to separate the soluble fraction, followed by quantitative Western blot. Bacterial Lon protease preferentially cleaves unstructured proteins and was added to the reactions to investigate the proteolytic resistance of the protein libraries (23).
First, library subpools were PCR-amplified to introduce EcoRI and BamHI restriction sites, subcloned into a pET24a+ vector modified to encode a C-terminal FLAG®-tag, and electroporated into E. cloni 10G cells (Lucigen, USA). Cells were grown overnight at 37 °C on LB-agar + 50 µg/mL kanamycin plates and transformants scraped for plasmid DNA isolation. The region containing the T7 promoter, library sequence and terminator was PCR amplified to serve as template for in vitro transcription (NEB HiScribe T7 kit, USA). The PUREfrex 2.0 system (GeneFrontier Corporation, Japan) was used for in vitro translation. The reactions were mixed as per protocol to a final volume of 10 µL with addition of 0.05% Triton X-100 and incubated at 37 °C for 2 h. To assess the effect of molecular chaperones on the soluble yield of protein expression, reactions were supplemented with DnaK or GroE mix (GeneFrontier Corporation, Japan), to a final concentration of 5 µM DnaK, 1 µM DnaJ and GrpE, 0.1 µM GroEL and 0.2 µM GroES. For the proteolytic resistance assay, purified Lon protease was added co-translationally at 0.1 µM working concentration.
Following production, all reactions were halted by adding 40 µL of puromycin buffer (300 µM puromycin, 50 mM Tris, 100 mM NaCl, 100 mM KCl, pH 7.5) and incubating at 30 °C for 30 min. Next, 5 µL of this mixture was processed for SDS-PAGE, serving as the total (T) fraction of expression, while the rest was centrifuged (21,000 × g, 30 min, 21 °C). The soluble (S) fraction was collected by taking 5 µL of the supernatant. Finally, three technical replicates for each sample were analyzed by SDS-PAGE and Western blot using Anti-FLAG® antibody (Sigma-Aldrich Monoclonal ANTI-FLAG® M2-Peroxidase (HRP) antibody, A8592). Images were quantified using ImageJ (U. S. National Institutes of Health, USA).
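Since equal volumes (5 µL) of the total and soluble fractions are loaded, one plausible way to summarise the quantified blots is the ratio of soluble to total band intensity, averaged over the three technical replicates. The sketch below assumes exactly this calculation; the text does not state the formula applied to the ImageJ intensities, so the function names and the S/T definition are illustrative assumptions.

```python
from statistics import mean, stdev


def soluble_fraction(total_intensity, soluble_intensity):
    """Ratio of soluble (S) to total (T) band intensity for one replicate."""
    if total_intensity <= 0:
        raise ValueError("total band intensity must be positive")
    return soluble_intensity / total_intensity


def summarise_replicates(replicates):
    """Mean and sample standard deviation of S/T over (T, S) replicate pairs."""
    fractions = [soluble_fraction(t, s) for t, s in replicates]
    return mean(fractions), stdev(fractions)
```

The same ratio could be computed for Lon-treated and untreated reactions to estimate a degraded fraction, subject to the caveat in the text that such fractions should be interpreted with care.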

Fig. 1. Library design, synthesis and experimental outline. a) Schematic illustrating the in silico design of libraries of de novo and unevolved random-sequence proteins. A de novo library (DN) was built from putative de novo proteins identified in human and fly: subsequently, a library of unevolved random sequences (R) was designed to mirror the length and amino acid frequencies of library DN. The two libraries were synthesised by OLS ready for experimental study. b) Approaches used to profile solubility and structure content of each library. Following amplification, each library was expressed in a chaperone-assisted cell-free format, and a non-specific Lon protease was used to quantify structural content. In parallel, subcloned libraries were expressed in E. coli to screen for soluble and folded variants that did not disrupt periplasmic export.

Fig. 2. Biophysical predictions are similar for de novo and unevolved random sequences and suggest that both harbour high structural potential. Libraries DN (dark blue) and R (pale blue), designed to have matched length and amino acid frequencies, are predicted to have highly similar biophysical properties as expected. Comparison to a length-matched subset of the human proteome (yellow) shows broadly similar predictions, suggesting that native-like properties are present in, or at least evolutionarily accessible to, random sequence proteins. Diamonds indicate mean value of distributions, which are subsampled to 250 sequences for visualisation.

Fig. 4. Cell-free expression indicates that a library of putative de novo proteins (DN) is more soluble than synthetic random sequences, and that DnaK solubilises both libraries equally well. a) Western blot showing total (T) and soluble (S) fractions of bulk library expression using reconstituted E. coli machinery in cell-free format (37 °C). Library DN (top row) is marginally more soluble than library R (bottom row). Co-translational chaperone addition (DnaK, GroEL, or both) shows that GroEL has little effect but that DnaK solubilises both libraries. b) Bioinformatic prediction of the number of DnaK binding sites per sequence for libraries DN and R, with a length-matched set of annotated human proteins included for comparison.

Fig. 5. Quantification of degraded library fractions following cell-free expression in the presence of Lon protease. Total (T) and soluble (S) expression with co-translational (37 °C) addition of DnaK and/or Lon protease: representative Western blots shown for libraries DN (a) and R (b). Non-specific cleavage of hydrophobic regions by Lon protease results in preferential degradation of disordered proteins, with a visible net reduction in yield for Lon+ samples. c) Quantification of degraded fractions with respect to solubility reveals a greater IDP-like (soluble/degraded) fraction for putative de novo proteins vs. 'true' random sequences. DnaK addition, however, results in a greater increase in the soluble/undegraded fraction than the IDP-like fraction (for both DN and R). d) Summary of degraded vs undegraded fractions, regardless of solubility (sum of dark and light bars in (c), respectively). Library R is marginally less degradable than DN, suggesting slightly higher structural propensity.
python build_oligos.py -i denovo_orfs.csv -s e_coli -c harmonize_rca -t h_sapiens -n 1800 -r 1 -d primers.db -p 15 -fL CAT -fR CTCGAG