Abstract
In fifteen years, Synthetic Biology (SB) has moved from proof-of-concept designs to several flagship achievements. Standardisation efforts are still under way, basic engineering concepts such as modularity and orthogonality remain controversial in biology, and making predictions from computer models is still unreliable. A deep characterisation of the patterns of re-use of biological blocks in SB has not been attempted to date. We have compared the topological organisation of two different technological networks, one associated with a standard, large-scale software repository and the other provided by the Registry of Standard Biological Parts (RSBP). Our results strongly suggest that software engineering, and not industrial engineering, is the complex system closest to SB. In both cases, combining the assembly of standard or quasi-standard components with tinkering may not be at odds with success.
I. INTRODUCTION
Any technology experiences a phase of explosive growth as a consequence of the combinatorial potential that pervades innovation (Arthur 2009). As more inventions become available, the repertoire of possible artefacts that can be created from previous pieces, designs or modules expands exponentially. The textbook example is provided by electronics (Williams 1985). Starting from the first large, clumsy and inefficient transistor, the rise of Information Technology (IT) became a reality thanks to the open-ended nature of combinatorial logic (any circuit can be designed from small pieces) and the parallel development of software engineering (Mens and Demeyer 2008). These two domains in fact coevolved over time, completely changing our relationship with information and design. Synthetic Biology (SB) is an emerging field that offers a similar potential to change the biotechnology landscape. Largely inspired by standard IT, it focuses on translating into the biological realm such engineering pillars as modularity, orthogonality and standardisation (Purnick and Weiss 2009).
In just one decade, SB has moved from proof-of-concept designs to several flagship achievements attributed to SB-driven strategies. These include microbial drug synthesis, the production of new biofuels and novel approaches to disease treatment (Church et al., 2014). The newborn discipline has raised great expectations but also considerable hype. One particular claim is the capacity of SB to exploit combinatorial design principles as the engine of novel, more complex living machines, built by assembling functionally self-contained parts such as promoters, coding sequences, ribosome binding sites, terminators or protein domains. In this context, the informational nature of living systems (Maynard Smith 2000, Nurse 2008) supports an analogous path of technological development. But the LEGO-like notion of the field has been seriously questioned (Kwok 2010), raising two questions. The first concerns the value of the similarities between IT and SB. The second is whether the landscape of SB designs is growing as a consequence of combinatorial design. In order to answer these questions, a quantitative, systems-level view of the technological landscape is needed.
II. HARDWARE OR SOFTWARE?
Mounting concerns have emerged in relation to the ambiguity of the industrial engineering and electronics metaphors (de Lorenzo, 2011; Porcar et al., 2015), the difficulty of fitting living things into rational design (Collins et al., 2014) and the inexactitude of the very concept of cells as biomachines (Nicholson, 2013). Are designed constructs like small, widely re-usable components (Andrianantoandro et al., 2006)? If not, the field's potential as a fast-expanding, successful technology might be questionable. What kind of approach can be used to prove or disprove the appropriateness of the analogy? Because the currently adopted representation of cells and synthetic circuits is in terms of logic gates, genetic parts might appear closer to hardware components. In this context, it has been recognised that genetic pieces are actively read out as a source of algorithmic instructions (Walker and Davies 2013). By contrast, transcription factors (and other parts of the molecular cell machinery) work as hardware operating on genetic instructions. If technological fields are to be compared, a more apt frame could be software engineering. Using system-level, network approaches, it has been shown that software systems can be described as complex webs of interacting parts (Valverde et al., 2002; Myers, 2003; Potanin et al., 2005) characterised by an extensive reuse of subsystems. More importantly, there are remarkable convergent traits shared between large software structures and molecular cell networks (Solé et al., 2011). In this paper, we identify global patterns of interaction in networks of co-occurrence of genetic parts used in SB designs in order to search for a fingerprint of combinatorial reuse.
There is an additional reason to consider software engineering here. An especially relevant connection between software development and the potential future of synthetic biology stems from the serious challenges experienced by the former shortly after it started to rise. Already in the 1960s, concerns emerged as software projects became more complicated, poorly specified and prone to unexpected failures and defects. The failure of software to match the continuous advances in hardware constrained the capacity of programmers to effectively exploit machine characteristics (Dijkstra, 1972). The main cause of this failure was a combination of poor quality and the lack of reusability of software components. Both factors led to chronic maintenance problems (Hooper and Chester 1991). A proposal to solve this “software crisis” was based on the reuse of interchangeable components (McIlroy, 1976). This reuse-based approach to software development, although promising, had limited applicability beyond libraries for highly specialised domains, e.g., mathematical routines. Instead of successful standardisation, a multiplicity of libraries, codes, programming languages and specialised domains emerged. Soon it became clear that even reliable parts are not necessarily reliable once placed in a new context (Coulange 1998). Some of the problems were solved two decades later while others still persist today (Garlan et al., 1995; Mili et al., 1995). For most practitioners of SB design, this story might sound familiar.
III. GENETIC PARTS VERSUS NETWORKS OF PARTS
The most used and best characterised compilation of biological parts is BioBricks™, a repository of up to 7822 DNA sequences, most of which issue from the international Genetically Engineered Machine (iGEM) competition. Critical meta-analyses of the performance of BioBricks are scarce (Vilanova and Porcar, 2014). In the present work, we have tackled for the first time a global analysis of the biological components from the Registry of Standard Biological Parts (http://parts.igem.org/Registry_API), and compared its structure and re-use features with both LEGO© (Crompton, 2012) and a large software system (Valverde et al., 2002).
The LEGO© metaphor has often been used within SB (Cserer and Seiringer, 2009; Collins, 2012) as a way of referring to its combinatorial potential: not surprisingly, the maximum award in iGEM is given in the form of an aluminium BioBrick trophy. There is a statistical pattern shared by all the systems discussed here: the rank-frequency distribution, which gives the number of times N(r) that the r-th most common element has been found. Fig. 1a displays this distribution for a given collection of LEGO sets (Crompton, 2012). In Fig. 1b we plot (blue dots) the corresponding distribution for the frequency of use of genetic parts. A similar pattern is shared by the frequency of use of software components within the JAVA library, a good example of object-oriented software (Fig. 1b, red dots). Here too, most elements in the library are used in a very specific way, whereas a few are widely used. In both cases, N(r) follows the so-called Zipf’s law (Newman, 2005) for large r. Zipf’s law is characterised by a power-law decay where frequency decays inversely with rank, i.e. N(r) ~ r^-γ with γ = 1. The origin of these fat-tailed statistics has been studied and is known to be a consequence of extensive reuse of existing structures by developers (Valverde and Solé, 2005).
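As a minimal illustration of how such a curve is obtained, the following Python sketch computes a rank-frequency distribution and estimates the exponent γ from a list of part usages; the part names and counts below are illustrative placeholders, not the actual Registry data:

    from collections import Counter
    import numpy as np
    import matplotlib.pyplot as plt

    # Placeholder data: one entry per use of a part in some construct.
    parts_used = (["BBa_B0034"] * 50 + ["BBa_E0040"] * 20
                  + ["BBa_B0010"] * 12 + ["BBa_C0012"] * 3)

    counts = Counter(parts_used)
    N = np.array(sorted(counts.values(), reverse=True))  # N(r) for r = 1, 2, ...
    r = np.arange(1, len(N) + 1)

    # On log-log axes, Zipf's law N(r) ~ r^-1 appears as a straight line of slope -1.
    gamma = -np.polyfit(np.log(r), np.log(N), 1)[0]
    plt.loglog(r, N, "o", label=f"estimated gamma = {gamma:.2f}")
    plt.xlabel("rank r")
    plt.ylabel("N(r)")
    plt.legend()
    plt.show()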
Both distributions are highly skewed: they are dominated by a few very common pieces, while most pieces are rare. A handful of objects are easily incorporated (the “generalists”, i.e. small pieces capable of gluing together other elements) while the majority are specialised. However, broad statistical distributions of this kind can be generated by different types of mechanisms, so a fat-tailed distribution is no guarantee that reuse is the leading one. To test this possibility, we need to measure the actual interactions among parts.
IV. NETWORK STRUCTURE AND PATTERNS OF REUSE
The frequency-rank distribution does not capture the real architecture of technological complexity, which is associated with the way elements connect to each other. This means that, in order to get a global picture of the organisation of these systems, a network approach is required (Dorogovtsev and Mendes 2003; Newman, 2010; Valverde and Solé, 2005). By considering the links existing among gates within electronic circuits (Ferrer et al., 2001), libraries within software projects (Valverde et al., 2002; Myers, 2003; Potanin et al., 2005), gene networks (Teichmann and Babu, 2004) or influences among inventions (Valverde et al., 2007), it is possible to unravel the origins of the networks connecting logic components, programs or patents. Of note, several studies have shown that genetic networks and software programs share common structural principles (Yan et al., 2010). To further investigate the parallels and differences between software and BioBricks, we performed a network analysis of dependencies among JAVA components and genetic parts.
From the Registry of Standard Biological Parts (RSBP) we reconstructed the graph (hereafter GR) of logical dependencies among parts. The procedure is illustrated with an example in Fig. 2a. Here, a node represents a biological part and a link between two parts indicates that they are simultaneously engineered within the same construct. The arrows point from a given object to its constituents. The result is the BioBrick network, which is composed of several disconnected subgraphs. Most parts in the Registry (88%) belong to a single, large subgraph (Fig. 2b), which we have studied in detail. A few highly connected components create a dense central region, while the majority of nodes are, on average, associated with a small number of other parts. For comparison, in Fig. 2c we also display the largest connected component of the JAVA software network (hereafter GJ). In both cases, links describe the use dependencies among either software (Valverde et al., 2002) or biological parts. The different shape of the two graphs suggests very different global patterns of organisation, as confirmed by our quantitative study. One obvious and significant difference is the density of connections. The JAVA graph is sparse, like most technological networks, exhibiting a low average number of links per node (average degree) that results from the supervised pruning of redundant relationships. This desirable feature is far from present in the RSBP graph, where the many connections indicate that no such supervised process is at work. Instead, as new parts are added, they are likely to be connected to other parts irrespective of the increasing redundancies that can arise.
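As a sketch of this reconstruction, assuming the Registry data has already been parsed into a mapping from each composite part to its list of constituents (the part names and the mapping below are hypothetical placeholders), the dependency graph can be built as follows:

    import networkx as nx

    # Hypothetical mapping: composite part -> its constituent parts.
    composites = {
        "BBa_K100000": ["BBa_B0034", "BBa_E0040", "BBa_B0010"],
        "BBa_K100001": ["BBa_B0034", "BBa_C0012", "BBa_B0010"],
    }

    G = nx.DiGraph()
    for composite, subparts in composites.items():
        for part in subparts:
            G.add_edge(composite, part)  # arrows point from object to constituent

    # Identify the disconnected subgraphs; most parts fall in the largest one.
    components = sorted(nx.weakly_connected_components(G), key=len, reverse=True)
    giant = G.subgraph(components[0])
    print(f"{giant.number_of_nodes()} of {G.number_of_nodes()} nodes in the largest subgraph")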
A common trait of many complex networks is the presence of the small-world (SW) effect (Watts and Strogatz, 1998). It is characterised by two key traits: (a) very small path lengths d_L, i.e. short paths separating any two arbitrary nodes; and (b) a high clustering coefficient C: many triangles are present, indicating that two nodes sharing a common neighbour are likely to be connected to each other as well. This feature has been shown to arise in systems with extensive reuse. The standard definition of the SW assumes two basic conditions: (1) d_L ≈ d_rand and (2) C ≫ C_rand, where the values of d_L and C are compared with the expected values d_rand and C_rand of a randomised version of the network in which all links are randomly assigned. Technological webs such as GJ are small worlds: here we have d_L(GJ) = 5.4, d_rand(GJ) = 6.28 and C(GJ) = 0.016, whereas the Registry graph is clearly not a SW, because d_L(GR) = 3.42 and C(GR) = 0.0002. Here we have a very low clustering (close to random) and a path length below the random expectation d_rand(GR) = 4.14.
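A minimal sketch of this test is shown below, using an Erdős–Rényi G(n, m) graph as the random baseline; this is one common choice of randomisation, assumed here for illustration, since the specific null model is not detailed in the text:

    import networkx as nx

    def small_world_stats(G, seed=0):
        """Compare path length and clustering of G with a randomised counterpart."""
        U = G.to_undirected()
        # Restrict to the largest connected component so path lengths are defined.
        U = U.subgraph(max(nx.connected_components(U), key=len))
        d_L = nx.average_shortest_path_length(U)
        C = nx.average_clustering(U)
        # Random baseline with the same number of nodes and links.
        R = nx.gnm_random_graph(U.number_of_nodes(), U.number_of_edges(), seed=seed)
        R = R.subgraph(max(nx.connected_components(R), key=len))
        return d_L, nx.average_shortest_path_length(R), C, nx.average_clustering(R)

    # A small world satisfies d_L ~ d_rand together with C >> C_rand.
    d_L, d_rand, C, C_rand = small_world_stats(G)  # G from the sketch above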
Dramatic differences arise when looking at the modular structure of these webs. A desirable, universal feature of both biological and technological networks is the presence of modularity (Baldwin and Clark, 2000). Modular patterns provide a source of robust local division of labour within a complex system: damage or malfunction of parts within a module does not propagate outside it. In a network context, a module is (roughly speaking) a subsystem whose components are more connected among themselves than with the rest of the network. Modularity is measured by means of a parameter Q that weights to what extent a given network exhibits a modular architecture (Newman, 2010). Q is close to 1 if the system is composed of very well-defined, weakly interconnected modules, whereas low values indicate that the system is weakly (or not at all) modular. Standard systems such as JAVA have high Q values (here QJ = 0.75), revealing a technology characterised by modules of functionally related components that also display multiple dependencies between them. By contrast, the Registry network has a rather small Q value (QR = 0.3), very far from its theoretical maximum. Such a lack of modular architecture tells us that there are no communities of more or less functionally related pieces. Almost everything connects with the “trivial” hubs, with little further organisation. Other measurements confirm this picture. Taken together, our results indicate that the Registry graph is far from a well-organised modular network, and thus lacks the expected attributes of a technological system with differentiated subparts.
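A sketch of how Q can be estimated follows; the text does not state which community-detection algorithm was used, so greedy modularity maximisation is only one plausible choice, assumed here:

    import networkx as nx
    from networkx.algorithms import community

    U = G.to_undirected()  # G from the reconstruction sketch above
    # Partition the network into communities, then score the partition.
    communities = community.greedy_modularity_communities(U)
    Q = community.modularity(U, communities)
    print(f"Q = {Q:.2f}")  # near 1: well-defined modules; low: weakly modular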
V. LESSONS FOR THE FUTURE
The comparative analysis reported here between biological parts and software components reveals that the landscape of SB, as captured by the network of dependencies, is completely different from the scenario predicted by extensive reuse. Our results show that the current iGEM repository strongly deviates from the desirable scenario in which different parts are easily combined with others. Instead, a few parts are very often used as almost inevitable components, with no additional reuse. The immediate consequence is a very limited potential for innovative combinations. As happened with the early crisis of software development, the promise of reusability within SB will not be met unless new directions are taken.
Although the poor reusability potential of the iGEM landscape paints a rather grim picture, we might gain some insight from the historical development of software. In the early years of IT, programming required getting very close to the hardware: programs were written in so-called machine code, essentially chains of 0’s and 1’s, which was difficult and very time-consuming. The first high-level language, FORTRAN, was easy to learn and simplified software development. At that time, it was assumed that the programming language would eliminate coding and debugging, solving problems at a low cost. Just ten years later, these expectations had completely failed to be fulfilled. It was discovered that bugs are always present and that considerable effort must be invested in finding and correcting them (McIlroy, 1976). Poor programming practices and inappropriate integration in large software projects have led to disaster (Hey and Pápay, 2015): in 2002, software bugs were estimated to cost the US economy $59 billion annually, or about 0.6 percent of the gross domestic product (Tassey, 2002). Correcting and maintaining large software projects cannot be achieved without cost, and good solutions were only obtained after the introduction of software engineering. Nevertheless, we can claim an overall success for software as a major technology.
As for SB engineering practices, two main points should be made. The first is that the lack of reusability revealed by our analysis is a consequence of independent, unconnected and non-standardised experimental practices. The quality and depth of documentation of BioBricks has been improving over time, but we still see a growing library of unconnected, case-specific constructs with small (or no) reusability. Even when documentation is complete, we lack an understanding of what to expect when different, previously unrelated parts are combined in a new construct. An improved theoretical framework is much needed. The second is that we should also face the possibility of novel ways of thinking about engineering biology. A major difference between SB and its technological counterparts is that synthetic constructs are built to work inside an already functional biological machinery where hardware and software meld. Even if we accept the metaphor of cells as machines (Lazebnik 2002), most current SB is intended to add or modify very small parts of an organism that differs in many ways from any existing man-made hardware. Our limited capacity for replacing or connecting multiple parts effectively confines us within the domain not of technology but of tinkering.
Acknowledgments
The authors would like to thank A. Crompton for providing the database of LEGO parts. This work has been supported by ERC Advanced Grant number 294294 from the EU seventh framework programme (SYNCOM), by the EU seventh framework programme (ST-Flow Project), by grants from the Botín Foundation, by Banco Santander through its Santander Universities Global Division (SV, RS), by the Spanish Ministry of Economy and Competitiveness, Grant FIS2013-44674-P and FEDER (SV), and by the Santa Fe Institute (RS).