## 0 Summary

Artificial Intelligence presents an important paradigm shift for science. Science is traditionally founded on theories and models, most often formalized with mathematical formulas handcrafted by theoretical scientists and refined through experiments. Machine learning, an important branch of modern Artificial Intelligence, focuses instead on learning from data. This leads to a fundamentally different approach to model-building: we step back and design algorithms capable of building models from data, but the models themselves are not designed by humans. This is even more true of deep learning, which requires little engineering by hand and is responsible for many of Artificial Intelligence’s spectacular successes [30]. In contrast to logic systems, the knowledge stored in a deep learning model is difficult to understand and reuse, and may involve up to a billion parameters [10]. On the other hand, probabilistic machine learning techniques such as deep learning offer an opportunity to tackle large, complex problems that are out of the reach of traditional theory-making. It is possible that the more intuition-like [30] reasoning performed by deep learning systems is mostly incompatible with the logic formalism of mathematics. Yet recent studies have shown that deep learning can be useful to logic systems, and vice versa. Success at unifying different paradigms of Artificial Intelligence, from logic to probability theory, offers unique opportunities to combine data-driven approaches with traditional theories. These advances are likely to significantly impact the biological sciences, where high dimensionality limits the reach of traditional theories.

## 1 A.I. and knowledge representation

Science would greatly benefit from a unification of Artificial Intelligence with traditional mathematical theories. Modern research at the intersection of logic, probability theory, and fuzziness has yielded rich representations increasingly capable of formalizing scientific knowledge. Such a formal corpus could not only include hand-crafted theories, from Einstein’s *e* = *mc*^{2} to the breeder’s equation [38], but also harness modern A.I. algorithms for testing and learning.

Comprehensive synthesis is difficult in fields like biology, which have not been reduced to a small set of formulas. For example, while we have a good idea of the underlying forces driving evolution, we struggle to build effective predictive models of molecular evolution [18]. This is likely because selection changes in time and space [4], which brings population, community, and ecosystem ecology into the mix. Ecology also has a porous frontier with evolution: speciation is a common theme in community ecology theory [12].

From a theoretical perspective, work to formalize scientific theories would reveal much about the nature of our theories. Surely, scientific theories require more flexibility than mathematical corpora of knowledge, which are based on pure logic. From a practical standpoint, a formal representation offers ways both to test large corpora of knowledge and to extend them with A.I. techniques. This is arguably the killer feature of a formal representation of scientific knowledge: allowing A.I. algorithms to search for revisions and extensions, and to discover new rules. This is not a new ambition. Generic techniques for rule discovery were well-established in the 1990s [37]. Unfortunately, these techniques were based on pure logic, and purely probabilistic approaches to revision cannot handle mathematical theories. Recent experience in linguistics has shown that building a knowledge base capable of handling several problems at the same time yields better results than attacking each problem in isolation, because of the problems’ interconnectedness [45]. Biology, as a complex field made of more-or-less arbitrary subfields, could gain important insights from a unified approach to knowledge combining A.I. techniques with traditional mathematical theories.

## 2 A quick tour of knowledge representations

Deep learning is arguably the dominant approach in probabilistic machine learning, a branch of A.I. focused on learning models from data [17]. The idea of deep learning is to learn multiple levels of composition. If we want to learn to classify images, for instance, the first layer of the deep learning network will read the input, the next layer will capture lines, the next layer will capture contours based on lines, then more complex shapes based on contours, and so on [17]. In short, the layers of the network begin with simple concepts and then compose more complicated concepts from simpler ones [5]. Deep learning has been used to solve complex computer science problems like playing Go at the expert level [41], but it is also used for more traditional scientific problems like finding good candidate molecules for drugs, predicting how changes in the genotype affect the phenotype [31], or, just recently, solving the quantum many-body problem [9].
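This idea of composition can be sketched in a few lines of plain Python: each layer transforms the previous layer's features into slightly more abstract ones. The layer sizes and weights below are arbitrary, purely for illustration.

```python
def relu(v):
    # Elementwise non-linearity; without it, stacked layers would
    # collapse into a single linear map.
    return [max(0.0, x) for x in v]

def dense(v, weights, biases):
    # One layer: weighted sums of the previous features, then relu.
    return relu([sum(w * x for w, x in zip(row, v)) + b
                 for row, b in zip(weights, biases)])

def forward(v, layers):
    # Composition: each layer's output feeds the next layer,
    # building complex concepts from simpler ones.
    for weights, biases in layers:
        v = dense(v, weights, biases)
    return v

# Two tiny hypothetical layers: 3 inputs -> 2 features -> 1 output.
layers = [
    ([[0.5, -1.0, 0.2], [1.0, 0.3, -0.7]], [0.0, 0.1]),
    ([[1.0, 1.0]], [-0.2]),
]
out = forward([1.0, 0.5, -0.5], layers)
```

Real deep learning frameworks add learned weights, many more layers, and specialized layer types, but the core structure is this repeated composition of simple transformations.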

In contrast, traditional scientific theories and models are mathematical, or logic-based. Einstein’s *e* = *mc*^{2} established a logical relationship between energy *e*, mass *m*, and the speed of light *c*. This mathematical knowledge can be reused: in any equation with energy, we could replace *e* with *mc*^{2}. This ability of mathematical theories to establish precise relationships between concepts, which can then be used as foundations for other theories, is fundamental to how science grows and forms an interconnected corpus of knowledge. Furthermore, these theories are compact and follow science’s tradition of preferring theories as simple as possible. There are many different foundations for logic systems. Predicate logic is a good starting point: it is based on predicates, which are functions of terms to a truth value. For example, the predicate *PreyOn* could take two species and a location, and return true if the first species preys on the second at that location, as in *PreyOn*(*Wolverine*, *Squirrel*, *Quebec*). Terms are either *constants* such as 1, *π*, or *Wolverine*; *variables* that range over constants, such as *x* or *species*; or *functions* that map terms to terms, such as addition, multiplication, integration, and differentiation. In *e* = *mc*^{2}, the equal sign = is the predicate, *e* and *m* are variables, *c* and 2 are constants, and there are two functions: the multiplication of *m* by *c*^{2} and the exponentiation of *c* by 2. The key point is that such formalism lets us describe compact theories and understand precisely how different concepts are related. Complex logic formulas are built by combining predicates with connectives such as negation ¬, “and” ∧, “or” ∨, and “implication” ⇒.
We could have a rule to say that predation is asymmetrical, *s*_{x} ≠ *s*_{y} ∧ *PreyOn*(*s*_{x}, *s*_{y}, *l*) ⇒ ¬*PreyOn*(*s*_{y}, *s*_{x}, *l*), or define the classical Lotka-Volterra predator-prey model:

*dx*/*dt* = *αx* − *βxy* ∧ *dy*/*dt* = *δxy* − *γy*,

where *x* and *y* are the population sizes of the prey and the predator, respectively, *α*, *β*, *δ*, *γ* are constants, and the time differential *d*/*dt*, multiplication, and subtraction are functions. Equality (=) is the sole predicate in this formula, and its two instances are connected via ∧ (“and”). Not all logic formulas have mathematical functions. Simple logic rules such as *Smoking*(*p*) ⇒ *Cancer*(*p*) (“smoking causes cancer”) are common in expert systems.
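To make the mechanics concrete, here is a toy Python sketch (the species and locations are invented) that encodes *PreyOn* as a set of ground facts and checks the asymmetry rule over all of them:

```python
# Hypothetical ground facts: (predator, prey, location) triples.
prey_on = {
    ("Wolverine", "Squirrel", "Quebec"),
    ("Wolf", "Moose", "Quebec"),
}

def PreyOn(sx, sy, loc):
    # A predicate: a function of terms to a truth value.
    return (sx, sy, loc) in prey_on

def asymmetry_holds(facts):
    # The rule: sx != sy and PreyOn(sx, sy, l) implies not PreyOn(sy, sx, l).
    # Check every grounding against the fact base.
    return all(not PreyOn(sy, sx, loc)
               for sx, sy, loc in facts if sx != sy)

ok = asymmetry_holds(prey_on)
```

In a real logic system the rule would be stated once and the reasoner would enumerate groundings itself; the loop above just makes that enumeration explicit.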

Artificial Intelligence researchers have long been interested in logic systems capable of scientific discovery, or simply capable of storing scientific and medical knowledge in a single coherent system (Figure 1). DENDRAL, arguably the first expert system, could form hypotheses to help identify new molecules using its knowledge of chemistry [32]. In the 1980s, MYCIN was used to diagnose blood infections (and did so more accurately than professionals) [8]. Both systems were based on logic, with MYCIN adding a “confidence factor” to its rules to model uncertainty. Other expert systems were based on probabilistic graphical models [27], a field that unites graph theory with probability theory to model the conditional dependence structure of random variables [27, 2]. For example, Munin had a network of more than 1000 nodes to analyze electromyographic data [14], while PathFinder assisted medical professionals with the diagnosis of lymph-node pathologies [22] (Figure 2). While these systems performed well, they were both too simple to store generic scientific knowledge and too static to truly unify Artificial Intelligence with scientific research. The ultimate goal is a representation rich enough to encode both logic-mathematical and probabilistic scientific knowledge.

## 3 Beyond monolithic systems

In terms of representation, expert systems generally used either a simple logic system, not powerful enough to handle uncertainty, or a purely probabilistic approach unable to handle complex mathematical formulas. In terms of flexibility, the expert systems were hand-crafted by human experts. After the experts established either the logic formulas (for logic systems like DENDRAL) or the probabilistic links (for systems like Munin), the expert systems acted as static knowledge bases, capable of answering queries but unable to discover new rules and relationships. While no system has completely solved these problems yet, much energy has been put into unifying logic-based systems with probabilistic approaches [16]. Also, several algorithms have been developed to learn new logic rules [37], find the probabilistic structure in a domain with several variables [46], and even transfer knowledge between tasks [36]. Together, these discoveries bring us closer to flexible knowledge bases contributed to by both human experts and Artificial Intelligence algorithms. This has been made possible in great part by efforts to unify three distinct languages: probability theory, predicate logic, and fuzzy logic (Fig 3).

The core idea behind unified logic/probabilistic languages is that formulas can be weighted, with higher values meaning we have greater certainty in the formula. In pure logic, it is impossible to violate a single formula. With weighted formulas, an assignment of concrete values to variables is only *less likely* if it violates formulas: the higher the weight of the violated formula, the less likely the assignment is. It is conjectured that all perfect numbers are even (∀*x*: *Perfect*(*x*) ⇒ *Even*(*x*)); if we were to find a single odd perfect number, that formula would be refuted. This all-or-nothing standard makes sense for mathematics, but in many disciplines, such as biology, important principles are only expected to be true *most* of the time. To illustrate: in ecology, predators generally have a larger body mass than their prey, which can be expressed in predicate logic as *PreyOn*(*predator*, *prey*) ⇒ *M*(*predator*) > *M*(*prey*), with *M*(*x*) being the body mass of *x*. This is obviously false for some assignments, for example *predator*: *grey wolf* and *prey*: *moose*. However, it is useful knowledge that underpins many ecological theories [44]. When our domain involves a great number of variables, we should expect useful rules and formulas that are not always true.
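A small Python sketch of this idea, loosely following Markov logic's scoring scheme, in which a world's probability is proportional to exp(Σᵢ *w*ᵢ*n*ᵢ) with *n*ᵢ the number of satisfied groundings of formula *i*. The formulas and weights below are invented for illustration, each with a single grounding:

```python
import math

def score(world, weighted_formulas):
    # Markov-logic-style score: exp(sum of w_i * n_i(world)).
    # Violating a high-weight formula lowers the score, but never to zero,
    # so exceptional worlds remain possible, just less likely.
    return math.exp(sum(w * f(world) for w, f in weighted_formulas))

# Hypothetical weighted formulas over a world of predator/prey masses (kg).
formulas = [
    (2.0, lambda w: w["M_predator"] > w["M_prey"]),  # predators usually heavier
    (0.5, lambda w: w["same_location"]),             # weaker auxiliary rule
]

# A typical pairing versus an exception (a wolf preying on a moose).
typical = score({"M_predator": 80, "M_prey": 5, "same_location": True}, formulas)
exception = score({"M_predator": 40, "M_prey": 350, "same_location": True}, formulas)
```

The exception violates the heavily weighted mass rule, so its score is lower by a factor of exp(2.0), yet it is still strictly positive: weighted formulas penalize rather than forbid.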

The idea of weighted formulas is not new. Markov logic, invented a decade ago, allows logic formulas to be weighted [39, 13]. It supports algorithms to add weights to existing formulas given a data-set, to learn new formulas or revise existing ones, and to answer probabilistic queries. For example, Yoshikawa et al. used Markov logic to understand how events in a document are related in time [45]. Their research is a good case study of the interaction between traditional theory-making and artificial intelligence. The formulas they used as a starting point were well-established logic rules for understanding temporal expressions. From there, they used Markov logic to weight the rules, adding enough flexibility to their system to beat the best approach of the time. Brouard et al. [7] used Markov logic to understand gene regulatory networks, noting how the resulting model provided clear insights, in contrast to more traditional machine learning techniques. Expert systems can afford to make important sacrifices in flexibility in exchange for a simple representation. Yet a system capable of representing a large body of scientific knowledge will require a great deal of flexibility to accommodate various theories. While a step in the right direction, even Markov logic may not be powerful enough.

## 4 Case study: The niche model

To show some of the difficulties of representing scientific knowledge, we will build a small knowledge base for an established ecological theory: the niche model of trophic interactions [44]. The first iteration of the niche model posits that all species are described by a niche position *N* (their body size, for instance) in the [0, 1] interval, a diet *D* in the [0, *N*] interval, and a range *R* such that a species preys on all species with a niche in the [*D* − *R*/2, *D* + *R*/2] interval. We can represent these ideas with three formulas:

∀*x*, *y*: ¬*PreyOn*(*x*, *y*) (2a)

∀*x*: *D*(*x*) < *N*(*x*) (2b)

∀*x*, *y*: *PreyOn*(*x*, *y*) ⇔ *D*(*x*) − *R*(*x*)/2 < *N*(*y*) ∧ *N*(*y*) < *D*(*x*) + *R*(*x*)/2 (2c)
where ∀ reads *for all* and ⇔ is logical equivalence (it is true if and only if both sides of the operator have the same truth value, so for example *False* ⇔ *False* is true and *True* ⇔ *False* is false). As pure logic, this knowledge base makes little sense. Formula 2a is obviously not true all the time, but it is mostly true, since most pairs of species do not interact. We could also add that cannibalism is rare, ∀*x*: ¬*PreyOn*(*x*, *x*), and that predator-prey relationships are generally asymmetrical, ∀*x*, *y*: *PreyOn*(*x*, *y*) ⇒ ¬*PreyOn*(*y*, *x*). In hybrid probabilistic/logic approaches like Markov logic, these formulas would have a weight that essentially defines a marginal probability [13, 25]. Formulas that are often wrong are assigned a lower weight but can still provide useful information about the system. The second formula says that the diet is smaller than the niche value. The last formula is the niche model proper: species *x* preys on species *y* if and only if species *y*’s niche is within the diet interval of *x*.
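As a sketch of how the niche model's feeding rule could be evaluated on concrete species, here is a short Python version; the dictionary representation and the example trait values are assumptions for illustration, not part of the original model:

```python
# Each species is a dict with niche position N, diet centre D, and range R,
# all in [0, 1] as the niche model requires (values below are invented).
def prey_on(x, y):
    # x preys on y iff y's niche lies inside x's diet interval.
    return x["D"] - x["R"] / 2 <= y["N"] <= x["D"] + x["R"] / 2

def diet_below_niche(x):
    # The rule that a species' diet centre is below its niche position.
    return x["D"] < x["N"]

consumer = {"N": 0.8, "D": 0.4, "R": 0.2}    # diet interval: [0.3, 0.5]
resource = {"N": 0.45, "D": 0.1, "R": 0.05}  # niche 0.45 falls inside it

eats = prey_on(consumer, resource)
```

Here every predicate returns a hard true or false, which is exactly the limitation discussed below: the rules hold or fail with nothing in between.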

So far so good! Using Markov logic networks and a data-set, we could learn a weight for each formula in the knowledge base. This step alone is useful and provides insights into which formulas hold best. With the resulting weighted knowledge base, we could make probabilistic queries and even attempt to revise the theory automatically. We could find, for example, that the second rule does not apply to parasites or some other group and get a revised rule such as ∀*x*: ¬*Parasite*(*x*) ⇒ *D*(*x*) < *N*(*x*). However, Markov logic networks struggle when the predicates cannot easily return a simple true-or-false truth value. For example, let’s say we wanted to express the idea that when populations are small and have plenty of resources, they grow exponentially [29].

∀*x*, *l*, *t*: *SmallP*(*x*, *l*, *t*) ∧ *Resources*(*x*, *l*, *t*) ⇒ *P*(*x*, *l*, *t* + 1) = *G*(*x*) × *P*(*x*, *l*, *t*), (3)

where *P*(*x*, *l*, *t*) is the population size of species *x* in location *l* at time *t*, *G* is the rate of growth, *SmallP* is whether the species has a small population, and *Resources* whether it has resources available. The problem with hybrid probabilistic/logic approaches is that predicates do not capture this inherent vagueness well. We can establish an arbitrary cutoff for what a small population is, for example by saying that a population less than 10% of the average population size for the species is small. Similarly, resource availability is not binary; there is a world of grey between starvation and satiety. Perhaps worst of all, the prediction that *P*(*x*, *l*, *t* + 1) = *G*(*x*) × *P*(*x*, *l*, *t*) will almost certainly never be exactly true. If we predict 94 rabbits and observe 93, the formula is false. Weighted formulas help us understand *how often a rule is true*, but in the end the formula has to yield a binary truth value: true or false, with no room for vagueness.

Fuzzy sets and many-valued (“fuzzy”) logics were invented to handle vagueness [47, 24, 6, 3]. In practice, this simply means that predicates can return any value in the [0, 1] closed interval instead of only true and false. Fuzziness is used in both probabilistic soft logic [26, 1] and deep learning approaches to predicate logic [48, 23]. For our formula 3, *SmallP* could be defined as 1 − *P*(*x*, *l*, *t*)/*P*_{max}(*x*), where *P*_{max}(*x*) is the largest observed population size for the species. *Resources* could take into account how many prey are available, and *P*(*x*, *l*, *t* + 1) = *G*(*x*) × *P*(*x*, *l*, *t*) would return a truth value based on how close the observed population size is to the predicted population size. Fuzzy logic then defines how operators such as ∧ (“and”) and ⇒ behave with fuzzy values.
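A short Python sketch of such fuzzy predicates: the definition of *SmallP* follows the text, while the closeness measure for the growth equation, its tolerance, and the choice of the product t-norm for conjunction are assumptions made for illustration.

```python
def small_p(pop, pmax):
    # Degree to which the population is "small": 1 - P/Pmax, clipped to [0, 1].
    return max(0.0, 1.0 - pop / pmax)

def approx_equal(observed, predicted, tol=0.1):
    # Fuzzy equality: truth degrades with relative error,
    # reaching 0 once the error exceeds tol (here 10%).
    return max(0.0, 1.0 - abs(observed - predicted) / (tol * predicted))

def fuzzy_and(a, b):
    # Product t-norm, one common choice for fuzzy conjunction.
    return a * b

# 93 rabbits observed where growth predicted 94, out of a maximum of 1000:
# the formula is now "almost true" rather than simply false.
truth = fuzzy_and(small_p(93, 1000), approx_equal(93, 94))
```

The 94-versus-93 rabbit example from above now yields a truth value close to, but below, 1: exactly the graded judgment that binary predicates cannot express.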

Both Markov logic networks and probabilistic soft logic define a probability distribution over logic formulas, but what about the large number of probabilistic models? For example, the niche model has a probabilistic counterpart [43] in which *PreyOn*(*x*, *y*) is the probability that *x* preys on *y* rather than a hard truth value. Again, such a formula is problematic in Markov logic because we cannot easily force the equality into a binary true-or-false, but fuzziness can help model the nuance of probabilistic predictions.
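As an illustration, here is a Python sketch of a probabilistic feeding rule with a Gaussian shape, peaking when the prey's niche sits at the centre of the predator's diet interval; this particular parametrization is an assumption made for illustration, not necessarily the model of [43]:

```python
import math

def prey_on_prob(x, y):
    # Probability that x preys on y: 1 at the centre of x's diet
    # interval, decaying smoothly as y's niche moves away from it.
    # (Gaussian form assumed for illustration.)
    z = (y["N"] - x["D"]) / (x["R"] / 2)
    return math.exp(-z * z)

consumer = {"N": 0.8, "D": 0.4, "R": 0.2}
p_centre = prey_on_prob(consumer, {"N": 0.40})  # niche at the diet centre
p_edge = prey_on_prob(consumer, {"N": 0.50})    # niche at the interval edge
```

Instead of a binary interaction, the model now makes graded predictions, which fuzzy truth values can compare against observed interaction frequencies.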

## 5 Where’s our unreasonably effective paradigm?

Wigner’s *Unreasonable Effectiveness of Mathematics in the Natural Sciences* led to important discussions about the relationship between physics and its main representation [42, 21]. The Mizar Mathematical Library and the Coq library [33] host tens of thousands of mathematical propositions to help build and test new proofs. In complex domains with many variables, Halevy et al. argued for the *Unreasonable Effectiveness of Data* [19], noting that simple algorithms, when fed large amounts of data, would do wonders. High-dimensional problems like image imputation, where an algorithm has to fill in missing parts of an image, require hundreds of thousands of training images to be effective. Goodfellow et al. noted that roughly 10 000 data-points per possible label were necessary to train deep neural networks [17]. Neither extreme is satisfactory for fields like biology, where theories and principles are seldom exact: we cannot afford the pure logic-based knowledge representations favoured by mathematicians and physicists, and fitting a model to data is a different task than building a corpus of interconnected knowledge.

Fortunately, we do not need to choose between mathematical theories, probabilistic models, and learning. New inventions such as Markov logic networks and probabilistic soft logic are moving Artificial Intelligence toward rich representations capable of formalizing, and even extending, scientific theories. This is a great opportunity for synthesis. There are still problems: inference is often difficult in these rich representations. Recently, Garnelo et al. [15] designed a prototype to extract logic formulas from a deep learning system, while Hu et al. [23] created a framework to learn predicate logic rules using deep learning. Both studies used flexible fuzzy predicates and weighted formulas while exploiting deep learning’s ability to model complex distributions via composition. The end result is a set of clear and concise weighted formulas supported by deep learning for scalable inference. The potential for science is important. Not only do these lines of research allow deep learning to interact with traditional theories, they also open many exciting possibilities, like the creation of large databases of scientific knowledge. The only thing stopping us from building a unified corpus of, say, ecological knowledge is that conventional pure-logic systems are too inflexible: they do not allow imperfect, partially-true theories, which are fundamental to many sciences. Recent developments in Artificial Intelligence make these corpora of scientific knowledge possible for complex domains, allowing us to combine a traditional approach to theory with the power of Artificial Intelligence.

It is tempting to present deep learning as a threat to traditional theories. Yet, there is a real possibility that the union of Artificial Intelligence techniques with mathematical theories is not only possible, but would help the integration of knowledge across various disciplines. Otherwise, short of discovering a small set of elegant theories, what is our plan to combine ideas from ecosystem ecology, community ecology, population ecology, and evolution?

## 6 Acknowledgements

PDP has been funded by an Alexander Graham Bell Graduate Scholarship from the Natural Sciences and Engineering Research Council of Canada, an Azure for Research award from Microsoft, and benefited from the Hardware Donation Program from NVIDIA. DG is funded by the Canada Research Chair program and an NSERC Discovery grant. TP is funded by an NSERC Discovery grant and an FQRNT Nouveau Chercheur grant.

## Footnotes

email: philippe.d.proulx{at}gmail.com