Large scale function-based genome prospecting for industrial traits applied to 1,3-propanediol production

Due to the success of next-generation sequencing, there has been a vast build-up of microbial genomes in the public repositories. FAIR genome prospecting of this huge genomic potential for biotechnological benefiting, require new efficient and flexible methods. In this study, Semantic Web technologies are applied to develop a function-based genome mining approach that follows a knowledge and discovery in database (KDD) protocol. Focusing on the industrial important trait of 1,3-propanediol (1,3-PD) production 187 new candidate species were identified. Furthermore, the genetic architecture of the particular trait was resolved, and persistent domains identified.


Introduction
Next generation sequencing (NGS) technologies have turned the publicly available genome repositories into data-rich scientific resources that currently contain structural and functional genome information of hundred of thousands bacterial genomes (1) available for further exploitation for medical, environmental and biotechnological purposes. While the biotechnology field has embraced the use of omics technologies in biotech innovation in multiple ways, functional screening of genomic resources for industrially relevant traits at a large scale remains challenging. NGS data generation has caused manual curation to be outpaced rapidly and currently more than 99% of the functional predictions in UniProt are based on automated computational predictions (2). The accuracy of these computationally inferred functional genome annotations is largely unknown due to inconsistent naming of protein functions, use of annotation pipelines of various quality, over-prediction due to the use of generic, nonstandardised, annotation acceptance thresholds (3), and lack of data provenance in general. This results in a degree of interoperability lower than required for efficient function based genome prospecting (4). As the predicted gene and protein sequences, underlying the functional annotations are available in the standardised FASTA format, 'bottom-up' approaches that start with a gene or protein sequence in an effort to find a corresponding function in genome(s) of interest are normally used. However, over larger phylogenetic distances, strategies that use sequence-based clustering algorithms are hampered by accumulation of evolutionary signals, lateral gene transfer, gene fusion/fission events and domain expansions (5). Furthermore, when applied at larger scales, such strategies suffer from high computational time and memory requirements that scale quadratically with the number of genome sequences to be compared (6). A robust functionbased genome screening requires a sufficiently high degree of interoperability meaning that functional information can be directly compared on the basis of a pre-established syntactic interoperable genome annotation and that any computational functional prediction is directly linked to its provenance. To accomplish this, we have developed SAPP, a semantic annotation infrastructure supporting FAIR computational genomics (7). SAPP uses the Genome Biology Ontology Language (GBOL) as syntax (8) and automatically predicts, tracks and stores structural and functional genome predictions and associated data-and element-wise provenance in RDF, the W3C standard for representing linked data (9). SAPP uses protein domain architectures to systematically describe protein functions (10). Demonstrating a high level of scalability, this set up has already been successfully used in an integrated analysis of the functional landscape of 432 Pseudomonas strains (4). In this study, interoperable genome annotations are used in a systematic function-based in silico screening for bacterial species that potentially can convert the bio-refinery by-product glycerol in 1,3-propanediol (1,3-PD) (11,12). 1,3-PD is an important precursor of biomaterials and it is currently used as a monomer for novel polyester and biodegradable plastics, such as polytrimethylene terephthalate (11). 1,3-PD is a typical product of anaerobic glycerol fermentation and only a limited number of species, mostly enterobacteria, are known to form it. The first step, dehydration of glycerol to 3-hydroxypropionaldehyde (3-HPA) is mediated by a vitamin B12-dependent glycerol dehydratase although an oxygen sensitive, B12-independent alternative enzyme has also been reported (13). The second step reduces the toxic 3-HPA intermediate to 1,3-PD using the (NADH)+H+ dependent 1,3-propanediol-oxydoreductase (PDOR) regenerating NAD+ (12). For large scale genome prospecting a collection of 84,308 publicly available bacterial genomes were loaded in the SAPP semantic framework, systematically structurally and functionally annotated and subsequently mined for candidate 1,3-PD producers using a knowledge discovery in databases approach (KDD) (14). Overall, the systematic analysis increased our knowledge on the genetic architecture of this metabolic trait, in terms of the overall domain composition, domain order, distribution, persistence and essentiality. The approach suggested that, in addition to the some 30 producers reported in literature (table 1), at least 187 other genome sequenced species potentially have this trait.

Results
Development of the genome mining workflow. To cope with the influx of biological input data, we followed the Knowledge Discovery in Databases multi step process model (14), which was adjusted to the specific needs of being able to prospect genome data at a large scale. The KDD process includes multiple steps with the first step being to capture relevant prior knowledge. In the next step, a selection of data is used as a training set to develop a screening procedure followed by data mining at a large scale and finally the evaluation and interpretation of the results.
Incorporation of prior knowledge. In 1,3-PD production, two pathways, an oxidative and reductive pathway, are used in parallel for dissimilation of glycerol. In the oxidative pathway, glycerol is first dehydrogenated to dihydroxyacetone by a NAD+-linked glycerol dehydrogenase, and then converted to dihydroxyacetone phosphate by an ATP-dependent dihydroxyacetone kinase (15). Additionally, two alternative reductive pathways, one dependent on vitamin B12, the other not, are used for regenerating NAD+ with 1.3-PD as byproduct. For validation of the search strategy and to gain further insights into the genetics and the domains involved in 1,3-PD formation a list of known 1,3 PD producing strains was obtained for cross-validation. From literature some 30 strains have been reported to produce 1,3-PD (Table 1). For training purposes genome sequences from 13 different strains could be obtained from the EBI-ENA data warehouse (1) as not for all of them genome sequences were available. For Citrobacter freundii ATCC8090 two draft genome sequences of comparable quality were available and initially both were added to the training set increasing the total number of genomes.

Data transformation.
To obtain a high degree of interoperability the selected genome sequences were de novo structurally and functionally annotated using the Semantic annotation Platform with Provenance (SAPP) (7) using Prodigal (16) for gene prediction and InterProScan (17) for protein annotation. Protein domain architectures were used to systematically describe the encoded protein functions (10). Structural and functional genome annotations were subsequently transformed into Linked Data fragments using the Resource Description Framework (RDF) as a data-metadata model (7,18).
A function-based search strategy for the oxidative branch. The first two enzymes essential for the oxidative  Table 2).

Development of a function-based search strategy for the reductive branch.
For NAD+ regeneration and 1,3-PD production two alternative pathway are known, a well-studied B12-dependent pathway and a lesser studied oxygen sensitive B12-independent pathway (figure 1). Pattern recognition analysis that was used to obtain the DAK configuration was used to obtain the domain persistency and copy number of key domain classes in the two alternative pathways among the genomes of the training set. The iron containing aldehyde reductase/alcohol dehydrogenase domain (PF00465), is present in both the oxidative and reductive branch but functions in the reductive branch as an aldehyde reductase. The three dehydratase Pfam domains signifying the B12-dependent pathway, PF02287 (small subunit), PF02288 (medium subunit), and PF02286 (large subunit) have a domain class persistency of 0,6 with, when present, on average per genome 1,7 (large and small subunit) or 3,7 copies (medium subunit) in the training set. The diol dehydratase reactivase ATPase-like domain PF08841 has a persistency of 0,6 and a copy number of 1. Note that only four strains in the training data set were molecularly characterized. The degree of persistency of these key domains of the B12-dependent pathway in the training set therefore suggests that multiple 1,3-PDproducing strains of the training set may actually not use the B12-dependent pathway for 1,3 PD-production. Alternatively, a B12-independent glycerol dehydratase function is reported to be involved in 1,3-PD production (13) and consists of a glycine radical (PF01228) and a pyruvate formate lyase-like domain (PF02901). Both domains are present in all 13 genomes of the training data set with a relative high copy number of 6. The persistency of the radical SAM superfamily domain, a key part of the B12-independent glycerol dehydratase reactivase enzyme (19), was also 1 with an average copy number of 32 indicating that these three domains are ubiquitous and promiscuous and normally also used to support other functionalities ( Table 3). As the high copy number of persistent signifying domains of both the B12-dependent and B12-independent reductive branch suggests that they play roles in multiple processes, a Boolean multi-compound proximity search was developed for the reductive branch. In this approach, the physical co-localization of the persistent signifying domain classes in the respective genomes is included in the search. From the list of natural producers with a molecularly characterized 1,3-PD operon, the size of 1,3-PD operon was estimated to encompass approximately 18,000 bases. Furthermore, signifying domains could be found on both strands due to the use of internal promoters.
For the B12-independent pathway search criteria were based on prior knowledge obtained from a single characterized strain (13). Here four domains were specified (PF00465, PF01228, PF02901, PF04055) extending 20.000 base pairs up and downstream of the alternative dehydratase domains. Furthermore, taking internal promoters into account, domains were allowed to be present on both strands when genes are facing in the opposite direction in case of bidirectional promoter.

Data preparation.
To be able to mine a large set of publicly available genomes for the presence of the vitamin B12 variant 1,3-PD operon, 84,308 bacterial genomes containing 51 phyla, 64 classes, 145 orders, 335 families, 1,126 genera and 2,661 species were downloaded from the EBI-ENA data warehouse. The SAPP annotation platform was subsequently used for de novo structural and functional annotation of these genomes resulting in a semantic database of 300,196,266 predicted protein encoding genes linked to the corresponding protein sequences, predicted protein domain architectures and provenance.
Data Mining for candidate 1,3 PD producers. Two parallel pathways must be present to convert glycerol to 1,3-PD. Data mining for candidate 1,3-PD producers was therefore split in two steps. First, the search space was reduced by searching for the presence of a possible oxidative branch. In the second step, the reduced search space was searched for the presence of a possible reductive branch. As the molecular knowledge obtained for B12-independent pathway is based on the characterization of a single Clostridium butyricum strain, data mining for the reductive branch focused on finding candidate B12-dependent 1,3 PD producers.
Oxidative branch: Following the function-based search strategy outlined above and summarised in figure 1, a proximity search with the two DAK configurations in parallel resulted in a 41% reduction of the search space.  Table 6. For further analysis, strains were collapsed in species groups and each species group was treated as one. At species level, domain connectivity (figure 2) and persistency of the B12dependent 1,3-PD trait showed a high degree of similarity with the training set (Table 6).
Additional constraints. As the candidate species are depending on vitamin B12 for the production of 1,3-PD it could be beneficial to additionally select for the ability to synthesize vitamin B12. Secondly, the best known natural B12 dependent 1,3-PD producers are considered to be pathogenic or opportunistic pathogenic microorganisms such as Klebsiella pneumonia (20). Following the strategy outlined above a function-based screening was developed for the distribution of the trait for vitamin B12 synthesis among the candidate species. Using a multi-domain approach containing all domain signatures reuired for vitamine B12 sysnthesis a vitamin B12 (21) biosynthesis pathway was detected in 86 of the 187 candidate species using a multi-domain approach containing all domain signatures required for vitamin B12 synthesis. In order to obtain the bio-safety level of each candidate species, the BacDive database was queried (22). Forty-seven candidate species were classified at bio-safety level 2 and higher   and 48 species were considered to be non-pathogenic (biosafety level 1) while 92 species remained unclassified. The bio-safety level 1 label was combined with the ability for vitamin B12 synthesis with resulted in the following candidate species; Anaeromusa acidaminophila, Anaerosalibacter massiliensis, Blautia obeum, Clostridium drakei, Clostridium magnum, Geobacillus thermoglucosidasius, Intestinimonas butyriciproducens and Thermincola ferriacetica, (see Table  suppl S1).

Discussion
Bioprospecting entails the systematic search for economically valuable genetic and biochemical resources from nature (23). For genome prospecting, the in silico mining of sequenced genomes and metagenomes for new biotechnologically relevant proteins and enzymes is a continuously evolving field. Two approaches can be applied, a "top-down" approach that begin by searching for a function and is followed by identification of the corresponding gene(s) and a "bottom-up" approach that start with a gene of interest in an effort to find a corresponding function in genome(s) of interest. For bottom-up approaches many sequence similarity tools such as Blast (24) exist. Bottom-up studies usually start with a selection of (meta)genome sequences followed by sequence similarity-based clustering and selection of candidate sequences and re-annotation of interesting candidates thereby avoiding ambiguity related problems in current functional annotations. For bacterial species however, a priori, gene fusion-fission events can be expected (25) hampering sequence similarity-based detection and clustering of multidomain proteins encoding genes, while the same domains may also be present in multiple proteins. When searching for polygenic traits these problems are aggravated and thus a function-based approach searching for key functions may be more effective especially when there is insufficient understanding of the genetic architecture of trait in terms of the minimal number of genes required for the trait and function of the domains encoded by the different genes. In order to overcome ambiguity related problems in functional annotations, SAPP identifies and annotates protein function based on domain architecture. Protein domains have been shown to provide an accurate representation of the functional capabilities of a protein (10). To overcome ambiguity related problems in functional annotations SAPP identifies and annotates protein function based on domain architecture. As profile hidden Markov Models (HMM), which favour in their scoring functionally important sites, are used for the identification of protein domains, statistically robust annotation profiles can be obtained over large phylogenetic distances. KDD is a multi-step process. Crucia steps are data standardisation and feature extraction for pattern searching. By using protein domain architectures as proxy for protein functions, a high degree of standardisation is obtained. As protein domain architectures can be directly transformed in highly interoperable strings of Pfam domain identifiers, a "top-down" functional screening can be done efficiently.
Applying the KDD approach on Linked Data allows for validation of initial results in multiple ways and for iteration after modification. Driven by a strong need to be able to integrate and analyse biological datasets across databases, there has been a considerable increase in the adoption of Semantic Web technologies in the life-sciences (26). However, a SPARQL endpoint for phenotype data is only available via WikiData (27). Other resources for phenotype data such as BacDive, here used to obtain bio-safety levels of candidate species, do provide API's to mine their data but currently cannot be directly queried using SPARQL. With the growing importance of Semantic Web technologies for the life sciences interoperability levels on all aspect will increase, enhancing mining possibilities, aiding the discovery of new traits in an unprecedented pace. By using molecular knowledge obtained from molecular characterisations of the 1,3-PD operon of four species and validating and iterating the search pattern on a training data set of genome sequences of twelve known 1,3-PD producers, a large collection of 84,308 publicly available bacterial could be efficiently mined in a top-down approach yielding 187 new candidate species some of which could be confirmed by literature (28)(29)(30)(31)(32).

Conclusions
Through transformation of genomic data into a FAIR linkeddata format, iterative function-based approaches can be de-veloped to mine the large genome repositories. By presenting functional annotation as unambiguous protein domain architectures a high level of interoperability is obtained allowing for the development of efficient function-based top-down searches not limited to supervised trait identification.

Materials and methods
Data annotation and mining. Bacterial genomes were downloaded from the ENA database using the enaBrowser-Tools (1). All downloaded microorganisms were converted into a semantic repository using the annotation platform based on functional analysis (SAPP) (7). All genomes were de novo structurally re-annotated using Prodigal (16) and functionally annotated using Pfam from InterProScan (17). Each genome was stored as an individual graph database using the HDT library and were processed using a domain pattern recognition software that was developed to aid in the data processing (33,34). This application constructs operon-like structures around one or more protein domains from genomic data and discovers patterns in the present protein domains using a Frequent Pattern Mining and a Latent Dirichlet Allocation approach.
Pattern recognition. The application that was used for pattern recognition has been developed in Java 8 and uses Scala version 2.11. The annotated genomes are provided as RDF data in HDT format allowing to easily scale beyond the size limitations of current graph databases. Spark Core has been implemented to enable parallelization for the processing of the genomes. The Frequent Pattern Mining and Latent Dirichlet allocation algorithms from the Apache SPARK MLlib Machine Learning Library, are applied to analyse patterns in the domain architecture of operons Meng2016. The support value of a pattern with one domain is a measure for the local persistency of that specific domain. The support value is defined as the number of regions where the domain occurs divided by the total number of regions.
Oxidative branch identification. To identify strains containing the oxidative branch as shown in figure 1, DAK domains were processed using the domain pattern recognition software. Genes containing DAK1 (PF02733) or DAK1_2 (PF13684) in combination with DAK2, (PF02734), 95% gene confidence according to prodigal and neighbour linking was applied to identify strains capable for growth on glycerol.
Reductive branch identification. Strains that contained the oxidative branch were further investigated for the reductive branch using domains according to the B12 dependent or independent pathway. Region identification was performed using PF02287 and 20,000 base pairs up and downstream.
Regions containing all domains of interest were further investigated for domain occurrence and synteny.

Reductive branch identification.
Strains that contained the oxidative branch were further investigated for the reductive branch using domains according to the B12 dependent or independent pathway. Region identification was performed using PF02287 and 20,000 base pairs up and downstream. These regions were modified to only retain genes that form an operon-like structure. The biggest intergenic region allowed between two genes on the same strand was 50 nucleotides. To also include the possibility of a bidirectional promoter, a gap of 500 (or less) nucleotides is allowed between to stretches of genes on the reverse and forward strand. To prevent identical regions created around a gene which contains multiple copies of the PF02287 or different genes containing PF02287, regions that (partly) overlap are merged into one. The operonlike structure was preserved.
B12 synthesis identification. Phylogenetic information was complemented with phenotype information through Bac-Dive (22). The BacDive resource was parsed and each entry record was transformed into RDF allowing integration of genotype and phenotype information. SPARQL queries were used to retrieve information with regards to pathogenicity and temperature.