clinker & clustermap.js: Automatic generation of gene cluster comparison figures

Summary
Genes involved in biological pathways are often co-localised in gene clusters, the comparison of which can give valuable insights into their function and evolutionary history. However, comparison and visualisation of gene cluster homology is a tedious process, particularly when many clusters are being compared. Here, we present clinker, a Python-based tool, and clustermap.js, a companion JavaScript visualisation library, which used together can automatically generate accurate, interactive, publication-quality gene cluster comparison figures directly from sequence files.

Availability and Implementation
Source code and documentation for clinker and clustermap.js are available on GitHub (github.com/gamcil/clinker and github.com/gamcil/clustermap.js, respectively) under the MIT license. clinker can be installed directly from the Python Package Index via pip.

Contact
E-mail: cameron.gilchrist@research.uwa.edu.au, yitheng.chooi@uwa.edu.au


Genes involved in biological pathways are often co-localised in the genome in gene clusters, the study of which is important given their potential to encode traits such as the biosynthesis of bioactive small molecules, virulence and drug resistance. Gene clusters are typically conserved to some extent across taxa, and comparative analysis can reveal significant insights into their evolution and potential differences in the pathways they encode. For instance, the conservation (or lack thereof) of gene clusters encoding the biosynthesis of secondary metabolites can guide the search for new drug compounds and bioactivities (Gilchrist et al., 2018, Ziemert et al., 2016).

Comparative analysis of large genomic datasets has become increasingly common in the post-genomics era, with visualisations of gene cluster conservation across species often featured prominently in such work (de Vries et al., 2017, Nielsen et al., 2017, Doroghazi and Metcalf, 2013). Producing these visualisations is a tedious process, and as such has driven the development of numerous tools (e.g. EasyFig (Sullivan et al., 2011), Artemis Comparison Tool (Carver et al., 2005), GeneSpy (Garcia et al., 2019), Gene Graphics (Harrison et al., 2018)) and libraries (e.g. GenomeDiagram (Pritchard et al., 2006), genoPlotR (Guy et al., 2010)). However, these tools have drawbacks: most have limited interactivity, with clusters unable to be repositioned, reordered or resized without changing input files; visualisation options are inflexible, if not fixed; some require sequence comparison files to be generated externally using tools [...] Illustrator), or directly use the output generated by analysis tools such as MultiGeneBlast (Medema et al., 2013), cblaster (Gilchrist, 2020), or antiSMASH (Blin et al., 2019). This quickly becomes a daunting prospect when many clusters are to be compared. There is therefore a need for tools which produce clear, publication-quality visualisations, are more flexible and intuitive than existing tools, and require no significant effort on the part of the user other than providing sequence data.

Here we describe clinker, a Python-based gene cluster comparison pipeline, and clustermap.js, a companion visualisation JavaScript library, which can automatically generate interactive, to-scale gene cluster comparison figures directly from sequence files.

Gene cluster alignments using clinker

The clinker workflow is detailed in Figure 1a. clinker accepts GenBank files as input; multi-record GenBank [...] can be changed when directly using the clinker Python API. Any alignments not reaching the user-defined sequence identity threshold are discarded.

Optimal ordering of clusters for visualisation is determined through hierarchical clustering. First, a cluster similarity score is calculated for every pair of input clusters. clinker uses a modified version of the formula implemented by Medema et al. (2013) in MultiGeneBlast, which incorporates both sequence homology and syntenic conservation. Briefly, the similarity of two clusters is given by S = h + i · s, where h is the cumulative sequence identity of homologous sequences in each cluster, s is the total number of shared contiguous sequence pairs, and i is a weighting factor which determines the weight of synteny compared to sequence homology (0.5 by default).
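The score S = h + i · s can be illustrated with a minimal Python sketch. This is not clinker's own code: the function name, the representation of links as (gene index in cluster A, gene index in cluster B, identity) tuples, and the reading of "shared contiguous sequence pairs" as pairs of links whose genes are adjacent in both clusters are all assumptions made for illustration.

```python
from itertools import combinations


def cluster_similarity(links, synteny_weight=0.5):
    """Sketch of S = h + i * s for one pair of clusters.

    links: iterable of (index_in_a, index_in_b, identity) tuples, one per
    homologous gene pair, with identity in [0, 1] (hypothetical layout).
    """
    links = list(links)

    # h: cumulative sequence identity over all homologous pairs.
    h = sum(identity for _, _, identity in links)

    # s: number of link pairs whose genes are neighbours in both clusters
    # (an assumed interpretation of "shared contiguous sequence pairs").
    s = sum(
        1
        for (a1, b1, _), (a2, b2, _) in combinations(links, 2)
        if abs(a1 - a2) == 1 and abs(b1 - b2) == 1
    )

    return h + synteny_weight * s


example_links = [(0, 0, 0.9), (1, 1, 0.8), (3, 5, 0.5)]
score = cluster_similarity(example_links)
```

In this example the first two links are contiguous in both clusters, so they contribute one syntenic pair in addition to their identities; the third link contributes only its identity.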

The resulting similarity matrix is then hierarchically clustered using the Ward variance minimization algorithm as implemented in the SciPy package (Virtanen et al., 2020). The leaves of the dendrogram generated in this step are used as the default order of clusters in the clustermap.js visualisation.
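The ordering step can be sketched with SciPy's hierarchical clustering routines. The sketch below is an illustration, not clinker's implementation: the function name is hypothetical, and converting similarities to distances by subtracting from the matrix maximum is an assumption, since the text does not state how the conversion is done. SciPy notes that Ward linkage is strictly defined for Euclidean distances, so the result here should be read as a heuristic ordering.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform


def order_clusters(similarity):
    """Return a cluster order from a symmetric similarity matrix."""
    sim = np.asarray(similarity, dtype=float)

    # Assumed conversion: higher similarity -> smaller distance.
    dist = sim.max() - sim
    np.fill_diagonal(dist, 0.0)

    # linkage() expects a condensed (upper-triangle) distance vector.
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="ward")

    # Dendrogram leaves, left to right, give the display order.
    return leaves_list(tree).tolist()


similarity = [
    [5, 4, 1, 1],
    [4, 5, 1, 1],
    [1, 1, 5, 4],
    [1, 1, 4, 5],
]
order = order_clusters(similarity)
```

With this toy matrix, clusters 0 and 1 end up adjacent in the returned order, as do clusters 2 and 3.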

Interactive visualisations using clustermap.js

Cluster alignments generated using clinker are visualised using clustermap.js (Figure 1b). Though designed in conjunction with clinker, clustermap.js is a standalone library and can take data generated elsewhere as long as it abides by the correct schema.

Clusters are drawn to scale based on a user-definable scaling factor (15 pixels per 1000 base pairs by default).
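As a small worked example of the scaling arithmetic, the following Python sketch converts genomic coordinates to pixel positions. The function names and the (start, end) gene representation are hypothetical and do not reflect the clustermap.js API; only the default of 15 pixels per 1000 base pairs comes from the text.

```python
def bp_to_pixels(length_bp, pixels_per_kb=15.0):
    """Convert a length in base pairs to pixels at the given scale."""
    return length_bp * pixels_per_kb / 1000.0


def gene_pixel_span(start_bp, end_bp, locus_start_bp=0, pixels_per_kb=15.0):
    """Pixel (x, width) of a gene relative to the start of its locus."""
    x = bp_to_pixels(start_bp - locus_start_bp, pixels_per_kb)
    width = bp_to_pixels(end_bp - start_bp, pixels_per_kb)
    return x, width


# A 2,000 bp gene starting 1,000 bp into its locus.
x, width = gene_pixel_span(1000, 3000)
```

At the default scale this gene would start 15 pixels from the locus origin and span 30 pixels.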

Clusters can be renamed by clicking their labels and reordered by dragging them, enabling comparisons between clusters outside of the computed optimal ordering.

clustermap.js is capable of displaying multi-locus clusters. For example, the biosynthetic gene cluster for the burnettramic acids (Li et al., 2019) is split over three contigs due to fragmented genome assembly, but is readily visualised by our tool (top-most cluster in Figure 1b). Hovering over a locus displays a grey box with handles at both ends; loci can be freely repositioned by clicking and dragging the box, inverted by double clicking it, or resized (hiding genes) by dragging the handles. Any changes in the visible area of specific loci are reflected by the coordinates in the cluster label. Genes are drawn as arrows, whose shape is easily altered using sliders for body height, tip height and length. The visualisation can be anchored around a specific gene by clicking on it: clustermap.js will identify the closest homologous genes in other clusters and automatically reposition them to align with the clicked gene. Gene labels can be resized, repositioned or hidden entirely.

Links are drawn between homologous genes on neighbouring clusters and are shaded based on sequence identity (0% white, 100% black). By default, all links above the identity threshold in clinker are drawn; this threshold can be raised within the clustermap.js visualisation to dynamically hide links. Alternatively, one can choose to show only the highest scoring links between clusters, which is useful when multiple similar genes exist within a cluster, resulting in many overlapping links. Groups of homologous genes are established via single linkage of gene links. Each homology group is assigned a unique colour, which is used as the fill of both the genes in the group [...] (Medema et al., 2013) or cblaster (Gilchrist, 2020).
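The single-linkage grouping of gene links and the identity-based shading described above can be sketched in Python as follows. This is an illustrative sketch rather than the clustermap.js implementation (which is written in JavaScript): the function names, the (gene, gene, identity) link tuples, the union-find approach and the linear white-to-black interpolation are all assumptions.

```python
def homology_groups(links, threshold):
    """Group genes by single linkage over links at or above the threshold.

    links: iterable of (gene_a, gene_b, identity) tuples (hypothetical layout).
    Returns a sorted list of sorted gene-name lists, one per group.
    """
    parent = {}

    def find(gene):
        # Union-find lookup with path halving.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for gene_a, gene_b, identity in links:
        if identity < threshold:
            continue  # hidden links do not join groups in this sketch
        root_a, root_b = find(gene_a), find(gene_b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in groups.values())


def link_shade(identity):
    """Grey hex colour for a link: 0.0 -> white, 1.0 -> black (assumed linear)."""
    level = round(255 * (1.0 - identity))
    return "#{0:02x}{0:02x}{0:02x}".format(level)


links = [
    ("a1", "b1", 0.9),
    ("b1", "c1", 0.8),
    ("a2", "b2", 0.7),
    ("a3", "b3", 0.1),
]
groups = homology_groups(links, threshold=0.5)
```

Because grouping is by single linkage, a1 and c1 fall into the same group through b1 even though no direct link between them is listed; each resulting group would then be assigned its own fill colour.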

In conclusion, clinker and clustermap.js enable the easy creation of publication quality gene cluster com-