Empirical study on software and process quality in bioinformatics tools

Software quality in computational tools impacts research output in a variety of scientific disciplines. Biology is one of these fields: such tools play an important role, especially in the analysis of High Throughput Sequencing (HTS) data. This study therefore characterises the overall quality of a selection of tools which are frequently part of HTS pipelines, and analyses the maintainability and process quality of a selection of HTS alignment tools. Our findings highlight the most pressing issues and point to software engineering best practices developed for the improvement of maintenance and process quality. To help future research, we share the tooling for the static code analysis with SonarCloud, which we used to collect data on the maintainability of different alignment tools. The results of the analysis show that the maintainability level is generally high but trends towards increasing technical debt over time. We also observed that the development activities on alignment tools are generally driven by very few developers and do not utilise modern tooling to their advantage. Based on these observations, we recommend actions to improve both maintainability and process quality in open source alignment tools. These actions include improvements in tooling, such as the use of linters, as well as better documentation of architecture and features. We encourage developers to use these tools in order to ease future maintenance efforts, improve user experience, support reproducibility, and ultimately increase the quality of research through increasing the quality of research software tools.

Biology and medicine have seen a data-driven transformation with the advent of high-throughput experimentation, in particular high-throughput sequencing with tens of thousands of sequenced genomes. Computational tools integrate advanced algorithms, data structures, and state-of-the-art methods from statistics and machine learning. However, the developers and users of research tools are generally experts in their respective fields and have little background in software engineering [1]. This hinders and

The main data source for RQ1 (characterization of tools) was the available documentation, manuals and GitHub repositories of the selected tools. Additionally, a test installation and one or several test runs were performed using publicly available DNA-seq (SRA: SRX3122951) and RNA-seq (SRA: SRP130955) data on Ubuntu 16.04. The software tools were characterized using 7 criteria selected from the literature [4, 7-9, 14, 36-38] and focusing on areas important to the everyday work of users of these tools. The criteria are: 1) integration into the pipeline, 2) maintenance, 3) support, 4) usability, 5) documentation, 6) installability, and 7) dependency management. For each criterion, a 3-level quality scale was defined (Table 2) and applied following tests and review of the documentation.
Table 2. Definitions of the different levels for the 7 characteristics of HTS tools used in this analysis

pipeline integration
  1  can run directly using any input and output files of standard format
  2  requires specific input file name (e.g. *_1.fq and *_2.fq)
  3  requires non-standard file format or non-standard changes within the input file
maintenance
  1  latest commit no more than 1 year before this analysis (2020)
  2  more than 1, but less than 5 years inactivity before this analysis
  3  more than 5 years inactivity
support
  1  has its own active GitHub issue / forum / bug report page
  2  documentation about frequent errors and / or email support, but inactive issue page
  3  user needs to rely on external forums or resources
usability
  1  has a graphical user interface in Galaxy or standalone, test finished
  2  only command line interface AND running requires 1 command for a single task, OR test crashed with useful error messages
  3  only command line interface AND running requires at least 2 commands for a single task, OR test crashed with hard to understand or missing error messages
documentation
  1  sufficient and necessary information for usage, easy to navigate
  2  short, not well organized or hard to read
  3  too short for effective usage, requires additional external resources

is investigated by the bioinformatician before proceeding to subsequent, application-specific data analysis. The analysis process relies on multiple tools and frequent quality checks which drive the subsequent steps, such as the abortion of the protocol or adjustment in the parameters of the tools.
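
To illustrate how a level from Table 2 can be assigned mechanically, the maintenance criterion could be sketched as follows. This is a minimal sketch and not part of the tooling used in this study. The function name is hypothetical, and the treatment of exactly 5 years of inactivity (which Table 2 leaves open) is an assumption.

```python
from datetime import date


def maintenance_score(last_commit: date, analysis_date: date) -> int:
    """Return the Table 2 maintenance level (1 = best, 3 = worst)."""
    years_inactive = (analysis_date - last_commit).days / 365.25
    if years_inactive <= 1:
        return 1  # latest commit no more than 1 year before the analysis
    if years_inactive < 5:
        return 2  # more than 1, but less than 5 years of inactivity
    # Assumption: exactly 5 years is treated like "more than 5 years".
    return 3


if __name__ == "__main__":
    print(maintenance_score(date(2019, 6, 1), date(2020, 1, 1)))
```

The other criteria in Table 2 depend on manual testing and review, so they do not reduce to a rule of this kind.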

For RQ2, we decided to focus on a single step in the investigated HTS pipelines. We chose the mapping step, as several mappers have been developed and are utilized in the majority of HTS applications. Most mappers are written in C or C++, allowing for a fair code-level comparison. We aimed to include several widely used open source mappers [39]. This selection resulted in a total of 13 different mappers being analysed, all with their code available on GitHub. Table 3 lists the selected mappers and provides a link to each project's source code.

Maintainability is one of the software qualities listed in ISO/IEC 25010 [40]. In the model it is constructed from modularity, reusability, analysability, modifiability, and testability [40]. According to Riaz et al., it can be summarised as "the ease with which a software system can be modified" [41]. Maintainability is an indicator of the amount of work needed to understand, reuse, and refactor a software component or project [42].

Making changes or fixing bugs takes less effort in a project with good maintainability [42]. Maintainability is directly affected by the quality of the applied development process. Sing and Gautam describe the connection between the two, especially how the four development activities (i.e., requirements, design, coding, and testing [43]) included in any type of development process (e.g. waterfall, scrum) can impact the maintainability of the end product [44]. Table 4 shows which requirement characteristics, design attributes, coding factors, and testing parameters can be utilised to improve maintainability.

We used static code analysis to obtain the necessary measures for the maintainability evaluation of HTS mappers. According to Riaz et al. [41], a multitude of different metrics is used in the existing research to quantify the maintainability of software projects. In this study, maintainability is inspected based on code smells and technical debt. Code smells describe the presence of bad programming and bad code design in a software artefact [45]. Code smells can cause maintainability issues, decrease comprehensibility, and cause concern to professional developers [46].

Bitbucket repositories. These properties make SonarCloud appropriate for this research and for future usage within the bioinformatics community.

We used manual analysis of SonarCloud, which is straightforward and can easily be repeated on a multitude of different tools. To enable reproducibility, we decided to provide a Docker image for the analysis. All dependencies for the project build and the manual analysis are contained in the image and the commands for analysis are also provided. The Docker image for replicating the analysis is publicly available on GitHub (https://github.com/konradotto/sonar-analysis). This image has also been published on DockerHub (https://hub.docker.com/r/konradotto/sonar-analysis) and can therefore be fetched from there directly as konradotto/sonar-analysis.

As our analysis focuses on maintainability, we selected the following measures

to compare projects based on their inherent number of code smells and technical debt. The debt ratio and maintainability rating are directly related and are already relative measures that take the project size into account. The maintainability rating allocates letter grades (in descending order: A, B, C, D, E) to projects based on their debt ratio. Finally, the number of files and the number of functions provide superficial but valuable insight into the modularity and design-level differences between the projects.

These measures have been chosen due to their perceived significance for the quality of the development process. Their importance is supported by the models that have been designed to predict "socio-technical issues" in [54] and [55]. The term "bus count" indicates how many developers have in-depth knowledge of a project, system or component (e.g. in a distributed system); it is the minimum number of developers that would have to suddenly disappear to endanger a smooth continuation. Tags are defined as snapshots in time. They are closely related to releases and can be used to follow the development of the code over time. The high number of commits per tagged release for some of the other projects can be seen as a sign of missing direction in the development or underutilisation of releases. In either case, this means that development and improvements are not sufficiently broken down into small increments (i.e. releases) that are communicated to the users.
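
The relation between technical debt, debt ratio and maintainability rating mentioned in this section can be made concrete with a short sketch. It assumes the default SonarQube parameters as we understand them: a development cost of 30 minutes per line of code, and grade boundaries at debt ratios of 5 %, 10 %, 20 % and 50 %. Neither value is stated in this section, so both should be checked against the SonarCloud configuration actually used. The function names are hypothetical.

```python
# Assumed SonarQube default: cost to develop one line of code, in minutes.
MINUTES_PER_LINE = 30

# Assumed default upper bounds of the debt ratio for grades A to D.
GRADE_LIMITS = ((0.05, "A"), (0.10, "B"), (0.20, "C"), (0.50, "D"))


def debt_ratio(debt_minutes: float, lines_of_code: int) -> float:
    """Technical debt relative to the estimated cost of writing the code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return debt_minutes / (MINUTES_PER_LINE * lines_of_code)


def maintainability_rating(ratio: float) -> str:
    """Map a debt ratio to a letter grade (A is best, E is worst)."""
    for limit, grade in GRADE_LIMITS:
        if ratio <= limit:
            return grade
    return "E"


if __name__ == "__main__":
    # Hypothetical project: 690 minutes of debt in 1000 lines of code.
    ratio = debt_ratio(690, 1000)
    print(f"{ratio:.1%}", maintainability_rating(ratio))
```

Because the ratio divides by project size, two projects with very different absolute debt can still receive the same grade.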

Characterisation of HTS tools

We summarised our findings in Table 5. The quality of the evaluated tools differed across categories. In line with the findings of Mangul et al. [4], the majority of the tools performed poorly on the scales of installability and dependency management, but most achieved good scores in several other categories.

Integration of tools into workflows relies on scripts written by the users. For example, running the same steps for several inputs, setting file structure and naming conventions, parallelisation, and downloading database dependencies are rarely included in the tools. Workflow managers such as Snakemake [56] or Nextflow [57] support some of these tasks. However, we noted some incompatibilities between tools, and resolving them requires in-depth knowledge of the tools. One example is the presence (chrN) or absence (N) of chromosome prefixes in input/output files, i.e., compatibility with UCSC or Ensembl style reference genomes. Similarly, some tools are tailored to a specific subsequent tool, such as Trim Galore! being compatible with Bowtie1 by performing an additional base trimming from the input sequences [23]. This extra step is not necessary for compatibility with other tools. We also noted that some steps can be omitted from the pipeline. For example, the BWA and Bowtie2 tools provide soft and hard clipping, thus removing the need to rely on the trimming of reads. However, in some applications this step is recommended.
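
The chromosome prefix incompatibility mentioned above is typical of the glue code users end up writing. The following is a minimal sketch of such a conversion, not taken from any of the analysed tools. The function names are hypothetical, and the sketch only covers primary chromosomes and the mitochondrial genome (chrM in UCSC style, MT in Ensembl style). Unplaced scaffolds and alternative contigs use different naming schemes and would need a lookup table.

```python
def to_ucsc(name: str) -> str:
    """Convert an Ensembl style chromosome name (N) to UCSC style (chrN)."""
    if name.startswith("chr"):
        return name  # already UCSC style
    if name == "MT":
        return "chrM"  # the mitochondrial genome is named differently
    return "chr" + name


def to_ensembl(name: str) -> str:
    """Convert a UCSC style chromosome name (chrN) to Ensembl style (N)."""
    if not name.startswith("chr"):
        return name  # already Ensembl style
    if name == "chrM":
        return "MT"
    return name[len("chr"):]


if __name__ == "__main__":
    for chrom in ("1", "X", "MT"):
        print(chrom, "->", to_ucsc(chrom))
```

In practice the conversion has to be applied consistently to the reference genome, the annotation and every intermediate file, which is why such mismatches tend to surface late in a pipeline.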

Most of the investigated tools were actively maintained at the time of the analysis (2020), with the exception of PrinSeq [22]. To investigate trends in the maintenance of tools, we performed an in-depth analysis of short sequence mappers in RQ3. Most tools are supported on their own website or on other platforms visited by the creators. We noted that a complete lack of support was rare within the bioinformatics community. Platforms such as Biostars [58] and StackExchange [59] host extensive community knowledge for troubleshooting.

When considering usability, it is important to note that most tools are only available with command line interfaces, limiting their usage to bioinformaticians or scientists with knowledge of bash scripting. We argue that investing in a graphical user interface would increase the learnability of these tools and (potentially at the expense of a reduced number of available settings) would enable more researchers to utilize them in their work. One such example is Galaxy, which hosts a collection of independently developed tools and provides a graphical interface to them [60]. Indeed, it enables wider usage of the tools and independence from bioinformatics support.

Additionally, we found poor error management in several of the tools, which further limits their usage to expert bioinformaticians. For example, we tested Annovar with

and ApplyRecalibration, or RealignerTargetCreator and IndelRealigner depend on each other in a linear order and thus can be merged into one function for the user.

We observed various types of documentation. We found documentation following the guidelines of Lee [12]. For example, the documentation of SAMtools includes

Table 5. Scores of the analysed HTS tools on the scale defined in Table 2. 1: highest quality, 2: medium quality, 3: lowest quality

Tool      Integration  Maintenance  Support  Usability  Documentation  Installability  Dependency management
FastQC    1            1            1        1          1              1               3
Qualimap  1            2            1        1          1              3               3
RseQC     1            1            3        1          2              3               3
TopHat    2            2            2        1          2              3               3
Bowtie    3            1            1        1          1              3               3
Bowtie 2  ?            1            1        1          1              3               3
BWA       1            2            3        1          2              1               3
SAMtools  1            1            1        3          2              1               1
Picard    3            1            1        3          2              3               3
GATK      3            1            1        3          2              3               3
BedTools  1            1            1        1          1              2               1
HTSeq     1            1            1        1          1              3               3
Annovar   3            1            2        3          2              2               1
IGV       1            1            1        1          1              1               1

The low mark (score 3) for dependency management was mainly due to the tools' dependencies on external software which is not included in their release. For example, TrimGalore! is a wrapper around Cutadapt and FastQC, but its installation does not include these two tools. This means that additional time and effort is required from the user to install dependencies separately from a third party website. In some cases, such as TopHat, we found the requirement that another tool should be in the PATH, requiring additional steps.

In line with the findings of Mangul et al. [4], the installation of the investigated tools took longer than expected due to the lack of information or the additional installation of dependencies from third party websites. Furthermore, several of the tools require root privileges (e.g., STAR and GATK), which hinders the fast exploration and development of applications, as clusters or even personal working computers might be managed by IT personnel.

March 10, 2022

RQ1:
We observed the following points about the characterization of tools:
• Most investigated tools have good documentation and are maintained at the time of analysis
• Several tools have score 3 (poor quality) on support, workflow integration, and usability. These shortcomings require the end user to spend additional time and to write glue code
• Most investigated tools have score 3 (poor quality) on installation and dependency management due to the need for external help or third party tools. We expect this issue to be minimized with the usage of Bioconda

Table 6 summarises the results of the static code analysis for all mappers that we

MOSAIK had no tags at all. Therefore, the latest commit was analysed for these three projects instead.

The results for the number of files, number of functions, number of code smells, and technical debt are all absolute. Therefore, it appears obvious that more code would

Table 7 contains the goodness of fit (R²) and the coefficients of the estimated line (slope and intercept) for each of these measurements. The high values for goodness of fit (close to 1.0) confirm that both the number of

Table 6; per 1000 LOC MEGAHIT has the fewest code smells and lowest technical debt while Bowtie has the highest values for both.

lines of code

• Most mappers score the best possible category in the SonarSource maintainability rating
• The technical debt of mappers is independent of the project size (LOC)
• Despite low overall debt ratios, the factor of 2.4 between lowest and highest ratio is significant for absolute technical debt

Since our observations about the process quality are entirely based on publicly available artefacts of the development process, this analysis has to be limited to tools for which such artefacts are available. This means that tools that provide their source code but do not have a publicly available version control system (i.e., an accessible git history) could not be analysed for process quality. This concerns older versions of the tools from Table 3 which were only hosted on SourceForge before moving to GitHub. The results of this analysis are summarised in Table 8. It shows the observed values for the different process quality factors described in the Development activity Section.

An important result regarding the process quality is the ratio of contribution per author to the various repositories (Fig. 3). The contributors are anonymised with a capital letter and an ordinal number (e.g., A1, A2, T1).

issues. This is a significant risk that should be considered when choosing any tool.
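
The dependency on a few contributors discussed above can be quantified from the commit counts per author. The following is a minimal sketch and not the procedure used to produce Table 8. It assumes that commit counts are an acceptable proxy for contribution, and it approximates the bus count as the smallest number of top contributors whose combined share of commits exceeds a threshold. The 50 % threshold, the function names and the example numbers are assumptions for illustration.

```python
def contribution_shares(commits_by_author: dict[str, int]) -> dict[str, float]:
    """Return each author's fraction of all commits."""
    total = sum(commits_by_author.values())
    if total <= 0:
        raise ValueError("at least one commit is required")
    return {author: n / total for author, n in commits_by_author.items()}


def bus_count(commits_by_author: dict[str, int], threshold: float = 0.5) -> int:
    """Smallest number of top contributors covering more than `threshold`
    of all commits (assumed proxy for the bus count)."""
    shares = sorted(contribution_shares(commits_by_author).values(), reverse=True)
    covered = 0.0
    for count, share in enumerate(shares, start=1):
        covered += share
        if covered > threshold:
            return count
    return len(shares)


if __name__ == "__main__":
    # Hypothetical repository using the anonymised labelling scheme.
    example = {"A1": 80, "A2": 15, "A3": 5}
    print(contribution_shares(example), bus_count(example))
```

A result of 1 means that a single contributor accounts for the majority of the recorded work.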

Another interesting result in Table 8

The data show more continuity in the work on STAR (Fig. 5b)

Since software releases and tags are a main condition for the planned study of evolving maintainability, only the 4 projects with sufficient tags (# of tags ≥ 20) were candidates for this part of the study. Given their low technical debt ratios (see Table 6) and the similarity in their observed process parameters (see Table 8), we decided to once more analyse MEGAHIT, STAR, Bowtie2 and Salmon for this research question. However, we encountered build issues during the data collection for Salmon and thus did not include it. The results of the static code analysis of the differently tagged versions of these

overall debt ratio of the project was significantly decreased with the changes.

In the remaining analysed releases, the LOC were generally slightly reduced and so were the code smells. The debt ratio has another small jump at version v1.

RQ3:
There is no single pattern in which the maintainability of mappers evolves.
We have observed a mapper that kept growing its code base and increased the technical debt in the process (STAR). We have observed a mapper that has not evolved much since its code became available on GitHub (Bowtie2). And we have observed a mapper that increased its code base at distinct times and decreased its technical debt through heavy refactoring later in the project (MEGAHIT).

We found that installation and dependency management of several bioinformatics tools are poor, which should motivate the community to use conda [64] or other

Based on these findings, we think most effort should be focused on improving the error management of the tools by integrating solutions for the most common issues discussed on issue pages and independent forums. We also suggest improving the user experience of documentation based on the guidelines of [12]. As, ultimately, the learnability of these tools is hindered by these shortcomings, we suggest the inclusion of bioinformatics students in the maintenance process in an iterative fashion. We

MEGAHIT (seeing the gaps in activity shown in Figure 5a), this debt is more significant than the 2.3 % debt ratio reveals. Additionally, with an effective bus count of 1, the project is fully dependent on that single developer and their continued support.

The discussion on the maintainability evolution is focused on the collected data and therefore limited to versions available on GitHub

Most of the analysed mappers are not employing either of these practices. A few (e.g. Salmon) have a test suite but the coverage is usually very limited. Especially the first two bullet points can be achieved with little effort for the expected reward. However, the one measure that our analysis shows to be very effective is refactoring. MEGAHIT is the mapper with the lowest debt ratio in Table 6, and this is not because its debt ratio was low throughout the development.
This first place was achieved through extensive refactoring between the two versions. Between these versions the technical debt was reduced by 6 days while almost 9000 LOC were added. Had those 9000 lines been added without an improved debt ratio, the technical debt would have been 25 days higher than it actually was with the refactoring. This shows that a dedicated refactoring effort, like the work done on MEGAHIT over a period of 5 months, is realistic in bioinformatics tools and can have a significant impact on the maintainability of a project. In their systematic review of research on code smells and refactoring, Lacerda et al. also come to the conclusion that refactoring should be the first measure when attempting to reduce technical debt [67].

Improving the Development Process

The origin of the term "bus count" suggests that a high number of active developers should be an ambition of any open source software tool. Having more active developers who are familiar with the project not only means that the project can be continued even if one of them stops working on it. We also expected that projects with more active developers would respond faster to reported issues. This assumption was, however, not confirmed by the data. Neither the ratio of open issues nor the average time to close them was found to be generally better for projects with 3 or 4 major contributors compared to those with only 1 or 2. These values differed on a case-by-case basis.

A further advantage of a larger number of involved contributors is that it allows for the implementation of pull requests as a means of peer reviewing code. According to Silva et al., pull requests can be a means of reducing technical debt continuously throughout the development process if used in the right way [68]. The low number of major contributors combined with the additional effort required for proper code reviews, however, makes this an unrealistic solution in the given scenario. An automated approach without the human factor is recommended instead. We therefore recommend adding the following tools and steps to the development process:

when an incompatible API change is made, MINOR increments when a new feature is added which is backward compatible with previous changes, and PATCH increments when a bug is fixed in a backward compatible way. Continuous integration is the process of iteratively adding new code to a working code base, while making sure that no new code is causing breaking changes.
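
The versioning rule described above follows the semantic versioning scheme, in which a version is written as MAJOR.MINOR.PATCH. A minimal sketch of the increment logic is given below. The class, the function names and the three change labels are hypothetical and chosen for illustration only. Pre-release and build metadata suffixes are not handled.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def parse(text: str) -> Version:
    """Parse 'MAJOR.MINOR.PATCH', tolerating a leading 'v' as used in git tags."""
    major, minor, patch = text.removeprefix("v").split(".")
    return Version(int(major), int(minor), int(patch))


def bump(version: Version, change: str) -> Version:
    """Return the next version for a 'breaking', 'feature' or 'fix' change."""
    if change == "breaking":  # incompatible API change
        return Version(version.major + 1, 0, 0)
    if change == "feature":  # backward compatible new functionality
        return Version(version.major, version.minor + 1, 0)
    if change == "fix":  # backward compatible bug fix
        return Version(version.major, version.minor, version.patch + 1)
    raise ValueError(f"unknown change type: {change}")


if __name__ == "__main__":
    print(bump(parse("v1.4.2"), "feature"))
```

Lower-order components are reset to zero whenever a higher-order component is incremented, so users can judge compatibility from the version number alone.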

These are some easy steps that can be applied even to projects run by a single developer. They will help reduce technical debt, keep the users and co-developers updated about changes and compatibility between releases, and prevent publishing of changes that break tests or build procedures.

The data collected in this research show that bioinformatics mappers are generally at a good software quality and maintainability level. However, code quality shows a trend of degradation over time, which can be reversed with a conscious refactoring effort. The development of the investigated software is usually driven by very few major developers, creating a strong dependency on those developers' commitment to the projects. This not only results in varying success in handling issues in the code base, but also hinders refactoring efforts. We therefore recommend a set of practices that can easily be implemented even in projects with a single major contributor and should help to steadily and permanently improve the maintainability of open source mappers and other tools for scientific computation.

With the continuous development of scientific software, we would like to see further research into the implementation and effects of the recommended improvements. The tooling we provide makes the collection of future data on the subject very easy and we hope it can be used to assess the future development of the analysed mappers.