Abstract
With Next Generation Sequencing (NGS) data coming of age and being routinely used, evolutionary biology is transforming into a data-driven science.
As a consequence, researchers have to rely on a growing number of increasingly complex software tools. All widely used tools in our field have grown considerably, both in the number of features and in lines of code. In addition, analysis pipelines now include substantially more components than they did 5-10 years ago.
A topic that has received little attention in this context is the code quality of these widely used tools. Unfortunately, the majority of users tend to blindly trust software and the results it produces. We therefore assessed the code quality of 15 highly cited tools (e.g., MrBayes, MAFFT, SweepFinder) from the broader area of evolutionary biology that are used in current data analysis pipelines.
We also discuss little-known problems associated with floating-point arithmetic for representing real numbers on computer systems. Since the software quality of the tools we analyzed is rather mediocre, we provide a list of best practices for improving the quality of existing tools, and also list techniques that can be deployed for developing reliable, high-quality scientific software from scratch.
Finally, we discuss journal and science policy as well as funding issues that need to be addressed to improve software quality and to ensure support for developing new software and maintaining existing software.
Our intention is to raise the community's awareness of software quality issues and to emphasize the substantial lack of funding for scientific software development.