Abstract
Biologists increasingly have to deal with objects that have non-numeric descriptions: texts (e.g. genetic sequences or even whole genomes), graphs, images, etc. There may even be no variables or descriptions at all, when the variability of objects is defined by a similarity matrix. It is also possible to have too many variables (e.g. on the order of millions in mass spectrometry or genome research). In this case it is necessary to switch to object similarity matrices, which drastically reduces dimensionality to hundreds or thousands. It is the software developer's responsibility to keep these use cases in mind and provide means for working with such data instead of shifting the problem to the users. Software should be more convenient for them and allow solving a wider range of problems with a fairly simple mathematical apparatus. In particular, principal component analysis (PCA) is rather popular among biologists. However, the necessity of variables is an illusion: it is enough to have a matrix of Euclidean distances between objects and to apply the method of principal coordinates (PCo), or multidimensional scaling (MDS) for a dissimilarity matrix [1].
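To illustrate that coordinates can be recovered from distances alone, the following is a minimal sketch of principal coordinates via classical (Torgerson) scaling. It assumes NumPy is available; the function name `principal_coordinates` and the toy data are hypothetical and are not taken from JACOBI4.

```python
import numpy as np

def principal_coordinates(dist, n_components=2):
    """Classical scaling of a square matrix of Euclidean distances.

    Hypothetical helper for illustration only, not part of JACOBI4.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    # B is symmetric, so eigh applies; it returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by rounding
    top = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top)

# Toy example: four objects known only through their pairwise distances
points = np.array([[0.0], [1.0], [3.0], [6.0]])
dist = np.abs(points - points.T)
coords = principal_coordinates(dist, n_components=1)
```

The recovered coordinates reproduce the input distances up to translation and reflection, which is all that an ordination plot requires.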
In the late 1970s B. Efron proposed generating a set of new samples from the empirical distribution function (EDF) of the source sample, used as a model of the population distribution, in order to obtain confidence estimates. He called this method "bootstrap" [2]. For developers of statistical software this primarily means that PCo, MDS, and bootstrap should be implemented. Furthermore, the use of bootstrap greatly increases the number of repetitions of the data analysis (from hundreds to millions of times), which is impossible to do in interactive mode. Therefore the part of the analysis that requires bootstrap should be written as a script in its entirety, with no further user interaction. Obviously, this process can be run efficiently in parallel.
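The resampling idea can be sketched as follows, using only the Python standard library. This is an illustrative percentile bootstrap under assumed settings (2000 resamples, a 95% interval, a fixed seed); the function name `bootstrap_ci` and the sample values are hypothetical and do not come from JACOBI4 or from [2].

```python
import random
import statistics

def bootstrap_ci(sample, statistic=statistics.mean,
                 n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    n = len(sample)
    # Each resample draws n values with replacement, i.e. from the sample's EDF
    estimates = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up measurements, used only to exercise the function
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9]
low, high = bootstrap_ci(sample)
```

Each resample is independent of the others, which is why such a loop needs no user interaction and can be distributed across processes.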
There is a multitude of tools for doing this, ranging from scripting languages like R or Python to specialized software packages like PAST, CANOCO, Chemostat, STATISTICA, and MATLAB. Researchers who are not versed in software development tend to use tools like PAST, even though these may not cover all their needs, including the automation of frequently performed tasks. However, automated analysis is a key element of the upcoming era of bootstrap analysis.
We developed a simple and convenient package, JACOBI4, which allows researchers without programming experience to automate multidimensional statistical analysis. The package and the methods implemented in it can be useful in studies of both medical (gene expression for various diseases) and biological (regularities of molecular sequence variability) data. The use of JACOBI4 is in no way limited to these examples. The package can be used directly by taking already developed scripts and editing them to fit one's own needs. JACOBI4 is freely available at [w1]. Articles in which JACOBI4 is used to process real-world data are also available, along with supplemental files containing the JACOBI4 scripts and their data.