TY  - JOUR
T1  - Beware the Jaccard: the choice of metric is important and non-trivial in genomic colocalisation analysis
JF  - bioRxiv
DO  - 10.1101/479253
SP  - 479253
AU  - Stefania Salvatore
AU  - Knut Dagestad Rand
AU  - Ivar Grytten
AU  - Egil Ferkingstad
AU  - Diana Domanska
AU  - Lars Holden
AU  - Marius Gheorghe
AU  - Anthony Mathelier
AU  - Ingrid Glad
AU  - Geir Kjetil Sandve
Y1  - 2019/01/01
UR  - http://biorxiv.org/content/early/2019/03/04/479253.abstract
N2  - Background: The generation and systematic collection of genome-wide data is ever-increasing. This vast amount of data has enabled researchers to study relations between a variety of genomic and epigenomic features, including genetic variation, gene regulation, and phenotypic traits. Such relations are typically investigated by comparatively assessing genomic co-occurrence. Technically, this corresponds to assessing the similarity of pairs of genome-wide binary vectors. A variety of metrics have been proposed for this problem in other fields like ecology. However, while several of these metrics have been employed for assessing genomic co-occurrence, their appropriateness for the genomic setting has never been investigated. Results: We show that the choice of metric may strongly influence results and propose two alternative modelling assumptions that can be used to guide this choice. On both simulated and real genomic data, the Jaccard index is strongly affected by dataset size and should be used with caution. The Forbes coefficient (fold change) and tetrachoric correlation are less affected by dataset size, but one should be aware of increased variance for small datasets. Availability: All results on simulated and real data can be inspected and reproduced at https://hyperbrowser.uio.no/sim-measure
ER  - 