RT Journal Article
SR Electronic
T1 Symbiont-Screener: a reference-free filter to automatically separate host sequences and contaminants for long reads or co-barcoded reads by unsupervised clustering
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2020.10.26.354621
DO 10.1101/2020.10.26.354621
A1 Mengyang Xu
A1 Lidong Guo
A1 Chengcheng Shi
A1 Xiaochuan Liu
A1 Jianwei Chen
A1 Xin Liu
A1 Guangyi Fan
YR 2020
UL http://biorxiv.org/content/early/2020/10/26/2020.10.26.354621.abstract
AB Decontamination is necessary to eliminate the effect of foreign genomes on symbiont studies and biomedical discoveries. However, directly extracting host sequencing reads without references remains challenging. Here, we present a trio-based method to classify host error-prone long reads or sparse co-barcoded reads prior to assembly, free of any alignment against DNA or protein references. The method first identifies high-confidence host reads by haplotype-specific k-mers inherited from the parents, and then groups the remaining host reads by unsupervised clustering. Experimental results on artificially mixed data demonstrated that this approach successfully classified up to 97.38% of the host human long reads with a precision of 99.9999%, and 79.95% of the host co-barcoded reads with a precision of 98.36%. Moreover, the tool also performed well on the decontamination of real algae data. The purified reads reconstructed two haplotypes and improved the assembly, with a larger contig NGA50 value and fewer misassemblies. Symbiont-Screener can be freely downloaded at https://github.com/BGI-Qingdao/Symbiont-Screener. Competing Interest Statement: Authors are employees of BGI Group.