PT - JOURNAL ARTICLE
AU - Robert C. Edgar
AU - Jeff Taylor
AU - Victor Lin
AU - Tomer Altman
AU - Pierre Barbera
AU - Dmitry Meleshko
AU - Dan Lohr
AU - Gherman Novakovsky
AU - Benjamin Buchfink
AU - Basem Al-Shayeb
AU - Jillian F. Banfield
AU - Marcos de la Peña
AU - Anton Korobeynikov
AU - Rayan Chikhi
AU - Artem Babaian
TI - Petabase-scale sequence alignment catalyses viral discovery
AID - 10.1101/2020.08.07.241729
DP - 2021 Jan 01
TA - bioRxiv
PG - 2020.08.07.241729
4099 - http://biorxiv.org/content/early/2021/03/14/2020.08.07.241729.short
4100 - http://biorxiv.org/content/early/2021/03/14/2020.08.07.241729.full
AB - Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, now exceeding multiple petabases and growing exponentially [1, 2]. We developed a cloud computing infrastructure, Serratus, to enable ultra-high throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase, identifying well over 10^5 novel RNA viruses and thereby expanding the number of known species by roughly an order of magnitude. We characterised novel viruses related to coronaviruses and to hepatitis δ virus, and explored their environmental reservoirs. To catalyse a new era of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics. Competing Interest Statement: The authors have declared no competing interest.