Summary
Transcriptomic data is accumulating rapidly; thus, development of scalable methods for extracting knowledge from this data is critical. We assembled a top-down transcriptional regulatory network for Escherichia coli from a 1035-sample, single-protocol, high-quality RNA-seq compendium. The compendium contains diverse growth conditions, including: 4 temperatures; 9 media; 39 supplements, including antibiotics; and 76 unique gene knockouts. Using unsupervised machine learning, we extracted 117 regulatory modules that account for 86% of known regulatory network interactions. We also identified two novel regulons. After expanding the compendium with 1675 publicly available samples, we extracted similar modules, highlighting the method’s scalability and stability. We provide workflows to enable analysis of new user data against this knowledge base, and demonstrate its utility for experimental design. This work provides a blueprint for top-down regulatory network elucidation across organisms using existing data, without any prior annotation.
Highlights
- Single-protocol, high-quality RNA-seq dataset contains 1035 samples from Escherichia coli covering a wide range of growth conditions
- Machine learning identifies 117 regulatory modules that capture the majority of known regulatory interactions
- Resulting knowledge base combines expression levels and module activities to enable regulon discovery and empower novel experimental design
- Standard workflows provided to enable application of the knowledge base to new user data

Competing Interest Statement
The authors have declared no competing interest.
Footnotes
This manuscript has been modified to reflect an updated/expanded version of the underlying dataset. The analyses in the manuscript have been repeated on this larger dataset. Certain figures have been modified for clarity. A section on use cases for the dataset has also been added.