Dual RNA sequencing (dRNA-Seq) of bacteria and their host cells

Bacterial pathogens subvert host cells by manipulating cellular pathways for survival and replication; in turn, host cells respond to the invading pathogen through cascading changes in gene expression. Deciphering these complex temporal and spatial dynamics to identify novel bacterial virulence factors or host response pathways is crucial for improved diagnostics and therapeutics. Dual RNA sequencing (dRNA-Seq) has recently been developed to simultaneously capture host and bacterial transcriptomes from an infected cell. This approach builds on the high sensitivity and resolution of RNA-Seq technology and is applicable to any bacteria that interact with eukaryotic cells, encompassing parasitic, commensal or mutualistic lifestyles. We pioneered dRNA-Seq to simultaneously capture prokaryotic and eukaryotic expression profiles of cells infected with bacteria, using in vitro Chlamydia-infected epithelial cells as proof of principle. Here we provide a detailed laboratory and bioinformatics protocol for dRNA-seq that is readily adaptable to any host-bacteria system of interest.


Introduction
Background Upon infection or other interactions, bacteria and their host eukaryotic cells engage in a complex interplay as they negotiate their respective survival and defense strategies. Unraveling these coordinated regulatory interactions, virulence mechanisms, and innate responses is key for our understanding of pathogenesis, disease and the development of therapeutics [1]. Traditional transcriptomic approaches such as microarrays have typically focused on either the prokaryotic or eukaryotic organism to investigate the host-bacteria interaction network [2]. However, this approach cannot decipher reciprocal changes in gene expression that contribute to the global infection system.
Instead, an integrated approach is required that acknowledges both interaction partners, i.e. both bacteria and host, from the same biological sample. Due to the increasing affordability and resolution of next-generation sequencing, this is now achievable via dual RNA sequencing (dRNA-Seq) [1]. integrated with other data sources to establish a more complete picture of gene regulation, including genotyping data to identify genetic loci responsible for gene expression variation, epigenetic information to highlight the influence of transcription factor binding, histone modification, and methylation, and miRNA-Seq data to identify the regulatory mechanisms of gene expression changes via non-coding RNA [28,29].

Experimental design
For RNA-Seq, a typical workflow includes experimental design, RNA extraction, library preparation, sequencing, and data analysis. The experiment should be designed to address the biological question(s) of interest and a key starting point is to determine the RNA species to be investigated (i.e. mRNA, Dual RNA sequencing (dRNA-Seq) of bacteria and their host cells 6 miRNA, snRNA etc). This will influence the quantity of input RNA and sequencing depth required, which are both crucial factors for successful dRNA-Seq experiments. To capture sufficient RNA from both organisms, the ratio of bacteria to host genome size is a useful starting point followed by an estimation of the desired fold coverage. This can be determined by considering the number of replicates, the expected influence of housekeeping and structural RNA (rRNA and tRNA), the possibility of host:bacteria sequence overlap, the number of time-points, and the multiplicity of infection (MOI). We suggest at least three biological replicates for each sample rather than technical replicates taken from the same sample to minimize Type I and II errors and ensure an adequate estimation of within and between sample variation. As ~95% of total RNA will be ribosomal, a method of rRNA depletion and/or mRNA enrichment is recommended and several options are discussed below.
Sequence overlap between host and bacteria can be predicted by mapping the bacterial sequence reads to the host genome and vice versa, which is also discussed below. The time-points of interest should be carefully considered as the initiation and period of transcriptional response can differ between host and bacteria [30]. Ideally, multiple time-points should be collected to suitably capture the dynamic hostbacteria transcriptional landscape. Finally, a suitable bacteria MOI should be selected to maximize the transcriptional signal from both host and bacteria, while reducing bias towards the uninfected cells that will flourish at the later time-points. Importantly, a high(er) MOI may also lead to a heightened and/or distorted host response with decreased biological relevance, depending on the system under investigation. Deeper sequencing may be necessary for the detection of low copy number transcripts or alternate isoforms, however increased sequencing depth can also increase the detection of transcriptional noise, spurious cDNA transcripts, or genomic DNA contamination so careful consideration is required [31,32]. Optionally, the addition of RNA spike-ins and unique molecular identifiers (UMI) can be useful for the quantitative calibration of RNA levels [33,34].
This protocol describes the collection and analysis of protein-coding mRNA from C. trachomatis and Hela cells at 1 and 24 hours post-infection (hpi) time-points as they are physiologically relevant to the early and mid-stages of the chlamydial development cycle. An MOI of 1.0 is used to maximize chlamydial RNA recovery, however a limited quantity of bacterial RNA will be present at 1 hpi as replication would not yet have commenced. This caveat should be considered when estimating RNA quantities required. It is recommended that ~ 5 ´ 10 8 host reads and > 1 ´ 10 6 bacterial reads are Dual RNA sequencing (dRNA-Seq) of bacteria and their host cells 7 required for adequate coverage [4,35]. The Chlamydia to Hela genome size is ~1:3200 MB, indicating that Chlamydia's RNA accounts for ~0.03% of total host-bacteria RNA. As ~95% of this will be uninformative rRNA and tRNA [4], 0.0015% and 4.9985% of total RNA will represent informative bacteria and host mRNA, respectively. Given this ratio, 1 ´ 10 10 host reads and ~3.33 ´ 10 9 bacterial reads would be required to capture sufficient RNA from both organisms. Thus, to achieve sufficient coverage overall, > 1 ´ 10 10 reads would be required for dRNA-Seq of Chlamydia and host.

Infection
Host cells are cultured in Dulbecco's modified Eagle's medium (DMEM) (see Materials section) and are infected with C. trachomatis at a multiplicity of infection (MOI) of 1 to ensure that 100% of host cells are infected. As cycloheximide (an inhibitor of protein synthesis used to maximize chlamydial yields in vitro), is not used throughout this experiment, so the host cells are seeded at ~60% confluency at the time of infection to ensure continued host cell viability throughout the time-course of the study.
Given the time-sensitive nature of the experiment, it is critical to synchronize the initial infection by centrifugation, followed by the removal of dead or non-viable bacterial cells by washing twice with DPBS. HeLa and Chlamydia cells are harvested into sucrose phosphate glutamate (SPG) media, and frozen at -80°C prior to RNA extraction.

RNA Preparation
To ensure high-quality data, dRNA-Seq typically requires a relatively large amount of input RNA.
Extreme care must be taken to prevent DNA contamination or RNA degradation, which can be minimized by adhering to the protocol time and temperature requirements, purchasing highly pure and RNase-free reagents, and using RNA-free equipment and consumables. Always conduct RNA work in a clean environment that is partitioned from non-RNA work. Wherever possible we routinely use commercially available kits due to their reliability and reproducibility, however we have carefully optimized the manufacturer's instructions to suit this protocol.
Dual RNA sequencing (dRNA-Seq) of bacteria and their host cells 8 Total nucleic acid is extracted using a MasterPureä RNA Purification Kit (Epicenter) according to the manufacturer's instructions. At this stage it is critical to minimize the delay between host cell lysis and RNA extraction to avoid unwanted degradation. HeLa cells are lysed, host proteins digested, and total nucleic acid precipitated with isopropanol. It is critical to ensure the complete removal of contaminating DNA before proceeding. We have found that two treatments with TURBO DNA-freeÔ DNase (Thermo Fisher) is most effective. We perform three real-time qPCR assays for human targets and one endpoint PCR assay for C. trachomatis to confirm DNA removal. The qPCR assays are based on TaqManâ Gene Expression assays (Applied Biosystems) with primer and probe sets targeting beta actin, mitochondrially encoded ATP synthase 6, and eukaryotic 18S rRNA [1]. The endpoint PCR is based on custom-designed primer sets that are specific for C. trachomatis, which were designed using PrimerExpress software (Applied Biosystems).
As rRNA constitutes >95% of total RNA, a method of rRNA depletion should be considered to maximize the recovery of mRNA and reduce the sequencing depth required. There are a number of commercial kits available for nuclease digestion and size-selection; this protocol utilizes both a hybridization-based rRNA depletion and poly(A)-depletion step. For hybridization, cDNA oligonucleotides attach to complementary rRNA that is immobilized on magnetic beads; always ensure that the oligonucleotides are compatible with your organism(s) of interest. For this, we combine an equivalent volume of Ribo-Zero beads from both a Human/Mouse/Rat-specific and Gram-negative bacteria-specific Ribo-ZeroÔ rRNA Removal Kit (Epicenter), enabling simultaneous elimination of both host and bacterial rRNA. It is important to note that this method will not enrich immature mRNAs and non-coding RNAs; specific target enrichment techniques that are outside the scope of this protocol should be considered if these are of experimental interest. Aliquots of rRNA-reduced samples may be then subjected to poly(A) depletion to further enrich host mRNA transcripts and separate mRNA from rRNA contaminants. Poly(A)-depleted and rRNA-depleted eluates can also be further purified before being combined for library construction. The remaining RNA is concentrated and purified with a RNA Clean & Concentratorä-5 kit (Zymo Research). A Bioanalyzer (Agilent) is used to examine the concentration and quality of purified RNA via a capillary electrophoresis-based system. Dual RNA sequencing (dRNA-Seq) of bacteria and their host cells 9

Library Preparation and Sequencing
Prior to sequencing, total RNA is converted to cDNA [36]. There are a number of sequencing platforms currently available, including Illumina, SOLID, Ion Torrent, Roche 454, Nanopore, and Pacific Biosciences, and each are suited to specific purposes and should be investigated by the user according to their desired outcome. This stage of the (d)RNA-Seq protocol (library preparation and sequencing) is often outsourced to a commercial enterprise or central sequencing facility so the steps involved are outside the scope of this protocol. Each facility will provide detailed instructions on the quality and quantity of RNA required and the sample preparation guidelines. Nevertheless, here we provide some general guidelines based on our experience.
We generally use the TruSeq Sample Prep Kit for library preparation and sequencing by the Illumina platform. For this, the mRNA is chemically fragmented and primed with random hexamer primers.
First-strand cDNA synthesis occurs using reverse transcriptase, followed by second strand cDNA synthesis using DNA polymerase I and RNase H. The cDNA is purified and end-repaired and 3' adenylated. Adapters containing six nucleotide indexes are ligated to the double-stranded cDNA, which is purified with AMPure XT beads (Beckman Coulter) and enriched via polymerase chain reaction (PCR) amplification. While we suggest that paired-end reads > 50 nucleotides will promote increased fragment randomization and is a good guideline, longer reads will enable greater coverage, reduced multi-mapping, and improved transcript identification [37].

Data preparation
Sequence data from dRNA-seq comprises cDNA as input from the experiment, with the majority derived from the eukaryotic host (depending on the experimental conditions and system under study).
Thus, careful attention is required to accurately segregate the reads from each organism. For paired-end sequencing, host and bacteria read data is generally provided as two FASTQ format files, which are comprised of a unique read identifier, the sequence read, an optional alternate identifier, and the quality scores for each read position. These are examined for possible sample contamination by screening total reads against a sequence database with FastQ Screen (http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/) ( Figure 3). The reads are then checked for quality using FASTQC, a Java-based software that reports several quality control statistics and a judgment on each metric (pass, warn, fail) ( Figure 4) [36]. Short reads, low quality reads, and adapters are then removed with Trimmomatic [38]. Other available QC tools available include PRINSEQ [39] and FastX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/).
The organisms of interest and experimental question will dictate which mapping software is most appropriate; we currently use HISAT2 [40], a powerful yet efficient program capable of identifying the splice junctions between exons that are characteristic of eukaryotic data, while the short-read aligner, Bowtie2 [41], is sufficient for bacterial read mapping. Other non-splice-aware aligners for bacterial reads include SEAL [42] and SOAP2 [43], and alternative splice-aware aligners for host reads include MapSplice [44], STAR [45], and Tophat2 [46]. It is important to note that read aligners are an active area of research, with new tools and updates frequently appearing [47].
The combined host and bacteria reads are mapped to the host reference genome with specific settings to preserve unmapped (bacterial) reads, which are then mapped to the bacteria reference genome.
Assembly to a reference transcriptome is also possible, but this relies on the accuracy of annotated gene models which may restrict the discovery of novel genes and isoforms; this approach is more suitable when there is no reference genome available. If available for the organisms of interest, reference genomes and the annotation file can be obtained from either NCBI [48], UCSC [49], or Ensembl [50].
Each repository formats these files slightly differently so it is important to obtain both files from the same source. This protocol utilizes the GRCh37 release of the Homo sapiens genome and annotation file from Ensembl and the Chlamydia trachomatis serovar D genome from NCBI (NC_000117.1).
The resulting alignment files for both host and bacteria are sorted by position (i.e. chromosomal location) with Samtools [51] to produce alignment quality statistics, including the number of mapped reads, the number of mapped first mates and second mates (for reads from paired-end sequencing), reads with multiple hits in the genome, and host reads mapping to exonic, intronic, and intergenic regions. Ideally, > 70% of host reads should map to exonic regions of the genome while less than 5% of reads mapping to intronic regions and less than 1% of reads mapping to intergenic regions [52]. The Dual RNA sequencing (dRNA-Seq) of bacteria and their host cells 11 alignment file is further converted to a BigWig format for the visualization of the number of reads aligned to every single base position in the genome using Integrated Genome Viewer (IGV) ( Figure 5) [53], or other visualization tools such as UCSC Genome Browser or JBrowse [54]. Using IGV, the coverage of aligned reads across the genome for both host and bacteria can be visualized to identify genomic regions of high/low coverage that could indicate technical or biological errors, as well as host exon-intron boundaries, splice sites, exon junction read counts, and read strand [55]. The alignment files are then sorted by read name to facilitate feature counting.

Feature counting and normalization
Both host and bacteria counts are generated from their respective alignment files using the python wrapper script htseq-count from the HTSeq package [56]. This process quantitates the number of reads that align to a biologically meaningful feature such as exons, transcripts, or genes [56], and is guided by the reference annotation file. This protocol describes the quantification of reads on a gene level, where a gene is considered the union of its exons. Any reads that map to several genomic locations are automatically discarded by HTSeq and we generally take a conservative approach to also discard reads that overlap with more than one gene.
Sample read counts are collapsed into a single file containing a matrix of genes (rows) and samples (columns), with one file each for host and bacteria ( Figure 6). To minimize statistical noise and enable better adjusted p-values, the matrix is pre-filtered so that greater than three counts remain in more than two of the samples [36]. Additionally, the last five lines of the matrix containing statistics for ambiguous counts from htseq-count are removed. Raw counts are normalized to minimize technical bias due to transcript length and sequencing depth; there are several normalization methods available, including Reads Per Kilobase Per Million (RPKM) [20], EDASeq [57], Conditional Quartile Normalization (CQN) [58], Upper Quartile (UQ) [59], and Transcripts Per Million (TPM) [60], and each have their benefits. For the host counts we use Trimmed Mean of M-Values (TMM) which corrects for differences in RNA composition and sample outliers, while providing better across-sample comparability [61].

Data analysis
Prior to differential expression analysis, both a Multi-Dimension Scaling (MDS) plot and hierarchical cluster plot is constructed to visualize the distances between samples and help identify problematic and outlying samples (Figures 7 and 8). A metadata table is generated to list the experimental variables, which in turn guides the construction of design and contrast matrices, which are mathematical representations of the experimental design and a description of the relevant treatment comparisons, respectively. This protocol describes a simple design and contrast matrix that allows differential expression comparisons to be made between infected and uninfected cells within each time-point.
Additional time-points and other experimental factors would lend themselves to more complex design matrices and may be added by the user if required.
Due to the specific nature of the host and bacteria count data, distinct statistical analyses are required for each. For bacterial transcripts, TPM is the most appropriate measure of relative transcript abundance, but this approach can suffer from biases where the calculated abundance of one transcript can affected other transcripts in the sample. Alternatively, absolute abundance may be calculated with the use of spike-in controls. Whichever method is chosen, these abundances represent a descriptive characterization of the Chlamydia transcriptome at the 1-and 24hpi time-points, which is moderately informative on its own, but when analyzed in combination with the interacting host response can allow deeper insight into the host-bacteria interactome.
For the host, differential expression analysis is performed to identify genes that have changed significantly in transcript abundance between infected and time-matched non-infected samples [5].
There are multiple differential expression packages available, each with advantages, including BaySeq [62], Cufflinks [63], DESeq [64], edgeR [65], Salmon [66], and Kallisto [67]. Protocols for the Cufflinks suite and edgeR/DESeq packages have been described previously [5,68]. The majority of these packages model RNA-Seq counts via a negative binomial (NB) distribution and apply various approaches to calculate reliable dispersion estimates. Alternatively, this protocol describes the use of the Limma package which uses linear modeling to describe the expression data for each gene (Limma stands for "Linear Modeling for Microarray Data") [69]. In contrast to these other RNA-Seq packages, Dual RNA sequencing (dRNA-Seq) of bacteria and their host cells 13 Limma attempts to correctly model the mean-variance relationship between samples to achieve a more probabilistic distribution of the counts ( Figure 9). This has proven to be the best method for analyzing both simple and complex experimental designs of dRNA-seq experiments that incorporate different sample types and time-points [70].
The output for the host is a set of differentially expressed genes, applying a False Discovery Rate (FDR) cutoff ≤ 0.05 (i.e. 5% false positives), and at least two-fold up/down-regulation and expression levels greater than 1 percentile in either condition (Table 1). These lists can then be used as input for downstream analysis of the enrichment of gene ontology and metabolic pathways using several tools, including GOSeq [71], DAVID [72], and Ingenuity Pathway Analysis (IPA) Toolkit (QIAGEN Redwood City, www.qiagen.com/ingenuity).

Integration of host and bacterial transcriptomic data
The final and most powerful stage of a dRNA-seq experiment is the integration of host and bacterial data to draw biological conclusions that could not have been obtained by investigating each organism in isolation. Following Gene Ontology and pathway analyses, interacting host-bacteria responses can be identified, within and between time-points, to highlight pathogenic mechanisms in the bacterium and the reciprocal regulatory patterns, responses, and transcriptional reprogramming of the host (depending on the system under investigation).
Application dRNA-Seq can be used to address a number of experimental questions. Host differential mRNA and miRNA expression, differential exon usage, alternative splicing, and novel transcript and isoform discovery in response to the bacteria can be determined [5], which may be correlated with the transcriptomic response of the bacteria to determine interaction dynamics. These results can be further integrated with other sources of biological input, including genotyping data and epigenetic information (transcription factor binding, histone modification, methylation etc.) to contribute to a systems level definition of host regulatory mechanisms. Following seeding, sit plates on a bench at room temperature for 15 minutes to allow cells to settle, ensuring an even distribution of cells.

2.
The next day, infect all plates (except mock-infected control plates) at an MOI of 1.0 to ensure that 100% of the host cells will be infected.

3.
Centrifuge plates at 500 x g for 30 minutes at room temperature. Incubate plates at 37°C, 5% Centrifugation is important to synchronize the infections and time-points.

5.
At each time-point, wash cells twice with DPBS and add 1mL DPBS to each well. Harvest cells with a cell scraper and dispense solution into a 15 mL centrifuge tube. Store tubes at -80°C until all time-points are complete. Cells can be stored for up to six weeks at -80°C.

6.
Remove centrifuge tubes from -80°C freezer and thaw at room temperature.

7.
Pre-set heating block to 65°C. Epicenter) to each 300 µL of lysed sample and vortex vigorously for 10 seconds.

15.
Pellet the debris by centrifugation at 10,000 x g for 10 minutes at 4°C.

16.
Transfer the supernatant (containing total nucleic acid) to a clean 1.5 mL microcentrifuge tube and discard the pellet.

17.
Add 500 µL of isopropanol to the recovered supernatant and invert the tube 30-40 times. Do not vortex.

18.
Pellet the total nucleic acid by centrifugation at 10,000 x g for 10 minutes at 4°C.

19.
Carefully pour off the isopropanol without dislodging the pellet.

27.
Centrifuge tubes at 10,000 x g for 1.5 minutes.

28.
Transfer supernatant (containing the RNA) to a fresh 1.5 mL microcentrifuge tube.

29.
Remove 5 µL from each RNA sample and place in a clean 1.5 mL microcentrifuge tube.

30.
Add 95 µL of nuclease-free water (1:20 dilution).  25 The washed magnetic beads must be at room temperature for use in this step. The order of the addition (hybridized RNA to the magnetic beads) is critical for rRNA removal efficiency.

51.
Incubate microsphere wash tubes at room temperature for 10 minutes. Vortex at medium speed for 5 seconds, every 3-4 minutes.

52.
Following incubation, mix samples again by vortexing at medium speed for 5 seconds.

53.
Incubate samples in heating block at 50°C for 10 minutes.

54.
Transfer the RNA-microsphere suspension to a Microsphere Removal Unit (Ribo-Zeroä rRNA Removal kit; Illumina) and centrifuge at 12,000 x g for 1 minute at room temperature.
Save the eluate and discard the removal unit.
At this stage, the eluate should be ~100 µL. The minimum recommended sample volume for use with this kit is 50 µL.

56.
Add 1 volume of 100% ethanol to the mixture and mix well.

61.
Add 400 µL of RNA wash buffer to the column and centrifuge at 12,000 x g for 1 minute.
Discard the flow-through.

62.
Centrifuge the column in an emptied collection tube at 12,000 x g for 2 minutes. Carefully remove the column from the collection tube and transfer to a new RNase-free microcentrifuge tube.

63.
Add 20 µL of DNase/RNase-free water directly to one column matrix and let stand for 1 minute at room temperature. Centrifuge at 10,000 x g for 30 seconds.
The eluted RNA can be used immediately or stored at -80°C.

Library Preparation and Sequencing
The method of library preparation is dependent on the sequencing platform used and thus is outside the scope of this protocol. Commercial sequencing enterprises and sequencing centers will provide detailed guidelines on the library preparation and sample submission guidelines that begin with the isolation of pure mRNA as described above. Following sequencing, the user will be provided with a series of raw FASTQ sequence files that serve as the input for the following bioinformatics section. The analysis of dRNA-seq experiments is a computationally intensive process that requires the manipulation of gigabytes of data. Access to a computer cluster, core facility, or cloud service is recommended to expedite the analysis and free up resources on the local system.

Operating system
This protocol provides commands that are designed to run on a Unix-based operating system such as • Bowtie2: Short-read aligner (http://bowtie-bio.sourceforge.net/index.shtml).
Always check that you are downloading and installing the latest version of each piece of software and consult the official user guide for more in-depth guidelines and options for troubleshooting any errors that may arise.

Samples and filenames
The protocol is arranged so that identical naming conventions are used for each sample and condition. host and bacteria would have names ending in "rep2" and "rep3", for conciseness this protocol describes the commands using "1hpi_Host_infected_rep1" as an example and it is expected that the user will repeat the process for the remaining replicates and samples. The filenames associated with raw FASTQ sequence files will depend on the sequencing facility pipeline and in this protocol are named "fastq_file_1_R1.fq", where "R1" indicates read number one of paired end reads (the corresponding read file would be "fastq_file_1_R2.fq". In some cases an output directory is required, which is noted as "<output_directory>" for the user to input their working directory of choice (without the "<>" symbols). Finally, reference, annotation, and gene info files are prefixed with the relating organism, i.e. "host_reference.fa" indicates a FASTA file containing the host reference genome.

Equipment setup
Download and install the following software. Check the developer website to ensure you are installing the latest version and for further information about dependencies and prerequisites.

Create a directory to install program executables and add to PATH
$ mkdir $HOME/bin $ export PATH=$HOME/bin:$PATH $ echo "export PATH=$HOME/bin:$PATH" >> ~/.bashrc  In this command, "-noextract" tells FASTQC to not uncompress the output file, while "-o" defines the output directory. "fastq_file_1_R1.fq" is the FASTQ sequencing file. These commands produce a quality report with results saved to the directory defined by "<output_directory>". The results are reported in both illustrated form (the "fastqc_report.html" file) and text form (the "summary.txt" file).

FastQ-Screen installation
Repeat for all FASTQ files. As the FASTQ files are derived from total RNA sequencing, this step includes both host and bacterial sequences. See Figure 4.

3.
Remove sequencing adapters and low quality reads using Trimmomatic: Run this command from the Trimmomatic installation directory. The command specifes PE as pairedend data, 6 threads and the FASTQ files are encoded with Phred + 33 quality scores.
"fastq_file_1_R1.fastq.gz" and "fastq_file_1_R2.fastq.gz" specify the input FASTQ files to use. As paired-end data is inputted, four output files are needed to store the reads. Two 'paired' files from which both reads survived after processing, and two 'unpaired' files from which a single read survived, but the corresponding mate did not. "ILLUMINACLIP:adapters.fa" uses the "adapters.fa" file containing sequences and names of commonly used adapters to remove. "2:30:10" are three parameters used in the 'palindrome' mode of Trimmomatic to identify the supplied adapters, regardless of their location within a read. For a detailed description of the best use of these three parameters, consult the Trimmomatic manual. "LEADING:3" and "TRAILING:3" removes a base from either the start or end position if the quality is below "3". "SLIDINGWINDOW:4:15" performs trimming based on a sliding Dual RNA sequencing (dRNA-Seq) of bacteria and their host cells 35 window method. "4" is the window size and "15" is the required average quality. By examining multiple bases, if a single low quality base is encountered, it will not cause high quality data later in the read to be removed. Finally, "MINLEN:36" removes any remaining reads that are less than 36 bases long. Repeat for all FASTQ files. As above, this step includes both host and bacterial sequences.

4.
Build host transcriptome index and align host sequence reads to reference using HISAT2: This "-x" specifies the path to the index previously built by Bowtie2: "host_reference.index". The "un-conc" argument tells HISAT2 to write a fastq rile containing all unmapped reads ("pair1_unmapped.fastq"), and "-1" and "-2" specify the paired-end fastq file mates. The "samtools view -bS -" argument converts to the output file from SAM to BAM format. Ensure that the HISAT2 output files, "accepted_hits.bam" and "pair1_unmapped.fastq" are preserved in the working directory as these are required for bacterial read mapping.

Sort BAM files generated by HISAT2 by both name and position using Samtools:
$ samtools sort accepted_hits.bam -o 1hpi_Host_infected_rep1.sorted_position $ samtools sort -n accepted_hits.bam -o 1hpi_Host_infected_rep1.sorted_name The first command takes the "accepted_hits.bam" file and sorts it by position, with the output file called "1hpi_Host_infected_rep1.sorted_position". In the second command, "-n" tells Samtools to sort the "accepted_hits.bam" file by name, with the output file called "1hpi_Host_infected_rep1.sorted_name". Repeat for all BAM files. The first command indexes the reference genome by creating a "host_reference.fa.fai" output file. The second command extracts the first two fields (sequence ID and sequence length) to generate the "host_reference.genome" file. The third command generates a histogram illustrating alignment coverage according to the reference genome. The "-split" argument tells genomecov to take into account spliced BAM alignments (as we used the splice-aware aligner HISAT2 for the host reads), while the "-bg" argument tells genomecov to report genome-wide coverage in bedGraph format. "ibam 1hpi_Host_infected_rep1.sorted_position.bam" is the input file in BAM format, "-g host_reference.genome" is the reference genome in FASTA format, and "1hpi_Host_infected_rep1.sorted_position.bedGraph" is the output file in bedGraph format. The fourth command converts this bedGraph file to BigWig format for use with IGV (below).

Index the 'sorted by position' BAM file for visualization in IGV:
$ samtools index 1hpi_Host_infected_rep1.sorted_position.bam This command takes the "1hpi_Host_infected_rep1.sorted_position" bam file created above and creates an indexed "1hpi_Host_infected_rep1.sorted_position.bai" file for use with IGV.

Create count matrix with HTSeq:
$ htseq-count -s no -a 10 -r name -f bam 1hpi_Host_infected_rep1.sorted_name.bam Host_annotation.gtf > 1hpi_Host_infected_rep1.sorted_name.count This command calls the htseq-count python wrapper script which performs the gene-level counts. "-s no" indicates that the reads are unstranded and "-a 10" sets the minimum mapping quality for a read to be counted as 10. "-r name" indicates that the input file is sorted by name and "-f bam" indicates that this input file is in BAM format and named 1hpi_Host_infected_rep1.sorted_name.bam".
"host_annotation.gtf" is the GTF annotation file from Ensembl and "1hpi_Host_infected_rep1.sorted_name.count" is the output file produced. Repeat command for each BAM file, which will produce a series of text files counting the gene-level reads for each sample. The last five lines of each file contain a list of reads that were not counting due to alignment ambiguities, multi-mapping, or low alignment quality.
Dual RNA sequencing (dRNA-Seq) of bacteria and their host cells 38

Set working directory in R:
> R > data <-setwd(…) The first command opens R, and the second command sets the working directory to the folder location containing all the relevant files from step 12. Replace "…" with the complete path to this location, for example: setwd("/home/username/dRNA-Seq/HTSeq_counts").

Combine count files into a DGEList in R:
> library(edgeR) > counts.host <-readDGE(list.files(pattern = ".count"), data, columns = c(1,2)) A DGEList is an R object from the edgeR package that efficiently compiles the count dataset and experimental variables that is fed into subsequent downstream analyses. The first command loads edgeR into the current R workspace. The second command creates a variable called counts_host, which is a DGEList containing all data from columns 1 and 2 from all files in the current working directory ending in ".count". See Figure 6. It is often helpful to visualize the count matrix at this point to confirm that it is formatted correctly and that there are no errors. The second command provides the matrix dimensions, which is useful for determining the number of genes remaining following independent filtering.

17.
Apply TMM normalization to the raw counts:

Apply voom transformation to normalized counts:
> library(limma) > y <-voomWithQualityWeights(counts = counts.host, design = design, plot = TRUE) > dev.off() This command applies a voom transformation to the counts, by converting them to log-counts per million with associated precision weights [70]. We generally extend this by using the voomWithQualityWeights function, which applies sample-specific weights to down-weight any outlier samples. This can be especially useful if outliers were identified in the MDS plot constructed in step 21 (below). This function takes as input the normalized count matrix ("counts_host") and design matrix ("design") and outputs two quality control plots: an estimation of the mean variance relationship and containing the voom transformation plots. See Figure 9. The first two commands estimate expression fold changes and standard errors by fitting a linear model to each gene, using the comparisons defined by the contrast matrix ("contrasts"). The third command applies empirical Bayes smoothing to the standard errors to further weaken any outliers.
Additionally, a log fold change (LFC) threshold may be included by adding an "lfc = 2" argument, which would return all DEG with a log fold change in expression greater than two. See Table 1.

24.
Annotate the differentially expressed transcript tables with gene symbol, description, and type information: > library(org.Hs.eg.db) > gene.info <-select(org.Hs.eg.db, key = rownames(top_1hpi), keytype = "ENSEMBL", columns = c("ENSEMBL", "SYMBOL", "GENENAME")) These commands are derived from the Limma documentation [73] and extract gene annotation information stored in both the org.Hs.eg.db R package and the "host_gene.info" that was previously downloaded to the working directory to annotate the DEG list with gene symbol and gene name.
Repeat for the "Host_24hpi" DEG list.

25.
Write the annotated differentially expressed transcript The first command indexes the bacteria reference genome "-f bacteria_reference.fa" and generates the "bacteria_reference_index" output file. The second command performs the read mapping using the unmapped reads from the Host mapping step ("pair1_unmapped.fastq").
Sort the BAM files by both name and position, convert the "sorted by position" BAM files to BigWig format and visualize with IGV. Create count matrix with HTseq.
Combine the count files into a DGEList, remove the last five rows from the counts, filter counts to remove low expression genes, and inspect the counts for errors.

33.
Write TPM values to file:

RNA quality and quantity
The Bioanalyzer trace will provide an indication of sample quality, where large, well-defined and high molecular weight peaks are expected. Low molecular weight peaks with low definition are usually evidence of RNA degradation and the experiment should not proceed further if this is the case.

Raw sequence quality checks
Following the analysis of the FASTQ files with FASTQC (step 2), a HTML report is generated which provides a judgement on several sequence quality parameters. Of particular importance are the following: The per base sequence quality plot should indicate a lower quartile above 10 (corresponding to 90% accuracy), while the per sequence quality score should have a mean base quality of 25 or higher (for RNA-seq reads, it is expected that base quality will decrease in later sequencing cycles), with the histogram curved to the right. The per base sequence content plot should indicate an even proportion of each base, and the per sequence GC content plot should demonstrate a normal distribution of GC Dual RNA sequencing (dRNA-Seq) of bacteria and their host cells 46 content; an abnormal distribution is likely evidence of contamination. In both the per base sequence content plot and per sequence GC content plot, a bias will be observed at the first bases which is specific to Illumina sequencing. The k-mer content plot will indicate the presence of k-mers at the early bases (also Illumina sequencing bias), but should otherwise not be at a higher than expected frequency.
The read length distribution plot should indicate that the read length is at least 50 nucleotides, as represented by a single peak.

Visualize alignments in IGV
IGV allows the visualization of reads mapped to the reference genome and provides an efficient way of identifying problematic data. The alignments should agree with the known gene structures such as intron placements, and the reconstructed transcripts should sufficiently represent the alignments. Verify that genes are roughly evenly covered with reads and if there are known differentially expressed genes, confirm differential coverage between sample groups. Splice junctions can also be visualized for the host reads.

Feature counting
HTseq will generate a separate tab-delimited file for all samples and replicates, containing the read count for every gene in the annotation file. This file is usually represented with two columns: the Ensembl gene identifier, and the read count: a number indicating how many reads overlap with the feature (gene) listed in the annotation file [5].

Multi-dimensional scaling plot and hierarchical clustering plot
In the MDS plot, replicate samples from each condition should generally cluster together, although some biological variation is expected. High sample variation may indicate noisy data and can be addressed by removing outlier samples, applying sample-specific quality weights as described in step 20, or applying additional normalization factors using specialized packages such as RUVseq [74].
Samples that cluster according to technical parameters, such as the time and date of sample collection, Dual RNA sequencing (dRNA-Seq) of bacteria and their host cells 47 usually indicates evidence of batch effects which can be corrected by incorporating the effect into the design matrix (step 19). See the Limma user guide for more details on this [73].

Host differential expression analysis
Following gene annotation (step 25), the table of differentially expressed host genes compiles a number of columns containing significant information comprising the output of the host side of the experiment.
The first column lists the Ensembl ID for each gene identified as being differentially expressed, while the remaining columns provide the statistical output for each gene, including (1) the log 2 fold-change (log FC) measured between experimental conditions, (2) the average log 2 expression (AveExp) for each gene, (3) the moderated t-statistic, which is the ratio of the log 2 fold-change to its standard error,

Bacterial transcript abundance analysis
The TPM values generated in step 35 are representative of the relative expression levels of the bacteria. Table 1. Statistical output of the differential expression analysis of host reads in R. The first column contains the ENSEMBL ID for the genes, logFC indicates the log fold-change observed, AveExpr is the expression value for each gene, t is the moderated t-statistic, P.Value is the raw p-value, adj.P.Val is the false discovery rate-adjusted p-value, and B is the log odds that the gene is differentially expressed.