Abstract
Entamoeba histolytica is a globally important pathogen that is dramatically understudied. Its challenging genome has limited tractability. To enable forward genetics, we constructed the first genome-wide E. histolytica RNAi knockdown mutant library. The library is designed to enable deep sequencing analysis for quantitative identification of mutants after selection. We developed a novel analysis pipeline to precisely define and quantify full-length gene fragments inferred from read mapping. We performed the first E. histolytica RNAi screen and identified slow growth mutants. Growth phenotypes were reproducible in independently generated mutants. Some of the genes targeted in slow growth mutants had annotated functions consistent with roles in growth or metabolism. Some targeted genes lacked annotation, supporting the power of forward genetics in uncovering gene function. This work opens up the possibility of applying genetics to improve understanding of this important pathogen. Moreover, the strategies behind this RNAi library, and its analysis, are novel, and can be applied to other organisms.
Introduction
Entamoeba histolytica is the causative agent of amoebiasis in humans. E. histolytica is common in developing nations and is responsible for ~50 million infections/year [1]. Children under five are particularly vulnerable to diarrheal infections including amoebiasis, which are responsible for close to half a million deaths annually [2]. In the first two years of life, approximately 80% of infants in an urban slum in Dhaka, Bangladesh become infected with E. histolytica [3]. Malnutrition and stunting are associated with childhood E. histolytica infections [4]. Among children aged 12 – 23 months enrolled in the in the Global Enteric Multicenter Study, E. histolytica infection was associated with the highest risk of death (5%) between enrollment and follow up [5]. Invasive infections result in an estimated 15,500 deaths/year in children and 67,900 deaths/year among people of all ages [2]. There is no vaccine and therapy is limited to nitroimidazoles [6]. The continued morbidity and mortality indicate that current treatment and prevention approaches are insufficient.
Infection can be asymptomatic, or can lead to diarrhea, bloody diarrhea with intestinal ulceration, and fatal abscesses outside of the intestine. Fitting with its capacity to cause ulcers and abscesses, E. histolytica was named for its ability to lyse tissue (histo-: tissue; lytic-: dissolving). Despite the name-giving property, it is not known how amoebic trophozoites (“amoebae”) damage tissue [7]. The major known virulence factors are the surface Gal/GalNAc lectin [8, 9], which mediates attachment to human cells, and the cysteine proteases, which cleave human substrates [10, 11]. It was recently shown that amoebae kill human cells by “nibbling” them, in a process termed trogocytosis [12]. Amoebic trogocytosis contributes to tissue damage, and warrants further study for its potential relevance to invasive disease [12]. However, trogocytosis is not yet well-defined at the molecular level [13].
Despite its impact on human health, E. histolytica is dramatically understudied relative to other parasites. Given the limitations of current therapies and the limitations of our current understanding of pathogenesis, new tools are needed to study this infection. Although the genome has been sequenced, it is difficult to identify genes relevant to pathogenesis from sequence information alone. The genome is relatively difficult to manipulate experimentally. E. histolytica is thought to be tetraploid [14]. The genome contains a very large number of expanded gene families, with 897 protein families that contain 4,564 genes, and represent 56% of the proteome [15, 16]. The nucleotide composition of the E. histolytica genome is ~75% A + T, which creates challenges for genomic approaches and has contributed to the relatively poor draft genome quality. The genome is ~20 Mb with ~8,000 genes [15, 17, 18]. For one third of genes, no homologues have been identified [17] and less than half of genes have been assigned gene ontology (GO) terms [15]. Thus, with limited sequence homology information and a difficult genome, functional approaches such as forward genetics are needed to identify genes that are relevant to pathogenesis.
Historically, the genetic tractability of E. histolytica has been limited. CRISPR/Cas9 has not yet been applied for use in E. histolytica. Previous efforts to integrate sequences into the genome or to genetically manipulate endogenous loci have been unsuccessful. Stable transfection with episomal plasmids has been used in order to overexpress tagged proteins or dominant negative mutants [19, 20]. Forward genetics have not yet been established in E. histolytica, outside of two studies that used a plasmid-based overexpression library to screen for phenotypes [21, 22]. Relatively recently, the endogenous RNAi pathway in E. histolytica has been characterized, and exploited as a useful tool for targeted gene knockdown.
The existence of an RNAi pathway in E. histolytica that could be exploited for gene knockdown was implied by studies that successfully knocked down expression of the amoebapore A gene [23]. Amoebae were stably transfected with a plasmid containing a fragment of the amoebapore A promoter together with a segment of a retrotransposon and a short T-rich stretch [23]. While this led to permanent silencing of amoebapore A, the mechanism of action was unclear, and the expression of off-target genes was affected [24, 25]. It was possible to silence other genes, but only in amoebae that already contained the amoebapore A silencing plasmid. A clonal line of stable transfectants carrying the amoebapore A plasmid was named the “G3” strain, and was subsequently used to knockdown additional genes in the context of an amoebapore A knockdown mutant background [25]. Other strategies for gene knockdown have included expression of genes in the antisense orientation [26], expression of long double stranded RNA [27], phagocytosis of bacteria expressing dsRNA [28], and expression of short hairpin RNA [29]. These techniques have been useful for reverse genetic studies, but not robust enough to be effective for every gene of interest.
After it had already been exploited for targeted gene knockdown, the E. histolytica RNAi pathway was characterized in detail. Small RNAs were found to associate with a Piwi-related protein [30]. Small RNAs were 27nt in length, 5’ polyphosphorylated, and mapped antisense to genes [30]. Thus far, 5’ polyphosphorylated siRNAs have only been seen in E. histolytica and Caenorhabditis elegans [31]. The level of E. histolytica small RNAs was inversely correlated with gene expression [30]. Supporting that these small RNAs are relevant to previous approaches that exploited gene silencing in E. histolytica, in the G3 strain, 27nt small RNAs mapped to amoebapore A, and when additional genes were silenced in this background, new small RNAs mapped to the newly silenced gene [32]. There is no obvious Dicer homologue, but a potential noncanonical Dicer and Argonaute homologues have been characterized [33, 34].
Because endogenous small RNAs are polyphosphorylated, it was initially a challenge to determine how to effectively exploit this pathway for gene knockdown. A major breakthrough was the discovery that RNAi knockdown could be induced by spreading of silencing [35]. A new “trigger” plasmid approach was established, which relies on endogenous small RNAs directed to a 132 bp fragment of an endogenously-silenced gene (called the trigger) to drive spreading of silencing to a fragment of gene of interest that is adjacent to the trigger [35]. This leads to long-term silencing of the gene of interest, even after the trigger plasmid is removed, which may be due to the deposition of the repressive histone mark H3K27 at the chromosomal copy of the silenced gene [36]. This trigger plasmid approach has been successfully applied to knockdown many genes, and has been adapted for use in E. invadens [37–40].
In other parasites, RNAi knockdown has been successfully adapted for forward genetic approaches, and this has been very effective in uncovering genes relevant to parasite metabolism [41, 42], DNA repair [43], virulence [44], transmission [45, 46], and drug therapies [47–50]. Initially, these approaches relied on the generation of clonal lines after selection, and Sanger sequencing analysis of individual clones to identify the knocked down gene [41, 42]. Later approaches have used next generation sequencing on non-clonal mutant populations in order to identify mutants, and their relative abundance [43–51]. The latter approaches are useful because they can quantitatively identify mutants that are enriched by selection, and mutants that are less abundant or are lost after selection.
To enable forward genetics, we constructed and validated the first genome-wide E. histolytica RNAi knockdown mutant plasmid library. Its design allows for Illumina deep sequencing, and our novel analysis pipeline allows for quantitative identification of plasmids that are enriched or depleted after selection. Using a pilot version of the library, we performed the first RNAi screen in E. histolytica and identified 12 slow growth mutants. Genes that were knocked down in slow growing mutants included genes for which existing annotation provided no clues on function. Independent knockdown mutants phenocopied the growth defects seen in the library mutants. In the course of this work, new details of the E. histolytica RNAi pathway were revealed, as we found that both frame and directionality of the gene of interest are dispensable for silencing. These studies establish the first RNAi screen in E. histolytica, thereby dramatically improving the tractability of this pathogen.
Results
Approach for quantitative RNAi-based forward genetics
We devised an overall strategy for RNAi-based forward genetics in E. histolytica (Fig. 1). For this approach, a library of genes would be introduced into the E. histolytica “trigger” RNAi plasmid (pTrigger) [35], to create an RNAi library (Fig. 1a). Amoebae would then be stably-transfected with the library, and exposed to a selective pressure or screening strategy. Illumina deep sequencing analysis would then be used to sequence the inserts and to compare the selected mutants to control, non-selected amoebae, in order to identify mutants that are quantitatively enriched or depleted. We empirically determined that the stable transfection efficiency with the backbone of pTrigger is ~1/1000 amoebae. This relatively high stable transfection efficiency suggested that transfecting enough amoebae to achieve genome-wide RNAi knockdown would be feasible.
The Illumina priming sequences that are necessary for sequencing are typically added to samples by using a nested strategy with two sequential PCR reactions. We modified the trigger plasmid in order to flank the insert with Illumina adaptor sequences, creating pTriggerAdaptor, so that after extracting amoebic genomic DNA (gDNA), only a single PCR step would be needed to generate DNA ready for sequencing (Fig. 1b). Using fewer PCR reactions is expected to reduce the potential for PCR bias that can change the representation of sequences. An additional benefit of this modification is that the original sequences flanking the insert were A/T rich and were not optimal for primer design. The insertion of these Illumina sequences into the knockdown plasmid did not change knockdown efficacy (Fig. 1c).
Construction of a plasmid RNAi library
Building on these findings, we devised a strategy to construct a genome-wide plasmid RNAi library (Fig. 2a). We reasoned that since ~50% of the genome is coding, constructing a gDNA library would be a practical approach. To generate the plasmid library, random fragments of gDNA were inserted into the pTriggerAdaptor plasmid. Library construction was initially done on a small scale, to create a “pilot” library, where plasmids were harvested from a limited number of bacterial colonies. Building on the results of the pilot, a second library was constructed on a larger scale, to make a “final” library that was expected to have genome-wide coverage.
Previous studies defined that an insert size ≥ 500 bp is sufficient for knockdown [35, 52]. Larger insert sizes are likely to lead to more efficient knockdown, but fragments > 1 kb are problematic for Illumina sequencing. The pilot library was constructed using gDNA fragments between ~400 – 600 bp. Building on the robust Illumina sequencing results from the pilot library (described in detail below), we constructed the final library, where gDNA fragments were larger (between ~500 – 900 bp) in order to maximize the efficacy of RNAi knockdown.
To make the RNAi library, high quality, RNA-free, gDNA (Fig. S1a – b) was sonicated and size-selected (Fig. S1c). Y-shaped adaptors were ligated to the fragmented DNA to ensure that each fragment was flanked by sequences matching the trigger plasmid upstream and downstream of the cloning site, in order to enable subsequent Gibson cloning. Adaptor ligation was successful, with an appropriate size shift seen using bioanalyzer analysis (Fig. 2b – c). Gibson cloning and plasmid preps were done en masse with pooled plasmids. To check cloning efficiency, individual colonies were obtained and analyzed. Restriction analysis demonstrated that the ligation efficiency was very high, since 63 out of 67 plasmids examined contained an insert (Fig. 2d).
The RNAi library has genome-wide coverage
Illumina sequencing was used to analyze the pilot and final libraries, and the fragmented gDNA that was used to construct each library. Notably, our approach for analysis of the sequencing data is novel and differs from previous approaches for analysis of RNAi screens [43–51], since all results presented here are based on full-length DNA-fragments inferred from read mapping results (Fig. S2), rather than on more traditional read mapping coverage or CDS overlap counting strategies. Our approach was designed to better represent the actual DNA fragments present in the plasmid library, instead of just the short ends of these fragments that were sequenced on the Illumina platform. This strategy allowed us to precisely define whether a fragment might target one or multiple genes, and enabled a clearer view of fragment based genome coverage. This approach will also allow us to readily quantify the representation of gene fragments in future RNAi screen experiments, in order to identify specific fragments that are particularly impactful in a particular screen.
There did not appear to be bias in gDNA fragmentation or construction of the library, since the nucleotide and dinucleotide frequencies of the fragmented gDNA and the plasmids were the same as the genome (Fig. 3a and S3a). Rarefaction curves and read fragment analysis showed that sequencing was performed to a sufficient depth to identify all unique fragments within the plasmid samples, but the same depth of sequencing was not deep enough to identify all unique fragments within the fragmented gDNA samples (Fig. S3b – c).
Illumina sequencing analysis was performed on separate batches of the plasmid library that had been cloned on different days, and on technical sequencing replicates. Similar read counts per fragment were obtained from each batch and each sequencing replicate, showing that cloning and sequencing were reproducible and there was no evidence for bias within these steps (Fig. S4a – b, Table S1). Analysis of the sequencing results showed that the gDNA fragments that were used to create the pilot library had a mean size of 421 bp, and the fragments in the pilot plasmid library had a mean size of 427 bp (Fig. 3b). The gDNA fragments that were used to create the final library had a mean size of 540 bp, and the fragments in the pilot plasmid library had a mean size of 555 bp (Fig. 3b). The nearly identical size ranges of the gDNA fragments and cloned fragments further support that there was no bias during library construction.
Genome coverage plots of the final plasmid library over the 5 largest contigs in the genome visually showed that coverage of these contigs was complete, and the depth of coverage was consistent (Fig. 3c). The breadth and depth of coverage over a zoomed in region of the largest contig is represented in Fig. 3d. On a genome-wide level, the genome was well covered by fragments that were cloned into the library in both the forward and reverse orientation relative to the orientation of coding sequences (Fig. 4, Table 1). Supporting that the final library contained genome-wide coverage, 97.24% of the genome was represented (Table 1). 8307 of 8333 annotated genes were represented, and the majority (8301 genes) were represented by more than one unique fragment. Of the 26 genes that are missing, 22 are likely to be misannotated in the genome, since they were not seen in any sample, including fragmented gDNA (Fig. S5). Additionally, for the majority of these genes, there was no empirical evidence for gene expression in prior RNAseq or proteomic studies (Tables S2 and S3). From these findings, we concluded that the final plasmid library has genome-wide coverage. We named the final library pEhKDL, since it represents an E. histolytica knockdown library.
Forward genetic isolation of slow growing mutants
While the final library was under construction, E. histolytica was stably transfected with the pilot library. To determine if the RNAi library could be used to identify genes that contribute to a phenotype, we screened for mutants with abnormal growth. To first ask how many different plasmids were present per stably transfected amoeba, clonal lines were obtained using limiting dilution. Sanger sequencing analysis of clonal lines was consistent with only one unique plasmid (i.e., one unique insert) per transfected amoeba (Fig. S6).
To screen for growth phenotypes, growth rates were measured for a selected number of clonal lines. Slow Growth mutants 1 (SG1) and 2 (SG2) were isolated first and chosen for further characterization. These mutants had inserts corresponding to EHI_194310 and EHI_007010, which are annotated as a putative amino acid transporter and a hypothetical protein, respectively (Table 2). These mutants had a statistically significant growth defect (Fig. 5a). We have previously knocked down superoxide dismutase (EHI_159160) and determined that this results in a growth defect; thus, superoxide dismutase RNAi was used as a control. RT-PCR analysis showed that EHI_194310 and EHI_007010 were knocked down in the corresponding mutants (Fig. 5b).
To determine if knockdown mutant phenotypes were reproducible, Gibson cloning was used to generate independent knockdown plasmids that contained a fragment of either EHI_194310 or EHI_007010, which differed from the fragment in the library mutants. E. histolytica was stably transfected with each of these new plasmids to create the independent knockdown mutants, SG1i and SG2i. Growth of heterogeneous, stable transfectants was measured, and these mutants had statistically significant growth defects (Fig. 5c). RT-PCR analysis demonstrated that the corresponding target genes were knocked down (Fig. 5d). The library mutants and independently generated mutants each exhibited statistically significant growth defects (Fig. 5e). Thus, the phenotypes seen in library mutants were reproduced in independently generated mutants.
Bioinformatic analysis of slow growing mutants
From the pilot RNAi library transfection, ~64 additional clonal lines were obtained and 10 additional clonal lines with statistically significant growth defects were identified (Fig. S7; Table 2). The insert present in each clonal line was identified by Sanger sequencing, and the results were again consistent with one unique plasmid per transfected amoeba. Inserts ranged in size from 322 to 555 bp. In previous studies that used pTrigger for gene silencing, the gene of interest for knockdown was in the forward orientation and in frame with the trigger [35, 37–40, 52]. In the slow growth mutants, inserts were in the forward or reverse orientation, and in and out of frame (Table 2). Inserts that were in the reverse orientation and out of frame, like SG1, nonetheless led to measurable knockdown (Fig. 5b). Both the directionality of the gene of interest, and the frame, are therefore dispensable for knockdown.
Nine of the twelve slow growth mutants contained an insert that mapped to a gene (Table 2). The three mutants with inserts that did not clearly map to an annotated gene (SG3, SG9 and SG12) did map to potentially unannotated ORFs, and also partially mapped (i.e., contained stretches of homology) to annotated genes. BLAST was used to search for homologs of the ORFs that mapped to SG3, SG9, and SG12, and homologs were found in E. dispar for each ORF (Expect values of 4e -31, 1e -10, 1e -9, respectively), suggesting that these ORFs are genes that were missed in the annotation of the E. histolytica genome. The E. histolytica genome contains a very large number of gene families [15, 16], evidencing a high level of functional redundancy. One mutant, SG5, contained an insert that mapped to more than one gene (Table 2). Due to the level of identity between the multiple hits, multiple related genes may be knocked down in this mutant.
Several mutants contained inserts that mapped to genes with annotated functions consistent with roles necessary for cellular growth, such as an amino acid transporter, a major facilitator transporter, and a splicing factor (Table 2). Three mutants contained inserts with predicted roles in metabolic pathways, based on data from KEGG Metabolic Pathways (https://www.kegg.jp), the MetaCyc Metabolic Pathway Database [53], and Chemical Entities of Biological Interest (ChEBI) database [54] (Table 2). Several mutants contained inserts that mapped to genes annotated as hypothetical, or that lacked annotated domains (Table 2). Thus, in addition to validating the RNAi library approach, the 12 slow growth mutants that were uncovered shed light on genes that are required for E. histolytica growth and emphasize how forward genetics can be used to assign functional information to hypothetical proteins.
Discussion
We constructed and validated the first genome-wide E. histolytica RNAi knockdown mutant library. We performed a screen for slow growth mutants and identified 12 new mutants, providing a proof-of-concept that demonstrates the library can be used to identify mutant phenotypes. Independent knockdown mutants phenocopied the growth defects seen in the library mutants, showing that library mutant phenotypes are reproducible. Thus, our studies both identify genes that are critical for growth, and establish RNAi-based forward genetics in E. histolytica for the first time. The library represents a valuable new tool for E. histolytica, that can now be applied towards the study of many processes.
Slow growth mutants were identified among clonal knockdown mutant lines. 12 mutants were identified among ~80 total clonal lines. Among the identified genes targeted in these mutants, many genes had annotated functions consistent with roles in cellular growth, and/or had predicted roles in metabolic pathways, based on available metabolic pathway databases (https://www.kegg.jp/) [53, 54]. No genes were identified for which a role in cellular growth would be implausible. Several of the identified genes were annotated as hypothetical, or lacked annotated domains. All of the gene hits in our screen had previous evidence of gene expression in RNAseq analyses [55, 56]. Thus, the slow growth phenotype begins to provide functional information to genes for which existing annotation does not inform understanding of protein function. All of the annotated gene hits in our screen had homologues in the related human-infectious Entamoeba species E. dispar, E. nuttali and E. moshkovskii. Thus, this type of screen may be informative in identifying potential drug targets that are conserved among human-infectious Entamoeba species.
The E. histolytica genome contains numerous gene families [15, 16]. An advantage of RNAi vs. alternative approaches (e.g., gene knockouts) is the potential to simultaneously interfere with multiple gene family members. RNAi is similarly advantageous for genomes that are polyploid, which is also relevant to the potentially tetraploid [14] E. histolytica genome. We identified one mutant that contained an insert that mapped to more than one gene. With the high levels of sequence identity, it is likely that multiple genes are knocked down in this mutant. We were able to identify a slow growth phenotype in this mutant, which suggests that the RNAi library may be effective in identifying mutant phenotypes even in the context of high levels of functional redundancy.
We have used pTrigger to knockdown many genes in our laboratory, and, in general, have been able to recover knockdown mutants for essentially every targeted gene. Compared to clonal lines, heterogeneous transfectants have typically had more variable levels of knockdown, and sometimes appear to have lower levels of knockdown (unpublished data). Among clonal lines obtained from heterogeneous transfectants, we have seen wide variation, including clones with high levels of knockdown and clones that lack knockdown. The SG1 and SG2 clonal lines initially obtained from the library appeared to have an equal or higher level of knockdown than the heterogeneous independent SG1i and SG2i mutants that targeted the same gene. The heterogeneous independent knockdown mutant SG1i had a less severe growth phenotype than the clonal library mutant SG1, which may be due to variation in knockdown in the heterogeneous population. Loss of gene silencing (and loss of a growth phenotype) over time was noted in the heterogeneous SG2i mutants (data not shown), consistent with the selective disadvantage of the growth phenotype. Related to this point, when a selection is applied to a library mutant population, it will be important to compare selected and non-selected controls, in order to differentiate mutants that are lost due to a growth disadvantage vs. mutants that are lost due to selection.
The library was designed to facilitate Illumina deep sequencing analysis in order to quantitatively identify mutants that are enriched or lost after selection. We constructed pilot and final versions of the library. Multiple lines of evidence supported that there was no bias in library construction, including nucleotide and dinucleotide frequencies, and size comparisons between the gDNA fragments and fragments in the final plasmids. Libraries were prepared in batches that were cloned on different days. Similar read counts per fragment were obtained from batches, again supporting that there was no bias in cloning or sequencing analysis. The majority of the genome (97.24%) was covered in the final plasmid library, and the majority of genes were represented (8307 out of 8333 annotated genes). 26 annotated genes are not represented. Of these, 22 genes were not seen in any sample, including gDNA, suggesting that they represent errors in annotation. Consistent with this, there was no published evidence for gene expression for the majority of these genes. The 4 remaining genes that were not seen in the final library had some of the lowest sequencing coverage in the fragmented gDNA sample, suggesting that they were missed due to lack of abundance in the fragmented gDNA that was used for library construction. Thus, with 8307 out of 8311 correctly annotated genes represented, and 97.24% of the genome, we concluded that the library represented full genome coverage.
A future variation on this approach would be to make a library using fragmented cDNA instead of gDNA. We reasoned that since ~50% of the genome is coding, gDNA is a reasonable approach, since ~50% of the plasmids in the library would lead to gene silencing. A limitation of a gDNA library is that some sequences such as rRNA and tRNA, which are abundant in the genome, are represented in the library, and are not useful for identification of mutant phenotypes. A cDNA library would remove these sequences, but would carry other limitations. The most highly expressed genes would be overrepresented in a cDNA library, and genes that are not expressed in the conditions in which cDNA was obtained would not be represented in the library. The lower abundance of material for fragmentation and cloning is also a consideration. mRNA is a very small proportion of total RNA, and in our initial attempts to prepare cDNA for library construction, insufficient material was recovered. With more initial total RNA, a cDNA library could be constructed in the future, and would be complementary to the gDNA library that was constructed here.
We asked whether the frame or the directionality of the insert was critical for knockdown using the trigger plasmid. In previous studies, the gene of interest for knockdown was in the forward orientation and in frame with the trigger [35, 37–40, 52]; thus, the frame and orientation requirements had not been tested experimentally. Among the slow growth mutants that we identified, inserts were in the forward or reverse orientation, and in and out of frame. Note that the adaptor sequence that was inserted between the trigger and the insert does not change the frame. Inserts that were in the reverse orientation and out of frame still led to knockdown. Both the directionality of the gene of interest, and the frame, are therefore dispensable for knockdown. Since the RNAi pathway in E. histolytica is unusual and mechanistic details are still coming into focus, our results add to understanding of this pathway. The lack of orientation and frame requirements for knockdown is useful, since gDNA fragments are random and thus are ligated into the library plasmid in either orientation and in any frame. Previous studies had defined that a gene fragment ≥ 500 bp is sufficient for knockdown [35, 52]. We found that fragments as small as 322 bp were sufficient to result in slow growth phenotypes, suggesting that efficient knockdown is possible with smaller fragments.
E. histolytica is dramatically understudied relative to other parasites, even though it heavily impacts human health, causing significant morbidity and mortality. The RNAi library that we developed dramatically improves the tractability of E. histolytica. In other parasites, RNAi screens have uncovered genes relevant to molecular and cellular biology, virulence and therapeutics. We look forward to the new information about E. histolytica that can be gleaned by applying RNAi screens in the future. Since the design of this RNAi library allows for quantitative identification of mutants that are enriched or depleted after selection, we expect that this library can be used to identify genes involved in a wide variety of processes, including essential genes that are relevant to the development of new therapies.
Materials and Methods
Cell culture
E. histolytica HM1:IMSS (ATCC 30459) trophozoites were cultured in TYI-S-33 media, supplemented with 15% heat inactivated Adult Bovine Serum (Gemini Bio Products), 80 units/ml penicillin and streptomycin (Gibco), and 2.3% Diamond Vitamin solution 80 Tween 40x (Sigma Aldrich), at 35°C. Amoebae were harvested when flasks were approximately 80% confluent.
gDNA Extraction
E. histolytica gDNA was harvested using the Quick-DNA miniprep kit (Zymo), cetyl trimethylammonium bromide (CTAB) extraction, or the Wizard gDNA extraction kit (Promega). The Quick-DNA miniprep kit (Zymo) was used according to manufacturer’s instructions for cell suspension samples. For CTAB extraction, cells were washed and then lysed at 55°C for 1 hour in a solution of 0.1 M EDTA with 0.25% SDS and 0.1 mg/ml proteinase K. NaCl was added to a final concentration of 0.7 M, pre-warmed CTAB was added to a final concentration of 1%, and samples were incubated for 20 minutes at 65°C. An equal volume of chloroform was added, and samples were centrifuged at 20,600 x g for 10 minutes. Supernatants were transferred into fresh tubes, an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1) was added, and samples were centrifuged at 20,600 x g for 10 minutes. Supernatants were transferred into fresh tubes, DNA was precipitated with 2.5 volumes of 100% ethanol at room temperature for 5 minutes, and samples were centrifuged at 20,600 x g for 10 minutes. DNA pellets were washed in 70% ethanol. DNA was left to dry at room temperature and dissolved in water. For DNA harvested using the Quick-DNA miniprep kit or CTAB extraction, RNAseIF (NEB) was used to remove RNA contamination, according to manufacturer’s instructions. The Wizard Genomic DNA Purification kit (Promega) was used according to the manufacturer’s instructions for the tissue culture cells protocol. Because DNA harvested using the Wizard Genomic DNA Purification kit was of higher quality, gDNA for plasmid library construction was prepared using this kit.
Plasmids and cloning
The trigger plasmid [35] contains 132 base pairs of the trigger gene (EHI_048600) in the pKT3M expression plasmid backbone [57]. Q5 Site Directed Mutagenesis (NEB) was used to insert a SmaI restriction enzyme site after the trigger for cloning purposes, to create “pTrigger”. To create knockdown plasmids to target individual genes, pTrigger was linearized using SmaI (NEB) and dephosphorylated using rSAP (NEB). A fragment of the gene of interest was amplified from gDNA using PCR primers that contained homology to pTrigger, and the PCR product was then cloned into the backbone using the Gibson Assembly Ultra kit (Synthetic Genomics Inc.), following the manufacturer’s instructions. See Table S4 for the sequences of primers that were used during plasmid construction.
To test if the Illumina adaptor sequences interfere with knockdown, pTrigger plasmids were created with and without adaptors. First, a pTrigger plasmid targeting a BAR/SH3 domain-containing protein was created by inserting 812 bp of EHI_091500 (147 – 958). This plasmid was further modified using Q5 Site Directed Mutagenesis (NEB) to add the custom adaptors CACTCTTTCCCTACACGACGCTCTTCCGATCT and GATCGGAAGAGCACACGTCTGAACTCCAGTCAC, which are compatible with Illumina TruSeq sequencing primers, upstream and downstream of the insert, respectively. After determining that the presence of the Illumina adaptors did not interfere with knockdown, this construct was modified to remove the BAR/SH3 domain-containing protein and replace it with a SmaI restriction site for further cloning purposes, to create pTriggerAdaptor. pTriggerAdaptor was used as the backbone for plasmid library generation.
Superoxide dismutase knockdown mutants were generated by inserting 597 bp of EHI_159160 (1 – 597) into pTrigger. Independent knockdown mutants (SG1i) of SG1 were generated by inserting 525 bp of EHI_194310 (32 – 556) into pTrigger. Independent knockdown mutants (SG2i) of SG2 were generated by inserting 487 bp of EHI_007010 (59 – 545) into pTrigger.
Plasmid library construction
E. histolytica gDNA was harvested using the Wizard Genomic DNA Purification kit (Promega), following the manufacturer’s instructions for the tissue culture cells protocol. RNAse treatment is included in this protocol. Quality and size of gDNA was verified by Qubit and gel electrophoresis. gDNA was further purified and concentrated using TailorMag Purification beads (SeqMatic), and then sheared using a Covaris M220 sonicator. For size selection, sheared gDNA was loaded into three lanes on a 6% TBE gel. The gel was stained using SYBR Safe DNA Gel Stain (Invitrogen), and three bands, covering from ~400 – ~600 bp (for the pilot library) and ~500 – ~900 bp (for the final library) were extracted from the gel using the TailorCut Gel Cutter and Gel Breaker system (SeqMatic). gDNA was purified using TailorMag Purification beads. The sizing and concentration of the purified material was determined by TapeStation (Agilent). Purified gDNA corresponding to the second and third gel slices from each lane was combined. gDNA was further purified and concentrated using TailorMag Purification beads, and then processed using a TruSeq XT library prep kit (Illumina). All steps followed the manufacturer’s protocol except: 1. Ligation time was increased to 180 minutes, 2. Custom adaptors were used instead of Illumina TruSeq adaptors. The adaptors that were ligated had the following sequences: Adaptor 1: ACACTCTTTCCCTACACGACGCTCTTCCGATCCCC*T and Adaptor 2: /5’Phos/GGGGATCGGGAGAGCACACGTCTGAACTCTAGTCAC. After adaptor ligation, the gDNA fragments were amplified using 5 rounds of PCR with KAPA HiFi 2X master mix (KAPA Biosystems). See Table S4 for primer sequences. After PCR, fragmented gDNA was purified and concentrated using TailorMag Purification beads.
pTriggerAdaptor was linearized using SmaI (NEB) and dephosphorylated using rSAP (NEB). Linearization was verified using gel electrophoresis. 25 ng of fragmented gDNA with ligated adaptors was combined with 25 ng of linearized pTriggerAdaptor. The Gibson Assembly Ultra kit (Synthetic Genomics Inc.) was used according to the manufacturer’s instructions. High Efficiency 5-alpha Competent E. coli (NEB) were transformed with 3 μl of the Gibson mix and plated on four LB plates containing ampicillin (Sigma). We determined that the Gibson cloning efficiency with pTriggerAdaptor was ~1/250, which was used to calculate the amount of insert needed to generate a library at a given target level of coverage. With an average insert size of ~550 bp, a minimum of ~36,354 plasmids would be needed to fully cover the genome at 1X coverage. For the pilot library, one E. coli transformation reaction was performed and 4 plates of bacteria were harvested into 200 ml of LB liquid broth containing ampicillin, and incubated at 37°C for 12-18 hours, shaking. For the final library, Gibson cloning, E. coli transformation, and plasmid preparations were performed over four separate days to create four batches of final library plasmids. The final plasmid library represents the combined total of the four batches.
For the final library, 25 E. coli transformation reactions were performed and 100 plates of bacteria were harvested into LB liquid broth containing ampicillin, at a concentration of ~1,000 colonies per 200 ml. Plasmids were harvested using the GenElute HP plasmid maxiprep kit (Sigma), using 200 ml of culture per plasmid purification column, and ethanol precipitated according to the manufacturer’s instructions.
Illumina sequencing of gDNA fragments and plasmid samples
Fragmented gDNA (with ligated custom adaptors; see ‘plasmid library construction’), or plasmid inserts were amplified with KAPA PCR (Kapa Biosystems) using full length TruSeq Illumina primers, as follows: Primer 1: GATCGGAAGAGCACACGTCTGAACTCCAGTCAC[i7]ATCTCGTATGCCGTCTTCTGCTTG and Primer 2: AATGATACGGCGACCACCGAGATCTACAC[i5]ACACTCTTTCCCTACACGACGCTCTTCCGATCT. Three technical replicates were sequenced for each sample. Samples were sequenced on an Illumina MiSeq at either the DNA Technologies and Expression Analysis Core (University of California, Davis), or on an Illumina MiSeq at SeqMatic, LLC. Illumina sequencing data analysis was performed at the Bioinformatics Core (University of California, Davis).
Illumina sequencing analysis
Illumina sequencing data was demultiplexed using barcode sequences added during library preparation with Illumina demultiplexing software. All reads were sequenced as PE 151bp runs. The resulting paired reads were then cleaned using HTStream (https://github.com/ibest/HTStream) to 1) remove reads containing PhiX Control Library sequences, 2) trim Illumina sequencing adaptors, and 3) discard primer dimers and other reads where adaptor trimming was unsuccessful.
Cleaned reads were mapped against the E. histolytica release 46 genome assembly (downloaded from https://amoebadb.org/common/downloads/release-46/EhistolyticaHM1IMSS/fasta/data/AmoebaDB-46_EhistolyticaHM1IMSS_Genome.fasta) using bwa mem v2.0pre1 [58]. Output from bwa was passed to samtools v1.10 [59] for sorting and indexing, and also to extract reads from for visualization of specific loci. Results from mapping and subsequent analysis were visualized and reviewed in Geneious Prime 2020.0.3 (https://www.geneious.com). Identifying unique fragments and counting their frequency was done using a custom Python script. Briefly, the script iterates over read-sorted BAM records. Read pairs that were mapped to the same contig with proper orientation, a minimum of 146 matched bases, and no more than four bases soft clipped from the 5’ end of either read pair were used to identify fragments. Read mapping orientation was used to infer fragment orientation. A counter for each unique fragment was incremented for additional reads supporting the fragment. Reads that were mapped in proper orientation but that did not meet other criteria were compared against the fragment intervals with an incremented count assigned to the highest scoring overlap. As a last step, the script generates summary statistics, an interval table with counts for all samples, and SAF and gff annotation files for each sample.
Output from the Python script was then loaded into R v3.6.1 [60] for further analysis and to generate figures. Briefly, data.table [61] was used to load the full set of intervals, order them by genomic position, and to compute summaries by sample group. Tidyr [62]was used for reformatting data, and most plots were generated with ggplot2 [63]. The vegan package [64] was used for calculating saturation curve values. The GenomicRanges package [65] was used in combination with gene annotation (downloaded from https://amoebadb.org/common/downloads/release-46/EhistolyticaHM1IMSS/gff/data/AmoebaDB-46_EhistolyticaHM1IMSS.gff) to determine and count fragments targeting genes, all reported counts used a minimum overlap of 27bp. Genome coverage plots were generated with karyoploteR [66] and heatmaps were plotted using pheatmap [67]. Scripts and more details about the analysis are available at the GitHub repository associated with this publication: https://github.com/samhunter/Bettadapur-et-al-Entamoeba-histolytica-RNAi-library.
All sequencing data from this project have been deposited in the Sequence Read Archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra), under project number SUB7459667.
Transfection and clonal lines
Library plasmids were transfected into amoebae using Attractene transfection reagent (Qiagen) [12, 68]. 50 ml M199 (Gibco medium M199 with Earle’s salts, L-glutamine, and 2.2 g/liter sodium bicarbonate, without phenol red), was supplemented with 0.045 g L-Cysteine, 0.3 g HEPES, 0.005 g ascorbic acid (all from Sigma) and the pH was adjusted to 6.8. To 42.5 ml of this medium, 7.5 ml of heat inactivated Adult Bovine Serum (Gemini bio-products) was added to make transfection medium. Amoebae were resuspended in transfection medium at a concentration of 2.5×105 cells/ml, and then 1.8 ml of amoebae, 10 μg plasmid DNA (in 200 μl of transfection medium), and 37.5 μl Attractene reagent (Qiagen) were combined into a 2.0 ml screw-cap cryotube. Cryotubes were incubated horizontally at 35°C for 3 hours. Two cryotubes per transfection were then added to a T25 flask containing 66 ml of fresh TYI media. 24 hours later, 6 μg/ml geneticin (Invitrogen) was added. Stable transfectants were selected and maintained with geneticin at 6 μg/ml.
The efficiency of stable transfection was determined by performing limiting dilution. Three independent experiments were performed. Amoebae were transfected using the process described above, except that 24 hours after transfection, amoebae were harvested and resuspended in TYI at a concentration of 2×106 cells/ml. Amoebae were then serially diluted into 96-well plates, in the presence or absence of 6 μg/ml geneticin. Plates were incubated in GasPak EZ pouch (BD Biosciences) and growth was monitored daily. The plates without Geneticin were used to calculate the plating efficiency. In the plates without geneticin, growth was seen in wells corresponding to approximately 2.5 cells on average. In the plates containing geneticin, growth was seen in wells containing 781 to 3125 amoebae. Taken together, the corrected stable transfection efficiency was ~1/1000 amoebae.
For generation of clonal mutant lines for phenotypic characterization, limiting dilution was performed in 96-well plates in a GasPak EZ pouch (BD Biosciences). For identification of mutants with growth defects, clonal lines that visually appeared to have slow growth were selected for further analysis. RNA was extracted using the Direct-zol RNA kit (Zymo). gDNA (which contains exogeneous plasmids) was extracted using the Quick-DNA miniprep kit (Zymo). Sanger sequencing was used to identify the library insert associated with each clonal line. Sequencing was carried out by Genewiz. To identify the gene targeted for knockdown in each clonal line, the insert sequence was used to search the reference genome. Genes with at least 50nt of continuous sequence identity with the insert were considered hits. See Table S4 for sequencing primer information.
Growth analysis
Clonal stable transfectants were resuspended to 8×103 cells/ml and 1 ml of cells was added to three tubes containing 13 ml fresh TYI with geneticin. Cells were incubated at 35°C. At days 2, 4, and 6, one tube of cells was harvested per day and counted using a TC20 Automatic cell counter (BioRad). Vector control transfected amoebae (containing pTriggerAdaptor) were used as a negative control, and superoxide dismutase heterogeneous knockdown mutants were used as a positive control for slow growth. Two technical replicates were performed per experiment and three independent growth curve experiments were performed for each mutant.
RT-PCR
RT-PCR was used to assess knockdown in transfectants. RNA was extracted using the Direct-zol RNA kit (Zymo). DNAseI (Turbo) treatment and SuperScript Reverse Transcription II (Invitrogen) was performed as per manufacturer instructions to obtain cDNA. RT-PCR was performed on cDNA harvested from transfectants containing the following constructs: BAR/SH3 domain-containing protein with and without Illumina adaptors, Superoxide Dismutase; SG1/SG1 independent knock down; and SG2/SG2 independent knockdown. GAPDH (EHI_187020) RT-PCR was used as a positive control. See Table S4 for RT-PCR primer sequences.
Statistical analysis
GraphPad Prism was used for student’s unpaired t test statistical analysis and normalization (baseline removal). Mean values and standard deviations are shown in the figures, with t test values reported as follows: ns = p > 0.05, * = p < 0.05, ** = p < 0.01, *** = p < 0.001, **** = p < 0.0001.
Author Contributions
A.B. designed and performed all of the experiments, except fragmentation of gDNA. C.G.B. fragmented gDNA and processed the fragments to enable subsequent plasmid library construction. S.S.H. designed the analysis strategy for Illumina sequencing data, and analyzed all Illumina sequencing data. M.L.S. contributed to the overall strategy for constructing, sequencing and analyzing the plasmid library, and oversaw the Illumina analysis approach. K.S.R. conceived of the overall approach and oversaw the design and analysis of the experiments. A.B., S.S.H., and K.S.R. wrote the manuscript.
Competing Interests
The authors declare that they have no competing interests.
Supplementary Figure Legends
Figure S1. Preparation of RNA-free gDNA from E. histolytica.
Gel electrophoresis analysis of gDNA extracted using the Wizard gDNA extraction kit (Promega), Quick-DNA miniprep kit (Zymo), and CTAB extraction. a, gDNA isolated using the Wizard kit had no apparent RNA contamination, and the gDNA appeared to be of high quality. b, gDNA isolated using the Quick-DNA kit or CTAB was contaminated with RNA. RNA contamination was removed by treatment with RNAseIF and subsequent purification. c, Bioanalyzer traces of pilot and final gDNA samples, after fragmentation and size selection.
Figure S2. Fragment Mapping Analysis.
Example view of mapped genomic fragments (“intervals”) represented in the final plasmid library. For comparison, the mapped paired end reads that give rise to each fragment are shown below each fragment. Shown are results from final plasmid batch two, sequencing replicate one, mapped to contig DS571153.
Figure S3. Appropriate sequencing depth and lack of bias during library construction.
a, Dinucleotide frequency analysis of the HM1:IMSS reference genome, the pilot and final gDNA fragments, and the pilot and final plasmid libraries. Stacked bar plots show the percentage of each dinucleotide. The dinucleotide composition of the gDNA fragments and the final plasmid library is the same as the genome. b – c, Rarefaction curves for pilot (panel b) and final (panel c) samples. gDNA fragment samples are shown in grey and plasmid samples are shown in orange. The number of fragments detected with increasing sequence reads shows if sequencing was performed to saturation. Three sequencing replicates are shown for pilot gDNA samples and three sequencing replicates are shown for pilot plasmid samples. Three sequencing replicates are shown for final gDNA samples and three sequencing replicates are shown for each of four batches of final plasmid samples.
Figure S4. Reproducible fragment and read counts from biological and technical replicates.
“Batches” of the final plasmid library were cloned on different days, and represent biological replicates. “Reps” correspond to independently prepared sequencing preparations, and represent technical replicates. a, Box plot of read counts per fragment in plasmid samples, with pilot plasmids in orange and final plasmids in grey. Log2 read counts are shown; outliers (outside of 2nd and 98th percentile) are plotted as points, and the box represents the 25th, and 75th percentile with median indicated as a bold line. Most fragments were represented by multiple sequenced reads. b, Stacked bar plots of the total number of fragments detected per sample shows that replicates are similar to each other in the number of fragments identified and that roughly equal numbers of forward and reverse fragments were produced consistently. Forward fragments are in the same orientation as genes that are in the forward orientation in the reference genome. Reverse fragments are in the same orientation as genes that are in the reverse orientation in the reference genome.
Figure S5. Annotated genes that were missed or were least covered by sequencing reads.
Heat map showing the number of unique fragments per gene, for the 50 least covered genes in the pilot and final gDNA fragments, and the pilot and final plasmid libraries. The color intensity indicates Log2 fragment coverage, with dark blue indicating no coverage. Stranded fragments are in the same orientation as genes. Unstranded fragments are the total number of fragments, both in the same orientation as genes and in the opposite orientation as genes.
Figure S6. Clonal transfectants contain only one library plasmid.
SG1 gDNA was extracted using the Quick-DNA miniprep kit and PCR amplified using the CS5’ and CS3’ primers that sit in the pTriggerAdaptor backbone. Sanger sequencing was performed using the RNAi Sequencing Library F primer. Adaptor 2 starts at the “GGGG” series of nucleotides, and upstream of this is the SG1 (EHI_194310) sequence. Noncompeting peaks in the chromatogram suggest that there is one unique genome fragment present in this mutant.
Figure S7. Identification of slow growing RNAi knockdown mutants.
Growth analysis of mutants SG1 – SG12. a, c, e, Growth analysis demonstrates that selected clonal knockdown mutant lines have significant growth defects compared to vector control transfectants. b, d, f, Growth was normalized and compared to vector control transfectants on each day. Mutants exhibited statistically significant growth defects relative to vector control transfectants.
Supplementary Table Legends
Table S1. Sequencing and fragment count statistics.
Total number of paired reads mapping to the same contig (mappedPairs) and the number of fragments inferred from these mapped pairs for each sample. Reads were mapped with bwa mem, and processed from the resulting SAM formatted file using a custom Python script. Three sequencing replicates were performed for pilot gDNA samples and three sequencing replicates were performed for pilot plasmid samples. Three sequencing replicates were performed for final gDNA samples and three sequencing replicates were performed for each of four batches of final plasmid samples.
Table S2. Expression data for annotated genes that were missing in the final plasmid library.
Available empirical expression data for genes that were missing in the final plasmid library (genes that are represented in blue in Fig. S3). Accession numbers, annotation information, and the number of paralogs in the reference genome are indicated. Presence (+) or absence (−) in gDNA and plasmid datasets is shown. Gene expression from available RNAseq datasets [55, 56] is indicated, as the FPKM + 1 value. The highest value for FPKM + 1 for each gene is shown. The total number of spectra that have been identified in proteomics studies is indicated.
Table S3. Expression data for representative genes present in the final plasmid library.
To provide a frame of reference for the expression data in Table S2, shown are the available empirical expression data for 10 random, representative genes that were present in the final plasmid library. Accession numbers, annotation information, and the number of paralogs in the reference genome are indicated. Presence (+) or absence (−) in gDNA and plasmid datasets is shown. Gene expression from available RNAseq datasets [55, 56] is indicated, as the FPKM + 1 value. The highest value for FPKM + 1 for each gene is shown. The total number of spectra that have been identified in proteomics studies is indicated.
Table S4. Primer sequences.
Sequences of the primers used to generate and sequence plasmids, perform RT-PCR, and create the plasmid RNAi library.
Acknowledgements
We thank Hannah Miller, Rene Suleiman, Scott Dawson and Su-Ju Lin for helpful discussions. We thank Lutz Froenicke for helpful advice on designing the library to enable Illumina sequencing. DNA sequencing was carried out by the DNA Technologies and Expression Analysis Cores at the UC Davis Genome Center, supported by NIH Shared Instrumentation Grant 1S10OD010786-01. A.B. was supported by a fellowship from the UC Davis Training Program in Biomolecular Technology. This work was funded by NIH grant 1R01AI146914 and a Pew Scholarship awarded to K.S.R.