Identification and characterization of long noncoding RNAs and their association with acquisition of blood meal in Culex quinquefasciatus

The Southern house mosquito, Culex quinquefasciatus (Cx. quinquefasciatus) is an important vector that transmit multiple diseases including West Nile encephalitis, Japanese encephalitis, St. Louis encephalitis and lymphatic filariasis. Long noncoding RNAs (lncRNAs) involve in many biological processes such development, infection, and virus-host interaction. However, there is no systematic identification and characterization of lncRNAs in Cx. quinquefasciatus. Here, we report the first ever lncRNA identification in Cx. quinquefasciatus. By using 31 public RNA-seq datasets, a total of 4,763 novel lncRNA transcripts were identified, of which 3,591, 569, and 603 were intergenic, intronic, and antisense respectively. Examination of genomic features revealed that Cx. quinquefasciatus shared similar characteristics with other species such as short in length, low GC content, low sequence conservation, and low coding potential. Furthermore, compared to protein-coding genes, Cx. quinquefasciatus lncRNAs had lower expression values, and tended to be expressed in temporally-specific fashion. In addition, weighted correlation network and functional annotation analyses showed that lncRNAs may have roles in blood meal acquisition of adult female Cx. quinquefasciatus mosquitoes. This study presents the first systematic identification and analysis of Cx. quinquefasciatus lncRNAs and their association with blood feeding. Results generated from this study will facilitate future investigation on the function of Cx. quinquefasciatus lncRNAs.


Introduction
CNCI (Sun et al., 2013). Only transcripts that were categorized as noncoding by all three softwares 125 were retained for downstream analyses. We then used BLASTX to find sequence similarity of the 126 potential lncRNA transcripts against Swissprot database. Transcripts having high sequence similarity 127 (E-value < 1e-3) were discarded. We then removed transcripts that had no strand information and 128 located within less than 2 kb scaffold-end range. 129 130

Transcript quantification and expression 131
Salmon version 0.10.1 (Patro et al., 2017) was used with default parameters to compute the expression 132 of each transcript. We combined protein-coding and lncRNA transcripts together into a single FASTA 133 file when running Salmon. TPM value from Salmon was used for downstream analysis. 134 135

Validation of lncRNAs by reverse-transcribed PCR 136
About 20 adult females of VCRU-lab strain Cx. quinquefasciatus were obtained from Vector Control 137 Research Unit, Universiti Sains Malaysia. The mosquitoes were homogenized for RNA isolation and 138 subsequent reverse transcription for cDNA synthesis. For RNA isolation, Rneasy Mini Kit (Qiagen,cat. 139 no: 74104) was used according to manufacturer's protocol. Reverse transcription was carried out using 140 iScript Reverse Transcription Supermix (Bio-Rad Laboratories, California;cat. no. 1708840) according 141 to manufacturer's protocol. We then performed PCR using the cDNA as templates. We designed PCR 142 primers that span exon-exon junction using Primer Blast (https://www.ncbi.nlm.nih.gov/tools/primer-143 blast/). Primers used in this study is listed in Supplementary Data 2. 144 145

Expression specificity 146
We used JS divergence method to compute expression specificity of lncRNAs and protein-coding 147 genes. MATLAB version R2018b was used to calculate JS score using the formula previously 148 described (Cabili et al., 2011). 149

Co-expression analysis 151
WGCNA version 1.4.6 was used to perform co-expression analysis between lncRNAs and protein-152 coding genes (Langfelder and Horvath, 2008;Zhang and Horvath, 2005). We computed variance of 153 each lncRNA and protein-coding gene, and ranked them in descending order. We only picked top 5000 154 genes for co-expression analysis. 155 156 Functional annotation 157 g;Profiler was used for functional annotation and enrichment analysis (Reimand et al., 2007). g:SCS 158 threshold was used for multiple testing correction. 159 160

Identification of Cx. quinquefasciatus lncRNAs 162
In this study, we used 31 public RNA-seq datasets which were composed of approximately 849 163 million reads as raw inputs for lncRNA prediction. Public RNA-seq datasets used in the study were 164 listed in Supplementary Data 1. Pipeline for lncRNA prediction can be found in Figure 1. Each 165 library (paired or single-end) was individually mapped against Cx. quinquefasciatus genome (CpipJ2, 166 VectorBase) using splice-aware aligner tool, Hisat2 (Kim et al., 2015). Each alignment file was then 167 used for transcriptome assembly using Stringtie, and all 31 Stringtie output files were merged using 168 Stringtie merge option (Pertea et al., 2015). We then compared our merged transcriptome assembly 169 After transcriptome assembly, we then performed stringent transcript filtering steps (Figure 1). 175 Transcripts of less than 200 bp were discarded. We only chose transcripts with class code "i", "u", and 176 "x", all of which were given by gffcompare. Class code "i", "u", and "x" denote intronic, intergenic, 177 and antisense to reference annotation respectively (https://github.com/gpertea/gffcompare). We 178 discovered that a total of 11,467 transcripts were denoted as either "i", "u" or "x", and all of them were 179 at least 200 bp in length. We then used TransDecoder (Haas et al., 2013) to predict long open-reading 180 frame (ORF) within these transcripts. Out of 11,467, 2,994 transcripts were discarded because they 181 were predicted to have ORF of more than 100 amino acids. 182 The remaining 8,473 transcripts were then evaluated for coding potential using CPC2 (Kang et 183 al., 2017), PLEK (Li et al., 2014), and CNCI (Sun et al., 2013), and those classified as "noncoding" by 184 all three algorithms were retained for downstream analysis. A total of 7,413 transcripts were found to 185 be predicted as noncoding by all three softwares. To avoid false positive prediction, we used BLASTX 186 (E-value cut-off 10 -3 ) to find sequence similarity of those 7,413 transcripts with known proteins in 187 Swissprot. The number of transcripts that have no significant BLASTX are 7,025. Finally, we removed 188 unstranded transcripts, and transcripts located within less than 2kb scaffold-end range on the same 189 strand. By applying this pipeline, we identified a total of 4,763 novel lncRNA transcripts, which 190 derived from 4,037 distinct loci in the genome. From these 4,763 lncRNA transcripts, 3,591 of them 191 were intergenic, while the remaining 569 and 603 transcripts were intronic and antisense to reference 192 gene respectively. List of lncRNAs Cx. quinquefasciatus and their corresponding genomic loci can be 193 found in Supplementary Data 3. We randomly selected 5 novel lncRNAs, and validated them through 194 reverse-transcribed PCR (RT-PCR) using specific primers that span exon-exon junction 195 (Supplementary Figure 1). 196

Cx. quinquefasciatus lncRNAs have potential roles in blood meal acquisition 247
To investigate the possible involvement of lncRNA in blood meal acquisition, we analyzed 248 transcriptome data by Reid et al. (2015), which aimed to identify genes, especially protein-coding 249 genes, that might be crucial for taking and processing the blood meal in adult female Cx. 250 quinquefasciatus (Reid et al., 2015). They generated paired-end RNA-seq libraries of seven different 251 time points after eclosion (2,12,24,36,48,60, and 72 hours) of adult females in order to examine 252 temporal gene expression of post eclosion and prior to blood-feeding (Reid et al., 2015). They showed 253 that adult females Cx. quinquefasciatus need at least 48 hour post-eclosion before they can take their 254 first blood meal (Reid et al., 2015). Therefore, prior to 48 hours, it was hypothesized that genes 255 necessary for blood meal acquisition would be differentially expressed, and after 48 hours, the 256 differentially expressed genes would be important for processing the blood meal. We hypothesized that, 257 besides protein-coding genes, lncRNAs in Cx. quinquefasciatus would also play regulatory roles in 258 blood-feeding process. 259 Weighted Gene Correlation Network Analysis (WGCNA) has always been used to predict 260 potential roles of lncRNAs in many species (Azlan et al., 2019a;Langfelder and Horvath, 2008;Wu et 261 al., 2016;Zhang and Horvath, 2005). WGCNA can be used to search for clusters or modules of highly 262 correlated genes, which in this case, correlation between lncRNAs and protein-coding genes 263 (Langfelder and Horvath, 2008;Zhang and Horvath, 2005). To predict the potential roles of lncRNAs 264 in blood meal acquisition, we used RNA-seq data from Reid et al. 2015, and performed WGCNA 265 analysis on lncRNAs and protein-coding genes (Reid et al., 2015). WGCNA analysis yielded a total of 266 21 modules, and 7 of them exhibited strong and statistically significant correlation (correlation > 0.8, 267 p-value < 0.05) with specific time points (Figure 5). Module size ranged from 22 to 611, and the 268 number of protein-coding genes and lncRNA in each module can be found in Table 1. 269 Two modules were found to be significantly associated with 2 hours post-eclosion, and both of 270 them had the among the largest number of genes within the clusters (Table 1). Functional annotation 271 using g:Profiler of these two modules revealed that lncRNAs at 2 hours had potential roles in diverse processes such as compound metabolic pathways, biosynthesis, catabolic process, and proteolysis 273 (Supplemental Data 4). At 12 hour post-eclosion, several genes such as salivary proteins and 274 cytochromes were found to be differentially expressed (Reid et al., 2015). We next asked if the same set 275 of genes with similar functions could be found in our magenta module that was highly associated with 276 12 hours post-eclosion (correlation: 0.97, p-value: 2e-4). We found that several protein-coding genes 277 within magenta module were salivary proteins and cytochromes, such as proline-rich salivary peptide 278 (CPIJ007453), threonine-rich salivary mucin (CPIJ010046), cytochrome c oxidase assembly factor 279 (CPIJ010284) and cytochrome P450 (CPIJ011837). Since 43 lncRNAs were clustered together with 59 280 protein-coding genes within magenta module, we hypothesized that this set of lncRNAs may 281 participate in the same function as protein-coding genes. Therefore, results from WGCNA corroborated 282 with previous findings (Reid et al., 2015), and suggest that lncRNA may intimately function together 283 with protein-coding genes in order to physiologically prepare adult females Cx. quinquefasciatus for 284 blood feeding and blood meal processing. 285 286

Discussion 287
In this study, we performed genome-wide identification and characterization of lncRNAs in Cx. 288 quinquefasciatus genome using a total of 31 relatively high-depth publicly available RNA-seq libraries. 289 We presented a set of 4,763 novel lncRNAs transcripts which derived from 4,037 loci in Cx. 290 quinquefasciatus genome. RNA-seq libraries used here were generated from many tissues and 291 developmental stages, making our prediction pipeline to be relatively robust and comprehensive. We 292 applied stringent filtering steps in our pipeline to reduce false positive and false negative prediction of 293 non-coding transcripts (Azlan et al., 2019a(Azlan et al., , 2019bEtebari et al., 2016;Nam and Bartel, 2012;Wu et 294 al., 2016). The filtering steps include using more than one coding potential assessment algorithms and 295 removing transcripts with no strand information (Figure 1). Taken together, lncRNA prediction 296 pipeline used in this study is comparably equivalent to previous studies in other organisms, and it is 297 Analysis of genomic features revealed that lncRNAs identified in our study showed known 298 characteristics of lncRNAs from other species. The characteristics are shorter in size, low sequence 299 conservation, and lower GC content. The majority of Cx. quinquefasciatus lncRNAs shared high 300 degree of sequence similarity with Ae. albopictus and Ae. aegypti, followed by An. gambiae and D. 301 melanogaster. This lncRNA sequence similarity pattern concur with previous phylogenetic relationship 302 study on known mosquitoes which showed that Ae. albopictus and Ae. aegypti are more evolutionarily 303 closer to Cx. quinquefasciatus than An. gambiae (Chu et al., 2018). A total of 10 Cx. quinquefasciatus 304 lncRNAs shared high sequence similarity with all insects (Figure 3b), suggesting that they are 305 evolutionarily conserved and may play vital roles in insect development or basic cellular functions. 306 Investigation on the expression across different time points post-eclosion revealed that, 307 compared to protein-coding genes, Cx. quinquefasciatus lncRNAs were more temporally specific and 308 have lower expression level. This high temporal specificity indicates that lncRNAs have a smaller time 309 window of expression that protein-coding genes. Even though lncRNAs have lower expression that 310 protein-coding genes, their temporal specificity in expression suggests that they putatively execute 311 specific biological functions at specific time point in development. 312 Furthermore, co-expression analysis by WCGNA revealed that Cx. quinquefasciatus lncRNAs 313 were significantly correlated with specific time-point of pre-blood feeding immediately after eclosion, 314 suggesting that lncRNAs may possess potential roles in blood meal acquisition by adult female 315 mosquitoes. This blood meal acquisition by adult female is physiologically important as it is not only 316 required for reproduction and mating, but it serves as a gateway for pathogens, making Cx. 317 quinquefasciatus one of the most harmful mosquito vectors on Earth. However, the data presented here 318 is simply descriptive, and it does not provide empirical evidence via experimentation on the specific 319 mechanisms of lncRNA functions in blood meal acquisition. We believed that results generated in this 320 study can be a starting point for dissecting the mechanism of Cx. quinquefasciatus lncRNAs functions. 321 In conclusion, this research provides the first comprehensive annotation and characterization of Cx. quinquefasciatus lncRNAs, and their putative roles in blood-feeding. We believed that data presented 323 here will be a valuable resource for genetics and molecular studies of ncRNAs in Cx. quinquefasciatus. 324 325 Acknowledgment 326 We would like to thank all our collaborators and colleagues for the discussion and the work conducted 327 in this lab. We would especially like to thank the Vector Control Research Unit, Universiti Sains 328 Malaysia for providing the mosquito samples. This study was funded by the Universiti Sains Malaysia Distribution of specificity score of lncRNAs and protein-coding genes. 346