New tools for diet analysis: nanopore sequencing of metagenomic DNA from rat stomach contents to quantify diet

Using metagenomics to determine animal diet offers a new and promising alternative to current methods. Here we show that rapid and inexpensive diet quantification is possible through metagenomic sequencing with the portable Oxford Nanopore MinION. Using a simple amplification-free approach, we profiled the stomach contents from wild-caught rats. We conservatively identified diet items from over 50 taxonomic orders, ranging across nine phyla that include plants, vertebrates, invertebrates, and fungi. This highlights the wide range of taxa that can be identified using this simple approach. We calibrate the accuracy of this method by comparing the characteristics of reads matching the ground-truth host genome (rat) to those matching diet items. We also suggest a means to correct for biases in metagenomic approaches that arise due to the paucity of genomic sequence in databases as compared to mitochondrial DNA or rDNA. Finally, we implement a constrained ordination analysis to show that it is possible to identify the sampling location of an individual rat within tens of kilometres based on diet content alone. This work establishes long-read metagenomic methods as a straightforward and robust approach for diet quantification. It considerably simplifies the workflow and avoids many inherent biases as compared to metabarcoding. Continued increases in the accuracy and throughput of Nanopore sequencing, along with improved genomic databases, means that this approach will continue to improve in accuracy.

Introduction data to test whether changing these default penalties affected the results (gapopen 1, 162 gapextend 1). We found that these adjusted parameters did not qualitatively change our 163 results. 164 We assigned sequence reads to specific taxon levels using MEGAN6 (v.6.11.7 June 165 2018) (Huson et al., 2016). We only used reads with BLAST hits having an e-value of 166 1x10 -20 or lower (corresponding to a bit score of 115 or higher) and an alignment length 167 of 100 base pairs or more. To assign reads to taxon levels, we considered all hits 168 having bit scores within 20% of the bit score of the best hit (MEGAN parameter Top 169 Percent). 170 Multivariate analyses 171 Multivariate analyses were done using the software PRIMER v7 (K. R. Clarke & Gorley, 172 2015). The data used in the multivariate analyses were in the form of a sample-(i.e. 173 individual rat) by-family matrix of read counts. All bacteria, rodent, and primate families 174 were removed. The majority of rodent hits were to rat and mouse, resulting from the 175 rats' own DNA (see below). The majority of the primate hits were to human sequences, 176 which likely resulted from sample contamination. 177 The read counts were converted to proportions per individual rat, by dividing by the 178 total count for each rat, to account for the fact that the number of reads varied  186 We used unconstrained ordination--specifically, non-metric multidimensional scaling 187 (nMDS) applied to the dissimilarity matrix--to examine the overall patterns in the diet 188 composition among rats. To assess the degree to which the diet compositions of rats 189 were distinguishable among the three locations, we applied canonical analysis of 190 principal coordinates (CAP) (Anderson & Willis, 2003) to the dissimilarity matrix. CAP is 191 a constrained ordination which aims to find axes through multivariate data that best 192 separates a priori groups of samples (in this case, the groups are the locations from 193 which the rats were sampled); CAP is akin to linear discriminant analysis but it can be We used Similarity Percentage (SIMPER; (K. R. Clarke, 1993)) to characterise and 197 distinguish between the locations. This allowed us to identify the families with the 198 greatest percentage contributions to (1) the Bray-Curtis similarities of diets within each 199 location (Table S3) and (2) the Bray-Curtis dissimilarities between each pair of locations 200 ( Table S4). 201 Results 202 DNA sequencing and assignment of reads to taxa 203 After DNA isolation and sequencing, we obtained a total of 82,977 reads from the traditional divisions of GenBank excluding genome survey sequence, EST, high-218 throughput genome, and whole genome shotgun 219 (ftp://ftp.ncbi.nlm.nih.gov/blast/db/README)) and the NCBI other_genomic database 220 (RefSeq chromosome records for non-human organisms 221 (ftp://ftp.ncbi.nlm.nih.gov/blast/db/README)). We used BLAST as it is generally viewed 222 as the gold standard method in metagenomic analyses (McIntyre et al., 2017). Of the 223 133,022 barcoded reads, 30,535 (23%) hit a sequence in the combined nt and 224 other_genomic database at an e-value cutoff of 1e-2.

225
As an initial assessment of the quality of these hits, we examined the alignment lengths 226 and e-values. We found a bimodal distribution of alignment lengths and a highly skewed 227 distribution of e-values (Fig. 3A). We hypothesized that many of the short alignments 228 with high e-values were false positives. We thus first filtered this hit set, only retaining 229 BLAST hits with e-values less than 1e-20 and alignments greater than 100 bp. Similar 230 quality filters have been imposed previously (Srivathsan, Sha, Vogler, & Meier, 2015). A 231 total of 22,154 hits passed this filter (Datafile S1). Mean read quality had substantial 232 effects on the likelihood of a read yielding a BLAST hit, with almost 40% of high 233 accuracy read having hits in the March dataset, as compared to 1% of low accuracy hits 234 (Fig. 3B). 235 To specifically assign each sequence read to a taxon, we analysed the BLAST results in 236 MEGAN6 (Huson et al., 2016). The algorithm employed in MEGAN6 assigns reads to a 237 most recent common ancestor (MRCA) taxon level. For example, if a read has BLAST 238 hits to five species, three of which have bit scores within 20% of the best hit, the read will be assigned to the genus, family, order, or higher taxon level that is the MRCA of 240 those best-hit three species (Huson, Auch, Qi, & Schuster, 2007). If a read matches one 241 species far better than to any other, by definition, the MRCA is that species. 242 5,334 reads (24%) were not assigned to any taxon by Megan. Of the remainder, 31% 243 were assigned by MEGAN as being bacterial. 55% of these were Lactobacillus spp. 244 These results match previous studies on rat stomach microbiomes, which have found 245 lactobacilli to be the dominant taxa (Brownlee & Moss, 1961;Horáková, Zierdt, & 246 Beaven, 1971;Li et al., 2017;Maurice et al., 2015). Plant-associated Pseudomonas and 247 Lactococcus taxa were also common, at 7% and 6%, respectively.

248
MEGAN assigned reads to a wide range of eukaryotic taxa. To conservatively infer 249 taxon presence, we first reclassified MEGAN species-level assignments to the level of 250 genus. However, after this, many clear false positive assignments remained (e.g. hippo 251 and naked mole rat). These matches were generally short and of low identity. To reduce 252 such false positive taxon inferences, we used information from reads assigned to the 253 genera Rattus (rat) and Mus (mouse). We inferred that the reads assigned to Rattus 254 (2,696 reads in total) were true positive genus-level assignments and that the reads 255 assigned to Mus (2,798 reads in total) were false positive genus-level assignments (and 256 not true positive Mus-derived reads). Although rats are known to prey on mice 257 (Bridgman, Innes, Gillies, Fitzgerald, & King, 2013), if this had occurred, we would 258 expect that (1) the ratio of mouse to rat reads would be higher in the subset of rats that 259 had predated mice; (2) in those same rats, the percent identity of the reads assigned to the ratio of mouse to rat reads was similar for all rats. In addition, there was no 262 evidence of higher percent identities for Mus reads from rats that had higher ratios. 263 Notably, the mean percent identity values of the best BLAST hits for Rattus and Mus 264 reads differed substantially, with Rattus reads having a median identity of 86.4%, and 265 Mus 81.0% (Fig. 4A). The mean percent identity for Rattus reads corresponds very well 266 to that expected given the mean quality scores of the reads (assuming the true 267 sequence of the read is 100% identical to Rattus, 86.4% identity corresponds to a 268 mean quality score of 8.7; Fig. S2A-C). There was also a clear difference in the 269 alignment lengths: the median ratio of alignment length to read length was 0.57 for 270 Rattus and 0.52 for Mus (Fig. 4B). We note that read identity and the ratio of alignment 271 length to read length are positively correlated ( Fig. S2G-I). There is little correlation 272 between read identity and alignment length alone ( Fig. S2D-F). 273 Importantly, the majority of diet items have percent identities that overlap with the 274 Rattus reads, and alignment length to read length ratios that often exceed the Rattus 275 reads. This suggests that many diet taxa assignments are correct down to the level of 276 genus (as the Rattus-assigned reads are correct to the level of genus). However, to 277 further decrease false positive taxon assignments of diet items, we implemented cut-278 offs based on the characteristics of the Mus-and Rattus-assigned reads. For genus-279 level assignment, we required at least 82.5% identity and an alignment length to read 280 length ratio of at least 0.55. These cutoffs exclude 88% of the reads falsely assigned to 281 Mus, instead assigning them correctly to one taxon level higher, the Family Muridae.
For family-level assignments, we required 77.5% identity, an alignment length to read 283 length ratio of at least 0.1, and a total alignment length of at least 150 bp. Using higher 284 cutoffs for the ratio of alignment length to read length excluded a large number of likely 285 true positive taxa for which only short mtDNA or rDNA database sequences were 286 present in the databases. For all other read-to-taxon assignments, we placed the read 287 at the level of Order, or used the taxon level assigned by MEGAN. Using these cutoffs,

288
16% of all reads were classified at the Genus level; 71% were classified at the Family-289 level or below; 89% were classified at the Order-level or below; and 98% were classified 290 at the Phylum-level or below. to genomic DNA. By quantifying this fraction for species with complete genome 326 sequences in the database and species without complete genomes we aimed to assess 327 the effects of this bias.

328
For the majority of animals with sequenced genomes in the database, we found that the 329 fraction of reads that mapped genomic sequence ranged from 61% (Gallus) to 73% 330 (Rattus) to 100% (Coturnix and Numida) (Fig. 7). We hypothesise that this variation is 331 likely due to the type of tissue sequenced. For Rattus the sequenced tissue was 332 primarily stomach muscle, which has a relatively high fraction of mtDNA; for Coturnix 333 and Numida it may have been eggs. For plants with sequenced genomes, the fraction of 334 reads matching genomic sequence was generally higher: between 88% (Zea) and 98% 335 (Cenchrus).

336
In contrast, for genera with little or no genomic sequence in the database, the vast 337 majority of matches were solely to mtDNA, rDNA, or microsatellite loci: 90% of Phoenix 338 (date palm) hits; all Helix (snail); and all Rhaphidophora (cave weta) hits. All Artioposthia 339 (New Zealand flatworm) hits were to rDNA. These results indicate that for genera with 340 no genomic sequence data, we have underestimated the actual number of sequences 341 from that taxon by approximately three-to twenty-fold (for animals and plants, 342 respectively). It is difficult to determine how these numbers correlate with biomass.

343
Close examination of the sequence classification data suggested that specific families 344 (and orders) were overrepresented in the diets of rats from particular locations. For 345 example, six out of eight rats from the native estuarine bush habitat (OB) consumed consumed Phaseanidae were from the native estuarine habitat (OB). All five rats that 348 consumed Solanales were from the restored wetland area. These patterns suggested 349 that it might be possible to use diet components alone to pinpoint the habitat from which 350 each rat was sampled.

nMDS and CAP analysis by location 352
In order to determine if diet composition of the rats differed consistently between 353 locations, we first performed an unconstrained analysis using nMDS on taxa assigned at 354 the family level. Using family rather than order or genus provides a balance between 355 how precisely we identify the taxon of diet item (genus, family, order), and whether we 356 assign a taxon at all. While family-level assignments are less precise than genus-level, 357 only 16% of all reads were classified at the genus level, while 71% were classified at the 358 family level.

359
The family-level unconstrained ordination (nMDS) showed no obvious grouping of rats 360 with respect to the locations (Fig. 8a), indicating that locations did not correspond to the 361 predominant axes of variation among the diets. However, a constrained ordination 362 analysis (CAP) identified axes of variation that distinguished the diets of rats from 363 different locations (Fig. 8b). We found that the CAP axes correctly classified the 364 locations of 19 out of 24 (79%) rats using a leave-one-out procedure. The families 365 having the largest correlations with the first two principal coordinates, and most 366 responsible for the separation between groups, were primarily plants: Arecaceae, Podocarpaceae, Piperaceae, and Pinaceae. In addition, insect groups (Cerambycids 368 and Formicids) and birds (Phaseanidae and Numididae) played a role (Fig. 8c). 369 The families driving similarity within the three locations (i.e., had the greatest within-370 location SIMPER scores) varied among locations. LB had average Bray-Curtis within-371 location similarity of 13%; mostly attributable to Hymenolepidae (accounting for 51% of 372 the within-group similarity), Solanaceae (11%), and Fabaceae (11%). The average 373 similarity for OB was 21%, with the greatest contributing taxa being Arecaceae (33%), 374 Poaceae (23%), Fabaceae (9%), and Phasianidae (8%). The average similarity for WP 375 was 24%, with the greatest contributing taxa being Poaceae (72%) ( Table S4). 376 Discussion 377 Accuracy and sensitivity 378 Here we have shown that using a simple metagenomic approach with error-prone long 379 reads allows rapid and accurate classification of rat diet components. We expect that 380 this technique can be used to infer diet for a wide variety of animal and sample types, 381 including samples that use less invasive collection methods, such as fecal matter. The 382 sensitivity of this approach will likely improve as the accuracy and yield of Oxford 383 Nanopore sequencing increases. The analysis here is based on less than 200,000 reads 384 from two flow cells. The rapid improvement of this technology is such that current 385 yields are often far in excess of two million reads per flow cell. The method will also 386 improve as the diversity of taxa in genomic sequence databases increases. Several 387 aspects of the data support this. 388 First, we note that we did not find BLAST hits for the majority of reads. This is partially 389 due the relatively low accuracy of the Oxford Nanopore sequencing platform at the time 390 these data were collected (approximately 87%). However, the fraction of reads yielding 391 hits in the database increased substantially for higher quality reads, approaching 40% 392 for very high quality reads (Fig. 3b). Other factors also likely reduce the numbers of 393 BLAST hits, such as the paucity of genome sequence data for many taxa. This is 394 convincingly illustrated by comparing across taxa the fraction of genomic hits to 395 mitochondrial or rDNA sequence hits. considerably decreases diet resolution given that for some taxa, only a small percentage 420 of sequence reads derive from the mitochondria as opposed to the nuclear genome.

421
It is also important to note that our interest in diet also includes resolving relative 422 biomass and relative numbers of each prey species, neither of which necessarily 423 correlate well with the amount of DNA (either mitochondrial or nuclear) purified from a 424 sample. Even a simple correction for the fraction of reads matching mitochondrial versus 425 nuclear genomes is difficult, as different plant and animal tissues differ considerably in 426 the relative amounts of mitochondrial versus nuclear DNA (e.g. leaf versus fruit).
We found that rats consumed many soft-bodied species (e.g. mushrooms, flat worms, 429 slugs, and lepidopterans) that would be difficult to identify using visual inspection of 430 stomach contents. Achieving data on such a wide variety of taxa would be difficult to 431 quantify using other molecular methods, as there are no universal 18S or COI universal 432 primers capable of amplifying sequences in all these taxa. While it might be possible to 433 use primer sets targeted at different phyla or orders, quantitatively comparing diet 434 components across these using sequences amplified with different primer sets is 435 extremely difficult due to differences in primer binding and PCR efficiency.  (Kamenova et al., 2017). Finally, we suggest that our approach of standardising the read 448 counts by sample, followed by an optional transformation such as square root and dissimilarity-based multivariate ordination, offers a useful analytical pipeline for 450 analysing metagenomic diet-composition data. 451 We note that modifications to our approach might further increase the precision of our 452 ability to infer community composition. Any error-prone long read dataset (i.e. PacBio or 453 ONT) has both short (e.g. 500 bp) and long (e.g. 5000 bp) reads, as well as high quality 454 (e.g. mean accuracy greater than 90%) and low quality (e.g. mean accuracy less than 455 80%) reads. When inferring community composition, a null expectation is that taxa 456 should be equally represented by long, high quality reads as they are by short, low Here we have shown that a rapid error-prone long read metagenomic approach is able 465 to accurately characterise diet taxa at the family-level, and distinguish between the diets 466 of rats according to the locations from which they were sourced. This information may 467 be used to guide conservation efforts toward specific areas and habitats in which native 468 species are most at risk from this highly destructive introduced predator.