Changes in the microbial community of Lubomirskia baicalensis affected by Red Sponge Disease

Sponge diseases occur globally and the resulting reduction of sponge populations has negative effects on other organisms within the ecosystems due to loss of nutrient enrichment and loss of bioremediation. In Lake Baikal, the predominate sponge species Lubomirskia baicalensis is currently being infected with an unidentified pathogen resulting in a sharp decline in population. The current hypothesis is that the recent increase in methane concentration in the lake has caused dysbiosis within the bacterial community of L. baicalensis resulting in the disease outbreak. In this study we investigated the changes in the bacterial community between healthy and sick sponges using 16S bacterial profiling targeting veritable regions 3-5. Here we present data that the bacterial communities of the healthy sponge samples were significantly different from sick samples and several poorly classified organisms were identified by Indicator Species Analysis as significant. Organisms identified from the sick samples classified within taxonomic units that contain acidophilic bacteria which suggest pH may play a role. There was also an observed decrease in the number of identified methyltropic bacteria present in the sick sponge samples compared to the healthy.

INTRODUCTION Lake Baikal in Siberia is the largest freshwater lake by volume on the Earth and is the source of 20% of all free fresh water. It is a rift lake formed by extensional tectonics between the Eurasian and Amurian plates. The lake reaches a depth of 1600 meters with a sediment layer up to 7,000 meters between the lake bed and rift bed. It is thought to be approximately 25 million years old and contains a vibrant and complex ecosystem.
The first reports of a disease outbreak in the endemic freshwater sponge Lubomirskia baicalensis of Lake Baikal were in 2016 (Denikina et al., 2016).
Diseased sponges show atypical pink lesions suggesting this is a Red Sponge disease. Prior to this, there was no historical record of disease outbreaks in this sponge population. However, disease outbreaks and the corresponding loss of sponge populations have occurred in other ecosystems and can have severe consequences for the ecosystem. Sponges are filter feeders, most consuming microorganisms in their environment. They also play roles in nutrient cycling in the water column and are an important element in the food web of aquatic ecosystems (Bell, 2008).
This focus on sponges and their ability to consume microorganisms is becoming an important topic in research. Put into the context of human pollution 2 of ecosystems, an organism able to consume bacteria can play a powerful role as a bioremediator. Sponges can function by consuming microorganisms to protect important aquatic systems from toxic plankton blooms and fecal bacteria from sewage runoff entering the water (Milanese et al., 2003). They also can be directly applied in fisheries to treat harmful microbial byproducts of industry (Zhang et al., 2010).

Sponge Bioremediation
A study of the subtropical Atlantic Ocean found bacterial loads of 2.3 x 10 5 per ml and virus-like particle loads of 5.58 x 10 6 per ml in the epipelagic layer (De Corte et al., 2010). Sponges can directly utilize both bacteria and virus-like particles as food sources, resulting in reduction in microbial loads (Bell, 2008).
Research conducted in the Florida Bay linked cyanobacteria and phytoplankton blooms with large scale loss of sponge populations (Bradley et al., 2006). This is significant because phytoplankton blooms can have many effects, ranging from fish death due to decreased oxygen levels to loss of sea grasses via a light reduction mechanism to toxic blooms that can affect humans and fisheries (Bradley et al., 2006). This concept of bioremediation has been further explored in recent aquaculture studies. One study investigated the use of Hymeniacidon perlevis to 3 reduce bacterial loads in a commercial Scophthalmus maximus fishery (Zhang et al., 2010). In this study H. perlevis was able to reduce microbial levels of fecal coliform, vibro, and other bacterial species in effluent from the fishery. Another study used a controlled laboratory environment to quantify the filtration capacity of Chondrilla nucula as up to 7x10 10 Escherichia coli cells per hour per square meter of sponge mass (Milanese et al., 2003). Microbial filtration is not a species directed mechanism. Sponges will deplete the surrounding water of all microbial organisms as they feed (Patterson et al., 1997). While sponges are able to recognize symbiotic bacteria, they do not selectively eat one non-symbiotic species over another (Nguyen et al., 2014). So in the S. maximus fishery example, the waste waters were rich in microbes from fish excrement, presumably if the same sponges were cultivated near the waste output of a brewery, they would remove yeast or other microbial effluent with similar efficiency.
The relationship between sponge bioremediation and microorganism loads in aquatic systems drives the importance of understanding the nature of sponge disease. Identification of the causative agent(s) for the sponge disease will describe the type of environmental pollution that needs to be rectified to curb the disease. A simple example would be if a microbe identified as a causative agent is also in untreated water from a sewage treatment plant. Then repairs or upgrades to the plant would resolve the disease outbreak.

Historical Sponge Disease
Sponge disease is a global occurrence given that; sponges are found in aquatic ecosystems across the globe. Like diseases in other organisms, sponge diseases are quite varied in appearance and progression. Numerous studies have been conducted trying to identify the causative agent(s), and the results describe a complex system between sponges, microbes, and the surrounding environment.
Initial studies focused on identifying specific microorganisms responsible for disease, viewing diseases in Porifera as mechanistically similar to disease in humans with pathogen A causing disease A. For example, disease in Rhopaloeides odorabile in at the Great Barrier Reef was associated with a novel α-proteobacteria (Nicole et al., 2002). In the Red Sea, microbiota recovered from diseased sponge tissue at two sites 30 km apart was assayed. Both sites showed colonization of verrucomicrobia despite their separation, indicating one potential causative agent (Gao et al., 2015).
Alternately, disease may be caused by a more global microbial shift in the ecosystem. Nutrient enrichment of an ecosystem, such as coral reefs, may be the direct cause of disease (Vega Thurber et al., 2014). While that study focuses on coral reefs, a similar situation could occur in sponge disease. Since bacterial replication is limited by nutrient availability, if nutrient constraints are lifted the 5 bacterial species will replicate until secondary mechanisms limit their growth (Schaechter et al., 1958). This change in microbial population, where formerly low concentrations of potentially pathogenic bacteria increase in abundance, may cause outbreaks of disease (Gochfeld et al., 2012). These would be situations where not a single bacterium is the causal agent, but a superinfection is responsible for disease. A homologous situation in humans would be a superinfection of Epstein-Barr Virus and the malaria parasite. When both infections occur at the wrong time, there is a synergistic inhibition of the patient's immune system resulting in death (Matar et al., 2015). While both pathogens cause distinct diseases with different symptoms, the co-infection results in a specific morbidity and mortality that only occurs during co-infection.
This concept of multiple causative agents can be addressed using massive parallel sequencing and other bacterial diversity techniques. Sponge White Patch disease affecting the Caribbean Sea sponge Amphimedon compressa, originally thought to be caused by the introduction of a sponge boring bacteria, was re-evaluated utilizing these techniques. Diseased sponges had a different microbiota than healthy sponges, with some of the species detected previously implicated in other sponge and coral diseases (Angermeier et al., 2012). This result suggests that multiple causative agents may be playing a role in this disease. Both studies were conducted in the Caribbean Sea, 6 allowing for direct links between nutrient enrichment, microbial proliferation and disease.
A sponge disease outbreak off the coast of Papua New Guinea illustrates this multiple organism mechanism of disease. Five bacterial species were isolated from diseased tissue and used to inoculate healthy sponges under laboratory conditions . Using each of the five bacterial isolates alone to inoculate healthy sponge tissue did not cause disease. It was only when all five bacterial isolates were combined into one inoculum that the same disease was observed . In the sponge disease Sponge Necrosis Syndrome, two bacterial and four fungal species were identified in the diseased tissue. It was found that inoculation of healthy sponge tissue with a mixture of one bacterial species and one fungal species together was the minimum required to cause disease (Sweet et al., 2015).
Nutrient and microbial pollution may not be the only factors in disease outbreaks. Several studies linked increasing water temperatures from climate change to disease outbreaks (Blanquer et al., 2016;Cebrian et al., 2011).
Interestingly, one commonality in sponge disease outbreaks is environmental change. This can be either through a stress mechanism on the sponge weakening the immune system or a temperature dependent sponge-symbiote interaction. An example of this in L.baicalensis is its symbiotic relationship with a dinoflagellate species which regulates the heat shock protein pathway as a 7 defense against cold (Müller et al., 2007). If warming of the lake would disrupt this symbiotic relationship, there would be sponge stress or death later when temperatures drop in winter.
Since this project is focusing on the disease of L.baicalensis, the environment of Lake Baikal must be taken into account. As direct access to the lake for water testing is not possible; this study can at best use the changes in microbiome to describe ecological changes that are promoting this disease.

Environmental Factors
Lake Baikal reaches depths over 1.5 km with a sediment layer approximately seven km deep. This sedimentary layer is rich in methane and other hydrocarbons that escape into the water and contribute to its methane load.
This can be directly observed by oily sheens on the surface of the lake and methane gas bubbling to the surface (Kapitanov et al., 2007;Schmid et al., 2007).
Several long-term studies of Lake Baikal have described the effects of climate change on this ecosystem. Over the last 60 years there has been an average increase in surface temperature, down to depths of 25 m, of 0.2 o C per decade (Hampton et al., 2008). The greatest temperature increases have been during the summer, 0.38 o C per decade, and these changes are large enough to impact winter temperatures despite it being a sub-arctic lake (Hampton et al., 2008). These changes in average temperature are not linear but have cycles of warming and cooling with data from the last decade indicate the start of a new warming period (Hampton et al., 2011). This warming, especially during winter, can place stress on organisms such as L. baicalensis that have adapted well to the seasonal variation in lake temperatures. From 1999 to 2015 Lake Baikal and the Selenga River, the main tributary of Lake Baikal, has seen a reduction in rainfall ranging from 30 to 14 mm compared agaist a wet period from 1980 to 1998 (Dabaeva et al., 2016). This has seen both a reduction in lake water level to the lowest levels in 100 years and an increase in forest fire activity in the Lake Baikal basin (Dabaeva et al., 2016).
Methane concentration in Lake Baikal has been increasing (Zakharenko et al., 2015). There are two mechanisms proposed to explain this increase: either decreasing lake levels are reducing the pressure on the sediment layer or the increasing temperature of the lake is allowing for more disassociation of gas hydrates into the water (Granin et al., 2012;Zakharenko et al., 2015). It is also possible that both mechanisms are acting synergistically. This increase in methane concentration is hypothesized to be a contributing factor to the disease outbreak in L. baicalensis (Denikina et al., 2016).
Tourism is playing an increasing economic role to the economies surrounding Lake Baikal. In 2013 the Agency on Tourism for the Irkutsk reported 9 over 1,000,000 Russian and foreign visitors (Kirillov et al., 2014). Listvyanka is one of the primary tourist destinations on the western shore of the lake and part of one of two special economic zones designated in the Baikal Valley.
Insufficient sewage processing infrastructure has resulted in green algae growth in the waters off the coast of Listvyanka (Kravtsova et al., 2014). This is a significant point source of nutrient and microbial pollution. This increased nutrient availability causes an increase in algae growth (algae blooms). When an algae bloom ends the aerobic decay of the algae results localized oxygen depletion. This in turn kills off aerobic species in the affected region.
There are also industrial factors to consider. There are plans to convert a recently closed pulp mill on the southern shore of Lake Baikal into a tourist resort center (Staff, 2014). The mill site still houses over six million tons of industrial waste that is awaiting remediation. The primary tributary into Lake Baikal, the Selenga River flows through several cities and industrial sites and also may contribute to pollution in Lake Baikal. For example, extraction of coal, gold, and wolfram along with refining sites along the Selenga River result in phosphorous and cyanide compounds entering the water system (Kasimov et al., 2017).
Each of these environmental factors could be related to the recent disease outbreak. Some of the factors would correlate to specific changes in bacterial composition in Lake Baikal. Both fecal waste contamination and increased methane concentrations result in altered microbial populations. If those populations are changing in the microbiome of sick and healthy sponges, it implicates specific environmental changes with the disease mechanism.

Microbiome Profiling
Taxonomic identification of complex multi-cellular organisms classically uses morphological features and a dichotomous key. There are some drawbacks to this approach as features can change based on season, or life cycle stage, or among isolated populations of the same species. The dichotomous key itself may be limited or overly complex (Walter et al., 2007).
For example, with bacteria the best characterized organisms are those of medical or economic significance with other organisms being less wellcharacterized (Emerson et al., 2008). .
Classically, bacteria are identified by a combination of morphology and metabolism (Emerson et al., 2008). This approach requires single organisms to be isolated from samples and then grown clonally to sufficient number for analysis. Once sufficient copy numbers have been reached, samples can be serially plated on a range of diagnostic culture media to characterize metabolism.
Samples can be mounted on microscope slides and stained against a number of phenotypical traits. The Identification is then done with a dichotomous key using the sum total of the previous results. There are a number of drawbacks to this approach. Initial culturing may be difficult as the nutritional requirements and optimal environment for the bacteria is unknown until identification is complete.
Also, some of the critical characteristics may be coded on transposable elements which can result in misidentification.
These drawbacks can be best illustrated with the historic classification of Shigella as a separate species from E. Coli. Shigella, through expression of shigatoxin, causes dysentery and is a significant human pathogen. Escherichia.
coli is also an intestinal colonizer; but depending on horizontal gene transfer, it can have varying levels of pathogenicity. Classically, due to morphological and metabolic differences, Shigella and E. Coli were taxonomically distinct. It wasn't until technological advancements allowed for rapid and lower-cost whole genome sequencing that a wider analysis of both organisms was made, and it was found that Shigella is a clade of E. coli that has acquired pathogenic elements from horizontal gene transfer (Pettengil et al., 2016).
Genetic approaches offer alternatives to standard phenotypical taxonomic classification. In general, this approach relies on a DNA sequence that has a low degree of variability among members of the same species, but significant differences at a species or genus level. This sequence can then be amplified by polymerase chain reaction (PCR), sequenced, and then compared against reference databases for a species identification. The ribosome RNA genes are currently one of the principle genes targeted for phylogenetics (Woese et al., 1990). But there are several options with rich databases that can be used. For complex multicellular eukaryotic organisms, the mitochondrial cytochrome c oxidase subunit 1 (COI) is often used (Hebert et al., 2003). For bacterial identifications, the 16S rRNA subunit is used. Fungal identification targets the two internal transcribed spacers (ITS) located between the 18S and 5.8S and 5.8S and 28S genes. For general eukaryotic identification, the 18S rRNA subunit, itself a homologue to the 16S in bacteria, is used. The 16S and 18S rRNA genes contain multiple constant and variable regions, with the constant regions being highly conserved across all phyla of bacteria and the variable regions being conserved at only a genus or species level.
The advent of next generation sequencing (NGS) technologies allow for a mixed pool of DNA to be sequenced concurrently. The drawback to this parallel sequencing is dramatically reduced read lengths when compared against Sanger-style sequencers or third generation NGS platforms. The Illumina MiSeq used in this study offers a maximum read length of 300 bp per direction.
Sequencing of amplicons larger than 300 bp is possible. There is a 600 bp theoretical maximum reading window, but a practical limit of 500-520 bp allows enough overlap to ensure alignment between the forward and reverse read. This size constraint plays a role in primer selection or primer design for microbiome profiling.

13
The full length rRNA genes are too large to be sequenced with a single primer pair on the Illumina platform. The 16S rRNA gene is over 1700 bp, and a typical primer pair would span two variable regions. To sequence the entire gene, multiple reactions would need to be run generating overlapping amplicons, and the resulting data would need to be aligned to recover the full length sequences. For microbiome profiling, a relatively small amplicon covering at most two variable regions is sufficient to recover organism data down to the genus or species level (Wang et al., 2014). However, not all variable regions will yield the same results. The specific variable regions still need to be considered, as different variable regions can yield differences in data (Rintala et al., 2017). In this study, it was found the choice of primer had the largest impact on the type of data, even more than the DNA extraction method. Because of this, multiple primers and variable regions were used in this study.
14 HYPOTHESIS 1) Changes in the microbiome, rather than a single agent, is linked to the outbreak of the pink lesion disease occurring in the L.baicalensis outbreak in Lake Baikal.
2) The microbiome changes, based on metabolic choice or organism type, will be indicative of an environmental change.
Objectives 1) Evaluate four sets of primers used in high-throughput sequencing for their ability to detect bacterial, fungal and eukaryotic populations of the L. baicalensis microbiome.
2) Determine the composition of the microbiome in healthy sponges.
3) Determine the composition of the microbiome in diseased sponges.

4) Analyze the recovered microbiome data to see what variations exist
between healthy and diseased sponges. 5) Is there an increase in methanotrohps in sick sponges compared to healthy sponges.

Samples
There were two sets of samples used in this study. The first set was three DNA samples extracted from sponge tissue collected at the Lake Baikal Research Station, part of Irkutsk State Technical University located in Siberia, Russia, and kindly provided by our collaborators. These samples were DNA extracted directly from sponge tissue, so they contained a mixture of both sponge DNA and microorganism DNA. Two of these samples were collected from healthy sponges prior to the reported outbreak of disease (PI and PII). The third sample was collected after the outbreak from a sponge that displayed the red phenotype indicative of disease (PIII). Three additional samples were provided as FASTA files sequenced with the Roche 454 system (Roche, currently discontinued). These three samples were collected after the disease outbreak and were DNA extracted from sponge tissue and amplified using a bacterial V3-V4 primer. One of these samples was from a healthy sponge (Healthy), another from the same sponge that sample PIII was collected from (PIII-454), and the last a sample of diseased sponge tissue cultured in the laboratory and rescued with anti-microbial treatments (Rescue). A general summary of all samples is included in Table 1.

Primers
Based on information from prior studies of sponge disease, four primers sets were selected to identify both organisms in both the bacterial and eukaryotic domains ( Table 2). The rRNA variable regions 3, 4 and 5 (V3, V4, V5) are the most commonly sequenced. A review comparing data recovered from a single variable region to the entire 16S sequence identified V4 as yielding the most representative data (Yang et al., 2016). Due to the constraints of the maximum read size of the amplicon, only one adjacent variable region was sequenced with V4 in a single reaction run. The 357wF/785R primer set was selected. This amplified V3 and V4. Because our samples contained a mixed population of sponge and microorganism DNA, several primer sets were selected to profile the eukaryotic species. The ITS1F/ITS2aR primer set was selected to classify the fungal community. A specific fungal primer set was used rather than simply a general eukaryote primer because the ITS genes are the target for fungal barcoding. This allows identification of fungal organisms through the fungal barcoding database. This is a specialized database containing only fungal sequences, so any sponge sequences should not be returned when classifying Table 1 Summary of samples. Samples sharing the same name were collected from the same organism. Disease state indicates if the sampled organism was healthy or diseased at the time of collection. Sequencing platform refers to the platform used to generate sequencing data. The primer sets used for sequencing are indicated by their names from RTL Genomics assay listing (http://rtlgenomics.com/s/Amplicon-Diversity-Assay-List.pdf) when appropriate. Amplicon is the brief description of the region of the gene amplified. The last primer set selected was the 515yF/926pfR. This primer is nominally a V4-V5 16S primer, but it was also described as a universal primer in a study using DNA extracted from sponge tissue (Wang et al., 2014). Universal primers amplify rRNA sequences from all three domains, potentially reducing costs of follow-up sequencing projects by up to 2/3rds, as a single reaction will yield data for bacteria, archaea, and eukaryotes.

Table 2
Primers used for microbiome profiling. All primer sequences run 5' to 3'. The type refers to the target organism for the primer. Region refers to the rRNA regions amplified by the primer set

Amplification and Sequencing
Samples were submitted to RTL Genomics of Lubbock, Texas, (rtlgenomics.com) for amplification and sequencing. Each of the three samples was amplified and sequenced using each of the four primer sets ( Table 2).
Sequencing of the amplicons was conducted on the Illumina MiSeq. Data output was two FASTA files.

Cleaning of FASTA Files
Sequencing data was received as FASTA formatted files. The first several nucleotides from each sequence downstream from the primer represent a barcode region of typically six to eight nucleotides which is used to generate the label for the header. The example below illustrates the general format. Here, the nucleotides in parenthesis would correspond to the barcode.

>SAMPLE DATA HEADER<br> (GCAATTAA)ATTAACNGYCCDVAT
Several steps need to be taken to clean the raw FASTA data prior to any analysis. The barcode region, which is not part of the amplicon, needed to be removed from all samples, as this sequence will not align with any organismal database. Also, the header needed to be converted to a meaningful sample name as this will be the eventual name of the data set containing all of the sequences for that sample. This was done in Python using a script that parses the beginning of each line of the FASTA file. Once loaded as an object, the header was truncated to "Sample.forwardprimer-", e.g. PI.357wF by deleting everything after the first "-" symbol until the line break. The barcode regions 20 were removed by deleting the first eight characters following the line break if they corresponded to the barcode sequences provided with the data files. Each cleaned line of FASTA data was then combined in a new file. The two-file format was preserved with one containing the bacterial and universal primers since both are 16S rRNA primers and one eukaryotic file with the ITS1 and 18S rRNA primers. Sequencing data provided to us by our collaborators was already cleaned and handled as a separate file.

Classification
Cleaned FASTA files were classified using the Ribosomal Database Project's RDP Classifier (Wang et al., 2007). There is a web-based application (http://rdp.cme.msu.edu/classifier/classifier.jsp) which is suitable for short sets of sequences and a downloadable Java environment classifier (http://github.com/rdpstaff/RDPTools) used for the full length files. This is a program that will align FASTA data to one of several reference databases of rRNA sequences and return taxonomic information along with a confidence level. NMDS uses an alternate approach that is iterative. After the number of axes is selected, typically two or three for visualization restrictions, distances between samples are determined, and the resulting fit of the data is calculated. This is repeated multiple times until the best fit, which minimizes a value called stress, is reached or the permutation limit is reached. Stress is reported as a fractional number with 0.3 random, 0.1 an acceptable fit and values less than 0.05 are considered good fits. However, exceptionally low stress values that are reached rapidly may indicate that sample sizes are too small for the method 24 selected. This becomes obvious when looking at the ordination as many samples will be overlapped.
Multiple distance formulae were tested, including Euclidian ( = , and Raup-Crick ( = (1 − ( )). Both twodimensional and three-dimensional solutions of the NMDS were generated to compare differences in distribution. There was a decrease in stress as the number of dimensions was increased.
The resulting ordinations will be evaluated to determine what best represented the data. Once an ordination technique is selected, it was tested against various descriptive factors to see which, if any, could be responsible for driving the differences between samples. These descriptive factors were assembled into a matrix and include the following values: sequencing platform, organism the sample was collected from, primer, if the sample was collected before or after the disease outbreak was first observed (Table 3). These variables were fit to the resulting NMDS using the envfit function which returned a p value for statistical significance and visualized with a joint plot linking samples with the same environmental variables together. Several methods were used to explore the taxa that drive the differences between the samples. First, species data was added to an NMDS. This visually shows the taxa that are driving the position of the data points in ordination space.
Second, Indicator species analysis (ISA) was used to determine the statistically significant species that make up different groupings of data. This technique uses a prior clustering of the samples, which was determined by the multivariate analysis, to determine which taxa drive the differences between meaningful sample groups. In short, this technique calculated the relative abundance and frequency of taxa in the samples, and then Monte Carlo permutations were run on the data to calculate the statistical strength of the indicator for each species.
The third method used was two-way indicator species analysis (TWINSPAN).

26
This is a divisive method using reciprocal averaging to determine a division between two groups of samples and then repeated iteratively until a threshold is met. When constructing the final table taxa that occurred within a single sample were excluded.

Taxonomic Data
A total of 217 unique taxa were recovered from the samples amplified with the general bacterial primer set, the universal primer set, and the Roche-454 data provided by our collaborators (Appendix, Table 1).
Fungal taxonomic data, using both fungal datasets in RDP classifier, classified samples into the kingdom Metazoa or subphylum Ecdysozoa and for all samples. The eukaryotic classification using the SILVA database also failed to yield any meaningful taxonomic data outside of the phylum Porifera. BLAST was used to validate these negative results by submitting random subsets of FASTA sequences from this data file and either excluding Porifera (taxid:6040) or including fungi (taxid:4751) to constrain the alignment. There was no improvement in classification, thus sequencing data for both eukaryotic primer sets was excluded from further analysis. In addition, no nonbacterial taxonomic data were recovered from the 515yF/926pfR primer set described as universal.
Samples and Sample Nomenclature

28
The remaining samples were used for further analysis. Samples were identified either by the organism they were collected from (PI, PII, PIII) or by a general descriptor (Healthy and Rescue), along with either the forward primer (357wF or 515yF) or the sequencing platform used (454). A summary of the samples is provided in Table 4.

Table 4
Sample names and a general summary of the health of the organism the sample was collected from, the sequencing platform and the variable region amplified for sequencing. The Y-axis is dissimilarity, with a lower dissimilarity value indicating higher similarity.
Comparing only samples PI, PII and PIII sequenced on the Illumina platform, they clustered into three groups of two ( Figure 1A). One cluster was by sample, PIII, which showed the least overall dissimilarity between the two primers. Interestingly, the other two clusters were grouping by primer with PI-515yF being most similar to PII-515yF and PI-357wF grouping with PII-357wF.
However, there was little similarity between clusters with the PIII being >70% dissimilar to the 357wF cluster and the 515yF cluster being an outlier. This unexpected hybrid result with the healthy samples grouping by primer rather than sample could be due to the healthy samples having more diverse microbiomes than the sick samples. With a more diverse microbiome the V3-V4 and V4-V5 primer sets could then recover different sets of taxonomic data resulting in the observed clustering. Another explanation is that a labeling error occurred. The latter explanation was ruled out by having two samples (PI_2-357wF and PII_2-357wF) re-sequenced with the 357wF/785R primer set. The re-sequenced samples were >80% similar to the initial samples, and were within the same cluster as the original samples meaning that this group is indeed most similar based on primer ( Figure 1B). By extension, this means the initial cluster pattern was correct ( Figure 1A). Lastly, both initial data sets were combined ( Figure 1C).
The data from the Roche 454 platform showed a low amount of similarity between the sick (PIII) and healthy samples. The rescue sample appears as an outlier to this group. There was little overlap between the two groups based on sequencing platform.

Species Richness
Species richness, the number of unique taxonomic groups in each data

Multivariate Analysis
The PCA determined the primary axis (x-axis) accounted for 31.6% of the sample variation with 16.2% accounted for in the second axis (y-axis). This separated the data points into two major groups along the primary axis. The laboratorytreated rescue sample was an outlier to the closely-related samples collected from untreated sponges. Variation in the second group was primarily described by the second axis with the PIII-357wF and PI-357wF samples as outliers to this group ( Figure 3). Multiple different distance methodologies were used to construct NMDS plots to evaluate which best represented the data both on final stress value and clustering of samples ( Figure 4). The Euclidean distance method resulted in a low stress ranging from 0.2 to 0.0414, below the 0.05 A three-dimensional solution using the Bray-Curtis dissimilarity was performed ( Figure 4d). This did reduce stress to being consistently under 0.05 A B D C Figure 6. An NMDS using Bray-Curtis distance with a joint plot based on disease state of the organism DNA samples were collected from (p < 0.05). Samples are color coded by disease state of the organism they were collected from. Red for diseased, green for healthy and blue for the in vitro treated and rescued sample.
again with invariable distribution of samples between trials and continuance of the separation into three groups. However, determining where the data points are in three-dimensional space was more difficult than the solutions for two dimensions. Going forward, the sample variables will be tested using a two-  Table 1 and 2). For example, Methylocystaceae has only a single sequence recovered across all three sick samples. Legionella, a pathogenic genus of note, also groups with the sick samples though it is found in higher abundance in healthy samples.
Based on the distribution of samples in ordination space samples were organized into three groups for ISA and TWINSPAN analysis. PI-515yF and PII-515yF were clustered into the healthy 1 group. The healthy 2 group consisted of TWINSPAN analysis revealed that there is a conserved group of taxa across all disease states (Table 5). Among the healthy samples the V3-V4 and V4-V5 primers recovered distinct sets of taxonomic data. The greatest differences between samples within a group occurred within the healthy 1 group. Table 5. An ordered two-way table generated by TWINSPAN. Samples are designated by number: 1 is PI-515yF, 2 is PI-357wF, 3 is PII-515yF, 4 is PII-357wF, 5 is PIII-515yF, 6 is PIII-357wF, 7 is PIII-454, and 8 is Healthy-454. These are combined into higher order groups sick, healthy 1 (H1) or healthy 2 (H2).

42
ISA was performed to obtain statistical information regarding which taxa best represented the three groups found from the environmental fitting (Table 6 all ISA data is reported in the Appendix Table 3). The ISA analysis used returned three primary values: the p value for statistical significance; A, which measures the percentage of times a taxonomic result occurs within a group, and B, which measures the percentage of times that taxonomic result occurs in the samples that made up that group. Higher order groups were also included: healthy and sick, sick and rescue, and rescue and healthy. All indicator species were present in all samples that made up their respective groups, B = 1. Two Table 6. Summary of the significant indicator species from all data sets. Sick is a combined group of all PIII samples. Rescue is the rescue data set. A is the proportion of times a taxonomic group occurred within that group. B is the frequency the taxonomic group occurred in the samples that make up that group. Pelagibacter and Terrimucrobium. These taxa all had equal P values of 0.0368 and occurrences of these taxa were within these two groups, A = 1 and B = 1.

DISCUSSION
The evaluation of the primers and the bacterial microbiome profiles recovered suggested there was a complex interaction between sample and primer set used for sequencing (Figure 1). The initial assumption, based on the clustering pattern of three groups of two samples, is that the sample was the primary driving force for similarity between profiles. We expected to see smaller differences in data when using the V3/V4 and V4/V5 primers on the same sample and larger differences when looking across samples as these primers generate overlapping data. The initial data showed that PI and PII samples showed the highest similarity by primer while the PIII group showed the highest similarity based on sample origin. The PIII group was also more similar to the PI and PII samples sequenced with the 515yF primer than the PI and PII samples sequenced with the 357wF primer. This means that the healthy samples contained a more diverse microbiome and that each primer set, V3-V4 and V4-V5, recovered a distinct range of taxonomic data. The sick samples had a much less diverse microbiome that was recovered equally with both primer sets.
Examining the species list, there are a large number of sequences that identified only to the bacterial domain for the 515yF primer set in the healthy samples. This is unlikely to be a function of the primer itself as the PIII-515yF data set has a two log reduction in the number of sequences identified to the 45 bacterial domain. This could mean the domain level classifications represent organisms not present in the RDP 16S database. This database does contain data from global sources, but Lake Baikal may be poorly represented in that database. More accurate taxonomic classification for the unclassified bacteria should be obtainable if longer sequence reads could be generated (Shin et al., 2016). Future exploration of these unclassified taxa may be an important avenue of further research since this occurs in high abundance within healthy organisms but at much lower abundance in sick samples.
The general lack of recoverable eukaryotic data was disappointing, but not unexpected. The DNA samples were generated from harvested sponge tissue, the bulk of which would be sponge cells. This high amount of sponge DNA would then mask non-sponge eukaryotic rRNA sequences. Our sequencing project only generated 100,000 sequences per reaction. That is also the most likely explanation for why only bacterial data was recovered by the 515yF/928pfR primer set. If more sequences were obtained in the reaction then non-bacterial sequences could be recovered, de-multiplexed and analyzed using other databases. Other eukaryotic primers were not tested due to constraints of DNA samples sizes.
The V4/V5 primer set does appear to recover fewer unique taxa compared to the V3/V4 primer (Figure 2). There were too few samples to perform meaningful statistical analysis on the species richness data. However, looking at the species data there were a number of instances where rare taxa, fewer than five sequences recovered, appeared in only one sample and account for a large portion of species richness. So the differences in species richness may be simply if a taxonomic group is rare enough it is either missed in sequencing or could not be present in the local area where the sample was collected. Samples with lower species richness did not cluster together with multivariate analysis, so this is unlikely to be a significant factor. The disparity in richness between samples can be addressed with access to water samples collected from sample sites. Microorganisms can be filtered from the water samples, profiled, and compared against the existing sponge data to see if the species richness reflectes the local environment. An alternate explanation is a product of primer choice (Rintala et al., 2017). The two primer sets used both amply V4. If the bulk of the sequencing data from endemic Lake Baikal bacteria was generated with V3-V4 primers, then the V4-V5 sequences recovered would contain a variable region not in any of the reference databases resulting in poor alignments at low taxonomic levels. Either explanation underscores the need for expanded sampling and use of newer sequencing platforms that enable longer reads.
Multivariate analysis was conducted with the core nine samples, omitting the samples PI and PII re-sequenced with 357wF due to their exceptional degree of similarity. Multiple ordination techniques were used to visualize the data because, while there were certain objective metrics that needed to be satisfied, e.g.low stress in NMDS, the resulting ordination still needs to be explainable.
NMDS using Raup-Crick distribution yielded an exceptionally low stress, but the resulting ordination shows the rescue sample is so unrelated to the remaining samples that there were essentially no differences between them.
The PCA analysis showed how the large differences between the rescue sample and the other samples may have obscured the relationships between the non-rescue samples. One of the sick samples, PIII-357wF, was clearly distinct, yet the other sick samples show high similarity to healthy samples ( Figure 3). So other ordination techniques need to be examined. NMDS is the current most favored for ordination techniques and was used next. There are multiple different methodologies available to calculate the differences between samples when constructing the ordination (Figure 4). This is a small data set, so a range of methodologies were tested. The different methodologies have different applications with some being appropriate for ecological data sets where species exist as a gradient. For example, the Raup-Crick method looks at the probability of a taxa being present or absent in a data set and it was unexpected that it yielded one of the worst distributions. The ordinations produced using Manhattan and Bray-Curtis dissimilarity functions were identical when overlaid on top of each other (Figure 4c). After selecting the Bray-Curtis method for generating the NMDS, a two-dimensional solution was compared to a threedimensional one. Both generated low stress solutions, but the two-dimensional 48 solution had the advantage of readability over the three-dimensional when species plots were superimposed on the ordinations. The inclusion of the species information (not shown) resulted in a plot that was too cluttered to be of use.
With an ordination that shows an ordered structure to the data, the descriptive variables were fit to the ordination and tested statistically to see if they explain the observed grouping ( Figure 5). The grouping of samples in ordination space cannot be explained by factors such as the organism the DNA was collected from, the variable regions used for sequencing, the sequencing platform used to generate data or the timing of samples collected before or after the disease outbreak occurred. These results have several important conclusions. It is appropriate in this project to compare data from different sequencers. A concern arose looking at the dissimilarity dendrograms, and there was not a profound change in the microbiome composition in the sponges following disease outbreak.
The best explanation for the grouping of samples is the health of the organism the samples were collected from ( Figure 6). This is an expected conclusion as healthy sponges would be functioning as filter feeders and would have a microbiome population that most likely reflects the microbiome of Lake TWINSPAN revealed a core group of taxa that probably contains the core microbiome reported to be transmitted vertically among conspecific sponges (Reveillaud et al., 2014). The presence of these shared taxa in samples of different disease states suggests that they are unrelated to the disease outbreak.
Analysis of samples collected prior to 2010 would allow for further clarification of the core microbiome of L. baicalensis.
Several taxa were found to be significant by ISA. These taxonomic units must be considered as independent to each other as they represent different DNA sequences that are aligned to a reference database. Thus a phylum level taxa would represent a distinct organism from a sequence that identified to the family level within that phylum and not be considered together as a group.
The significant species in the healthy 2 sample include Flavobacteriales, which is an order of bacteria that contains members that are pathogenic to humans, fish or other aquatic life, while others are non-pathogenic.
Clostridiaceae 1 is an unclassified family group within the Clostridiales order and also encompasses organisms of very different pathogenicities and environmental 50 niches. Organisms within class Clostridiales are associated with coral mucus but little is known as to the nature of that relationship (McKew et al., 2012). Given that these are significant to one of the healthy groups these are most likely unclassified bacteria that are not pathogenic.
The significant taxa for the sick samples are more interesting. The presence of Opitutus as an indicator species is surprising as this is a genus of Verrucomicrobia found in rice paddy soil. It is an obligate anerobe that produces propionate as a fermentation byproduct but very little else is known about this organism (Chin et al., 2002). Acetobacteraceae is a family of oxidative fermenters that can tolerate low acid environments but grow optimally around pH 5-6. They ferment sugars and ethanol and produce acetic acid. Acidobacteria Gp3 is a subdivision of the phylum Acidobacteria which also consists of acidophiles. Without species level identification, these are still broad enough groups that more definite conclusions cannot be made. This suggests that there may be a correlation between pH and disease that should be explored in future work.
The combined healthy 2 and sick group had several significant taxa.
Acidobacteria, which is a phylum level taxonomic group at a higher order than Acidobacteria Gp3, also includes acidophiles. The healthy 2 group contains samples collected from sponges both prior to and after the disease outbreak which supports the need to collect pH data both in sponge tissues and in the 51 surrounding water. Alcaligenaceae is a family of bacteria found in all nonextreme enviroments some of which are known to be pathogenic.
Terrimicrobium, another Verrucomicrobia, is another poorly characterized bacteria that was originally isolated in rice paddies. Pelagibacter unqiue, is the most abundant marine and freshwater bacterium on Earth (Morris et al., 2002).
Its inclusion as an indicator species is most likely due to its absence in the V4-V5 data set.