ABSTRACT
Very little is currently known about the major histocompatibility complex (MHC) region of cynomolgus macaques (Macaca fascicularis; Mafa) from Chinese breeding centers. We performed comprehensive MHC class I haplotype analysis of 100 cynomolgus macaques from two different centers, with animals from different reported original geographic origins (Vietnamese, Cambodian, and Cambodian/Indonesian mixed-origin). Many of the samples were of known relation to each other (sire, dam, and progeny sets), making it possible to characterize lineage-level haplotypes in these animals. We identified 52 Mafa-A and 74 Mafa-B haplotypes in this cohort, many of which were restricted to specific sample origins. We also characterized full-length MHC class I transcripts using Pacific Biosciences (PacBio) RS II single-molecule real-time (SMRT) sequencing. This technology allows for complete read-through of unfragmented MHC class I transcripts (~1,100 bp in length), so no assembly is required to unambiguously resolve novel full-length sequences. Overall, we identified 313 total full-length transcripts in a subset of 72 cynomolgus macaques from these Chinese breeding facilities; 131 of these sequences were novel and an additional 116 extended existing short database sequences to span the complete open reading frame. This significantly expands the number of Mafa-A, Mafa-B, and Mafa-I full-length alleles in the official cynomolgus macaque MHC class I database. The PacBio technique described here represents a general method for full-length allele discovery and genotyping that can be extended to other complex immune loci such as MHC class II, killer immunoglobulin-like receptors, and Fc gamma receptors.
INTRODUCTION
Cynomolgus macaques (Macaca fascicularis; Mafa) are native to much of continental and insular Southeast Asia, but there are no feral cynomolgus macaques living in China. However, Chinese breeding facilities are one of the largest exporters of cynomolgus macaques for use in biomedical research. According to the Centers for Disease Control and Prevention, China provided over 70% of the total cynomolgus macaques imported into the United States in fiscal year 2014. These animals are derived from other geographic populations of cynomolgus macaques, most likely from Thailand, Laos, Vietnam, Cambodia, and Malaysia, but also possibly including animals from Indonesia and the Philippines. Cynomolgus macaques from Chinese breeding facilities therefore are genetically heterogeneous, and this heterogeneity may be particularly biologically relevant in highly diverse genomic regions like the major histocompatibility complex (MHC).
The MHC is both highly complex and important – the gene products encoded by the MHC present peptides to T-cells, playing an essential role in immune response to pathogens and other non-self peptides (Bontrop 2006). In macaques, the MHC class I region is highly polymorphic and extensively duplicated (Daza-Vamenta et al. 2004; Otting et al. 2005), with different chromosomes carrying different numbers of functional Mafa-A and Mafa-B genes. Many investigators have examined the MHC class I region of cynomolgus macaques from various origins (Budde et al. 2010; Campbell et al. 2008; Kita et al. 2009; Lawrence et al. 2012; Ling et al. 2012; Otting et al. 2007, 2009, 2012; Pendley et al. 2008; Saito et al. 2012; Shiina et al. 2015; Uda et al. 2004, 2005; Zhang et al. 2012; Zhuo et al. 2011), but only a limited number of studies have looked at cynomolgus macaques from Chinese breeding facilities (Krebs et al. 2005; Westbrook et al. 2015).
Here we describe a comprehensive study of 100 samples collected from two different breeding facilities in China that include individuals reported to be of Vietnamese, Cambodian, and mixed Cambodian/Indonesian origins. Many of the samples were related to at least one other animal in the study, allowing us to characterize lineage-level haplotypes based on shared Illumina short-read transcript sequences. We then performed full-length MHC class I transcript analysis on a subset of the samples using the PacBio long-read single-molecule real-time (SMRT) sequencing system. This analysis improved on a previous MHC class I study using PacBio circular consensus SMRT sequencing (Westbrook et al. 2015) by using more advanced sequencing chemistry, and a data analysis pipeline focused on the highest quality reads that did not need to account for sequencing reads with insertion/deletion errors.
MATERIALS AND METHODS
Animals
A total of 100 whole blood samples were collected from two different breeding facilities in China. Thirty samples from reported Vietnamese-origin animals were obtained from Yulin Hongfeng Experimental Animal’s Domesticating and Breeding Center (Guangxi, China). A total of 70 samples were obtained from Hainan Jingang Laboratory Animal Co., Ltd. (Hainan, China) - 57 samples were obtained from reported Cambodian-origin animals, and 13 samples were from mixed Cambodian/Indonesian-origin animals. Wherever possible, sire/dam/progeny trio sets or parent/progeny duos were collected to obtain samples that share MHC haplotypes that are identical by descent; a total of 71 samples across both facilities were directly related to at least one other animal in the cohort. Samples were collected into TRIzol Reagent (ThermoFisher Scientific, Waltham, MA, USA) to stabilize the total RNA for shipment from China.
RNA isolation, cDNA synthesis, and PCR preparation
Total RNA was isolated from the blood samples and cDNA was produced from the isolated RNA using the RevertAid First Strand cDNA Synthesis Kit (ThermoFisher Scientific) following manufacturer’s protocols. For Illumina MiSeq genotyping, a 195 bp amplicon spanning the most polymorphic region of MHC class I molecules was generated using primers designed in conserved regions of exon 2, as shown in Supplemental Fig. 1. Amplification was performed using a 4-primer system to add unique barcodes to each sample, using consensus sequence adapters and barcodes designed by the Fluidigm Corporation (San Francisco, CA, USA), as detailed in Supplemental File 1.
For PacBio RS II full-length MHC class I transcript sequencing of 72 samples from the full cohort, an ~1,150 bp amplicon with primers located in the 5‘UTR and 3’UTR of MHC class I sequences was generated as shown in Supplemental Fig. 2. A cocktail of two forward primers and three reverse primers was used to maximize homology with the majority of MHC class I sequences. A unique set of barcoded primers was used for each sample, with the barcodes integrated onto the ends of the sequence-specific oligos at synthesis. Barcodes are available from Pacific Biosciences (Menlo Park, CA, USA). Primer sequences are shown in Supplemental Fig. 2, and PCR conditions are detailed in Supplemental File 1.
All PCR products were purified using the AMPure XP PCR purification kit (Agencourt Bioscience Corporation, Beverly, MA, USA) and quantified using the Quant-iT dsDNA HS Assay kit and a Qubit fluorometer (Invitrogen, Carlsbad, CA, USA), following the manufacturer’s protocols.
Sequencing
The MHC class I exon 2 genotyping amplicons were sequenced on an Illumina MiSeq instrument (San Diego, CA, USA), and the full-length MHC class I transcripts were sequenced on a PacBio RS II instrument (Menlo Park, CA, USA) following the respective manufacturer’s protocols. Additional sequencing details are provided in Supplemental File 1, and an overview of the PacBio sequencing process is shown in Supplemental Fig. 3.
Data analysis
Processing of the MHC class I exon 2 genotyping amplicon was performed as described previously for MHC class II amplicons (Karl et al. 2014) using a custom-built command line pipeline to merge reads, trim primers, and compare MiSeq reads against a database of known cynomolgus macaque transcripts. Mafa-A and Mafa-B lineage-level haplotypes were determined for each of the 100 samples essentially as previously described (Heimbruch et al. 2015; Karl et al. 2013), first examining the sire/dam/progeny trio sets, then the parent/offspring duos, to look for sets of co-inherited sequences. Haplotypes for the unrelated animals in this cohort were inferred based on similarity to those observed in the related sets of animals.
A custom semi-automated pipeline was also developed for analysis of the PacBio MHC class I full-length transcripts; an overview of this pipeline is shown in Supplemental Fig. 4. Full details of the pipeline are available in Supplemental File 1, and the scripts and tools used are publicly available (https://bitbucket.org/dholab/mhc-long-amplicon-allele-discovery-in-mafa-from-china). In brief, the PacBio reads of insert (the consensus sequence generated from all passes around the circular SMRTbell template) were processed to eliminate indel sequences, short reads, and chimeric sequences using the SMRT Analysis v2.3 Reads of Insert protocol. Processed reads perfectly matching known full-length cynomolgus macaque transcripts were removed prior to clustering by using bbmap (version 34) to map reads to known full-length allele sequences. Reads not identical to known allele sequences were clustered using USEARCH–cluster_fast (version v8.0.1517_i86osx32) and requiring 100% identity within clusters. Clusters supported by three or more identical full-length reads were then binned as either extensions of known partial cynomolgus macaque sequences or putative novel transcripts, and a genotyping table with read counts per sample was generated. All extension and putative novel sequences were manually validated using Geneious Pro v9.0 (Biomatters Limited, Auckland, New Zealand) to perform BLAST alignments and create BAM alignment files of the reads of insert against the validated transcripts. Validated sequences were submitted to GenBank and the Immuno Polymorphism Database for the Major Histocompatibility Complex genes of Non-Human Primates (IPD-MHC NHP) (http://www.ebi.ac.uk/ipd/mhc/nhp/index.html) (de Groot et al. 2012; Robinson et al. 2013) for official nomenclature.
Data availability
Primer sequences are listed in Supplemental Fig. 1 and 2. Lineage-level Mafa-A and Mafa-B haplotypes assigned to each animal from the MiSeq exon 2 data are available in Supplemental Table 1, and full MiSeq read counts per transcript per animal are available in Supplemental Fig. 5. Full resolution Mafa-A and Mafa-B haplotypes assigned to each of the 72 animals examined by PacBio sequencing are available in Supplemental Table 2, and full PacBio read counts per transcript per animal are available in Supplemental Fig. 6. Novel and extension full-length sequences as listed in Table 1 are available from GenBank (LN851916-LN852021, LN994391-LN994530, LN998188, KY047492).
RESULTS AND DISCUSSION
Lineage-level haplotype analysis
In human and macaque MHC class I nomenclature, closely related alleles originating from the same gene are assigned to the same lineage group, signified by the 2-3 digits between the asterisk and the first colon of MHC class I sequence names. Since the Illumina MiSeq exon 2 genotyping amplicon is focused on a small segment of the MHC class I molecule, it is generally only sufficient for determining the lineage groups of any detected MHC class I sequences (considered ‘three-digit’ resolution in human HLA genotyping and referred to as ‘lineage-level’ resolution here). However, it provides a rapid, high-throughput methodology for haplotype analysis, especially within related animals. Using the parent/progeny trio and duo sets within these two cohorts, we described haplotypes of Mafa-A and Mafa-B transcripts inherited together from parents to offspring. Two representative trio sets with parents and their progeny are illustrated in Fig. 1a. Any transcripts observed as shared between parent and offspring were assigned to the inherited haplotype; any remaining transcripts observed within a parent, or in an offspring with just one representative parent, were assigned to the alternate haplotype. For animals without a direct relative in the cohort, haplotypes were inferred based on identity to the defined inherited haplotypes. We focused on the ‘major’ transcripts with high levels of steady state RNA for haplotype definitions. Abbreviated haplotype names were assigned based on presence of a major ‘diagnostic’ sequence that was typically the most abundant transcript on the haplotype, i.e., B013 haplotypes express the diagnostic major transcript Mafa-B*013. Different combinations of major allele lineages inherited with each diagnostic transcript were denoted with lowercase letters following the haplotype name, e.g., Mafa-B*013 inherited with Mafa-B*007 was designated B013a and Mafa-B*013 inherited with Mafa-B*137 was designated B013d. The full list of Mafa-A and Mafa-B haplotypes defined in these cohorts of cynomolgus macaques from Chinese breeding facilities is given in Supplemental Table 1.
A total of 52 distinct Mafa-A haplotypes and 74 unique Mafa-B haplotypes were identified in the 100 animals (200 chromosomes) evaluated in this cohort (Supplemental Table 1). A summary of the lineage-level haplotypes observed in each animal is shown in Fig. 2, with haplotypes directly inherited from parent to offspring denoted with solid outlined boxes. Full MiSeq haplotyping results, including reads supporting each transcript call per sample, are also given in Supplemental Fig. 5. While overall the large number of related animals in the cohort skews haplotype frequencies, it should be noted that four Mafa-A haplotypes (A006, A022a, A027, and A065) and three Mafa-B haplotypes (B013d, B028d, and B039e) are each observed in ten or more individuals. There are 1-2 major alleles expressed on each Mafa-A haplotype, accompanied by 0-5 minor alleles. Each Mafa-B haplotype is associated with 1-7 expressed major transcripts and between 0-15 minor transcripts.
Though many studies have examined the lists of MHC class I transcripts expressed in cynomolgus macaques of multiple geographic origins, few have studied the haplotype structures in those animals (Budde et al. 2010; Campbell et al. 2008; Otting et al. 2009, 2012; Saito et al. 2012; Shiina et al. 2015). The best-characterized population of cynomolgus macaques is from the island of Mauritius, where a limited number of founding animals were deposited on the island in the 1500s. Since the entire current population arose from a small number of ancestors, the diversity of their MHC region (and their entire genome) is limited to five lineage-level Mafa-A haplotypes (A031, A032a, A033, A060, A063) and seven lineage-level Mafa-B haplotypes (B019a, B045a, B075a, B104a, B147, B151a, B164a) (Budde et al. 2010; Wiseman et al. 2013). The diversity observed in these cohorts of 100 individuals from two Chinese breeding facilities represents a ten-fold increase in diversity at each locus compared to Mauritian cynomolgus macaques. While both populations of cynomolgus macaques are popular for biomedical research, this difference in overall diversity suggests that investigators should carefully consider their objectives when choosing a source for their animals. For protocols where MHC may be a confounding factor, like SIV and transplantation research, the restricted diversity of the Mauritian cynomolgus macaque population is particularly attractive. When testing vaccines or other drug therapies for potential negative side effects however, an extremely diverse source like cynomolgus macaques from Chinese breeding facilities may be advantageous. The overall diversity observed in this cohort also suggests that researchers should have their animals haplotyped if the MHC is a possible modifying factor, since this study revealed a large number of haplotypes despite the fact that multiple related animals were evaluated.
This study likely represents a snapshot of the total MHC diversity available within cynomolgus macaques from Chinese facilities, largely dependent on the country of origin of their breeding animals. Of the 52 total Mafa-A haplotypes observed here, nine (17%) were observed exclusively in animals of reported Vietnamese heritage, 17 (33%) were of reported Cambodian-origin only, and 4 (8%) were observed in the small set of reported hybrid Cambodian/Indonesian-origin animals (Supplemental Table 1). Only 22 (42%) Mafa-A haplotypes were observed in samples from multiple origins. The results for the 74 Mafa-B haplotypes are even more striking with only 21 (28%) observed in samples from multiple origins. Among the remaining Mafa-B haplotypes, 14 (19%) were exclusive to Vietnamese-origin animals, 26 (35%) exclusive to Cambodian-origin, and 13 (18%) were observed only in the mixed Cambodian/Indonesian-origin individuals. No previous MHC haplotype studies have been performed in either Vietnamese–or Cambodian-origin cynomolgus macaques, but limited studies have characterized haplotypes in Filipino, Indonesian, and Malaysian cynomolgus macaques (Campbell et al. 2008; Otting et al. 2009, 2012; Saito et al. 2012; Shiina et al. 2015). The cohorts studied here appear more MHC diverse than those sampled from the previously described insular and extremely southern continental Asian populations (it is unclear from precisely where the Malaysian cohort was initially derived).
Full-length allele discovery
In this study we employed a novel method for next-generation sequencing of full-length MHC class I alleles, using the PacBio RS II SMRT long read sequencing instrument. In our previous study with P4-C2 PacBio sequencing (Westbrook et al. 2015), data analysis relied on creation of contigs at 99% identity (largely to account for insertion/deletion artifacts) to define transcripts. Since this initial study, advances in PacBio sequencing with P6-C4 reagents and analysis software have eliminated the need to include less than 100% identical reads for identification of novel sequences. From the full 100 animal cohort, we selected 72 total samples across both breeding centers for PacBio sequencing based on lineage-level haplotype diversity. Table 1 presents a complete list of all 313 full-length MHC class I transcripts that we detected in these 72 individuals; these include 126 Mafa-A, 174 Mafa-B, and 13 Mafa-I transcripts. The complete PacBio sequencing results for each animal including numbers of identical sequencing reads supporting each transcript call are shown in Supplemental Fig. 6. Overall, nearly 80% of the transcripts identified in this study added new information to the existing Mafa MHC class I allele databases – 131 (42%) of the sequences were completely novel to the database, and 116 (37%) extended short but previously named database sequences to cover full-length open reading frames (Fig. 3).
This study improved the extant cynomolgus macaque allele databases in both total number of sequences and, more importantly, in the percentage of known transcripts that have been unambiguously described over the full coding region. The November 2015 update of IPD-MHC NHP contained a total of 899 Mafa-A, Mafa-B, and Mafa-I sequences. Of these, only 386 (43%) were full-length from start codon to stop codon. This study adds an additional 131 novel transcripts to the database, bringing the total to 1,030 named MHC class I sequences in the cynomolgus macaque database. With the additional 116 extant sequences extended to full length, the database of Mafa-A, Mafa-B, and Mafa-I transcripts following this study now contains 633 full-length sequences (61% of the 1,030 total transcripts).
We also examined distinct, full resolution haplotypes in the samples that were PacBio sequenced to compare to the previously determined lineage-level haplotypes from the full cohort. Fig. 1b shows the PacBio haplotype results for two representative trio sets of samples, and Supplemental Table 2 provides the complete list of full resolution haplotypes observed in the cohort. Haplotype names were retained from the lineage-level haplotype results, but sub-haplotypes (indicated with Roman numerals in the “Sub-Haplo” column of Supplemental Table 2) were assigned for any distinctions in specific transcripts. For instance, the lineage-level haplotype A001 has four (i-iv) distinct sub-haplotypes in the PacBio data, as they all vary from each other at the transcript level (A001i – A001iv carry Mafa-A1*001:01:02, Mafa-A1*001:01:03, Mafa-A1*001:02:01, and Mafa-A1*001:05, respectively). For the PacBio haplotypes, both major and minor specific transcripts were considered when assigning sub-haplotypes. As expected, results were concurrent with the lineage-level haplotypes overall, but the full-length allelic resolution provided by PacBio sequencing resulted in a significant expansion of the number of both Mafa-A and Mafa-B haplotypes. In total we identified 92 and 101 distinct full-resolution Mafa-A and Mafa-B haplotypes, respectively, in the 72 PacBio-sequenced samples. As with the lineage-level haplotypes, more than 90% of the Mafa-A and Mafa-B PacBio haplotypes were only observed in individuals from a single origin– 26 Mafa-A and 29 Mafa-B haplotypes were distinct to Vietnamese-origin samples, 43 Mafa-A and 46 Mafa-B haplotypes were only observed in Cambodian-origin samples, and 15 Mafa-A and 15 Mafa-B haplotypes were distinct to the hybrid Cambodian/Indonesian-origin samples (Supplemental Table 2).
It is unclear how many of these allelic distinctions make a functional biological difference in regards to immune responses. Although such confirmatory studies are beyond the scope of this report, we predict that specific nonsynonymous nucleotide variants between two closely related alleles may result in significant biological differences, particularly when the nucleotide differences alter the resulting amino acid within the peptide binding pocket. Therefore, full resolution genotyping provided by PacBio sequencing is particularly attractive for researchers looking for linkage between a specific immune response and specific MHC class I genotypes.
Overall, this study describes some of the first MHC class I Mafa-A and Mafa-B haplotypes in cynomolgus macaques from Chinese breeding facilities, including samples of reported Vietnamese, Cambodian, and mixed Cambodian/Indonesian origins. It also describes an improved PacBio long read sequencing approach to characterize full-length MHC class I transcripts without contig assembly, using only the highest quality reads. The transcripts identified here both increase the total number of known Mafa-A, Mafa-B, and Mafa-I sequences, and significantly improve the percentage of full-length transcripts in the nonhuman primate IPD database. Finally, the general PacBio-based method described here for long-amplicon sequencing can also be applied to other complex immune loci of macaques such as MHC class II, killer immunoglobulin-like receptors, and Fc gamma receptors, as well as those of other model organisms.
ACKNOWLEDGMENTS
The authors thank Catherine Westbrook, Suzanne Mate, and Galina Koroleva at the U.S. Army Medical Research Institute of Infectious Diseases, and Katherine Munson and the staff of the University of Washington PacBio Sequencing Services core for performing PacBio RS II sequencing. We also thank Annemiek J.M. de Vos-Rouweler for RNA isolations and preparation of cDNAs used in this study. Finally, we thank Nel Otting, Natasja de Groot and the IMGT Non-human Primate Nomenclature Committee for obtaining accession numbers and providing official Mafa transcript nomenclature designations for the sequences described here.
This research was supported by contracts HHSN272201600007C and HHSN272201100013C from the National Institute of Allergy and Infectious Diseases of the National Institutes of Health, and was conducted at a facility constructed with support from the Research Facilities Improvement Program (RR15459-01, RR020141-01).