Abstract
With the advent of Next Generation Sequencing, many population specific whole genome sequences published thus far, predominantly represent individuals of European ancestry. While sequencing efforts of underrepresented communities in genomes datasets, like the Yoruba West-African, Han Chinese, Tibetan, South Korean, Egyptian and Japanese have recently added to the public genomic repositories, a comprehensive understanding of human genomic diversity and discovery of trait-associated variants necessitates the need for additional population specific analysis. In this context, the genomics of the population from the Indian sub-continent, given its genetic heterogeneity needs further elucidation.
In this context, the endogamous Zoroastrian-Parsi community of India, offer an exceptional insight into a homogenous population that has culturally, socially, and genetically remained intact, for 13 centuries amidst the genomic, social and cultural Indian landscape, consequent to their migration from the ancient Persian plateau.
Notwithstanding longevity as a trait, this endangered community is highly susceptible to cancers, rare genetic disorders, and display a documented high incidence of neurodegenerative and autoimmune conditions. The community as a matter of cultural practice abstains from smoking.
Here, we describe the assembly and annotation of the genome of an adult female, Zoroastrian-Parsi individual sequenced at a high depth of 173X using a combination of short Illumina reads (160X) and long nanopore reads (13X). Using a combination of hybrid assemblers, we created a new, population-specific human reference genome, The Zoroastrian-Parsi Genome Reference Female, AGENOME-ZPGRF, contains 2,778,216,114 nucleotides as compared to 3,096,649,726 in GRCh38 constituting 93.235% of the total genomic fraction. Annotation identified 20833 genomic features, of which 14996 are almost identical to their counterparts on GRCh38 while 5837 genomic features were covered in partial. AGENOME-ZPGRF contained 5,426,310 variants of which the majority were SNP’s (4,291,601) and 960,867 SNPs were AGENOME-ZPGRF specific personal variants not listed in dbSNP.
We present, AGENOME-ZPGRF as a whole reference for any genetic studies involving Zoroastrian-Parsi individuals extending their application to identify disease relevant prognostic biomarkers and variants in global population genomics studies.
Introduction
Recent technological advances in high-throughput genome sequencing have brought a steep decline in the cost of genetic information1, while increasing the predictive power and path to clinical translation of risk estimates for common variants found in genome wide association studies2. Most massively parallel sequencing approaches use simple alignment of short reads to a reference genome to study genomic variation. While the approach has been successful3, an exhaustive study of structural variants and SNPs at a high depth, coverage, and confidence is essential for translation to precision medicine. The increase in long read sequencing technologies, as part of the 3rd generation genomic approaches have facilitated the assembly of large eukaryotic genomes in the last decade4,5. These advanced genomic platforms have given us powerful methods to generate long reads that compliment short accurate reads like those from Illumina sequencing chemistry to complete gaps and get better contiguity for the overall human genome.
Medical genetics has taken a leap forward in personalized medicine with the information of whole genome sequence for inheritable conditions, birth defects and chromosomal disorders6. Personalized genome assembly has shed light on the effects that non-genetic, disease-linked etiologies like methylation of CpG base pair islands have on gene availability for transcription7,8. With the advances in genomic sequencing, the importance of understanding genetic variability across extant human genomic diversity has become crucial9, especially since the current reference genome assembly GRCh3810 and variants cover only a sub-section of global population sub-types due to its mosaic nature of Caucasian and African genomic admixtures. Therefore, approaches focusing on understanding the minor, ethnic population groups, hitherto unrepresented in major genome variation studies, such as HapMap11, 1000 Genome Initiative12, Human Genome Diversity Project13 and disease specific variation studies like TCGA14 has become a prerogative in population genomics. This approach has been extended to sequence whole genome population references from Chinese15, Ashkenazi16, Korean17, Japanese18, Turkish19, Egyptian20, South Indian Asian-Indian21,22 and many more draft genomes in the recent years. The availability of reference genomes from multiple human populations greatly aids attempts to find genetic causes of traits that are over- or under-represented in those populations, including susceptibility to disease.
The Indian subcontinent is a hotspot of social, ethnic and genetic diversity with waves of migration to Southeast Asia through India23. The genetic landscape of this region is mainly constituted from Austro-Asiatic (AA), Indo-European (IE), Tibeto-Burman (TB), and Dravidian (DR) families with cultural and social frameworks that discourage and at times prohibit intermarriages between ethnic groups24,25. The extensive genetic diversity of India with genetically isolated subpopulations, makes it an ideal for population genomic studies to explore the disease–variants relationships.
The Zoroastrian-Parsi of India represent one such endogamous, genetically homogenous community. While the community members have a longer median life span, their present numbers in India are dwindling making a genomic study of the community critical for population genomics. This community in India trace their origins to migrations from the Persian plateau (∼847 AD), from Pars and Khorasan through the island of Hormuz to India where they settled as Parsis and practiced their faith, Zoroastrianism. The venerate Fire as the medium of worship and practice ostracism against smokers, therefore representing an important genomic biobank in understanding diseases associated with nicotine dependence. The community has a high prevalence of cardiovascular disorders, Autoimmune disorders like Rheumatoid Arthritis, Neurological/Neurodegenerative conditions like Parkinson’s Disease, Alzheimes Disease and different types of cancers.
Here, we describe the assembly and annotation of the genome of an adult female, Zoroastrian-Parsi individual sequenced at a high depth of 173X using a combination of short Illumina reads (160X) and long nanopore reads (13X). Using a combination of hybrid assemblers, we created a new, population-specific human reference genome. The Zoroastrian-Parsi reference genome, AGENOME-ZPGRF, contains 2,778,216,114 nucleotides as compared to 3,096,649,726 in GRCh38. Annotation identified 20674 genomic features, of which 15235 are > 99% identical to their counterparts on GRCh38, while the remaining genes were found to covered in partial. AGENOME-ZPGRF contained 5,426,310 variants of which the majority were SNPs (4,291,601) and 960,867 SNPs were AGENOME-ZPGRF specific not listed in dbSNP.
Materials and Methods
Sample collection and ethics statement
The donor is a healthy, non-smoking Parsi female volunteer (age: 65 y.o), invited to attend blood collection camps at the Zoroastrian center in the city of Bangalore, India under the auspices of The Avestagenome Project®. The adult female (>18 years) underwent height and weight measurements and answered an extensive questionnaire designed to capture her medical, dietary, and life history. The subject provided written informed consent for the collection of samples and subsequent analysis. All health-related data collected from the cohort questionnaire were secured in The Avestagenome Project® database to ensure data privacy.
Genomic DNA extraction
Genomic DNA from the buffy coat of peripheral blood was extracted using the Qiagen Whole Blood and Tissue Genomic DNA Extraction kit (cat. #69504). Extracted DNA samples were assessed for quality using the Agilent Tape Station and quantified using the Qubit™ dsDNA BR Assay kit (cat. #Q32850) with the Qubit 2.0® fluorometer (Life Technologies™). Purified DNA was subjected to both long-read (Nanopore GridION-X5 sequencer, Oxford Nanopore Technologies, Oxford, UK) and short-read (Illumina Technologies)
Library preparation and sequencing on the Nanopore platform
Libraries of long reads from genomic DNA were generated using standard protocols from Oxford Nanopore Technology (ONT) using the SQK-LSK109 ligation sequencing kit. Briefly, 1.5 µg of high-molecular-weight genomic DNA was subjected to end repair using the NEBNext Ultra II End Repair kit (NEB, cat. #E7445) and purified using 1x AmPure beads (Beckman Coulter Life Sciences, cat. #A63880). Sequencing adaptors were ligated using NEB Quick T4 DNA ligase (cat. #M0202S) and purified using 0.6x AmPure beads. The final libraries were eluted in 15 µl of elution buffer. Sequencing was performed on a GridION X5 sequencer (Oxford Nanopore Technologies, Oxford, UK) using a SpotON R9.4 flow cell (FLO-MIN106) in a 48-hr sequencing protocol. Nanopore raw reads (fast5 format) were base called (fastq5 format) using Guppy v2.3.4 software. Samples were run on two flow cells and generated a dataset of ∼14 GB.
Library preparation and sequencing on the Illumina platform
Genomic DNA samples were quantified using the Qubit fluorometer. For each sample, 100 ng of DNA was fragmented to an average size of 350 bp by ultrasonication (Covaris ME220 ultrasonicator). DNA sequencing libraries were prepared using dual-index adapters with the TruSeq Nano DNA Library Prep kit (Illumina) as per the manufacturer’s protocol. The amplified libraries were checked on a Tape Station (Agilent Technologies) and quantified by real-time PCR using the KAPA Library Quantification kit (Roche) with the QuantStudio-7flex Real-Time PCR system (Thermo). Equimolar pools of sequencing libraries were sequenced using S4 flow cells in a Novaseq 6000 sequencer (Illumina) to generate 2 x 150-bp sequencing reads for 30x genome coverage per sample.
Raw fastq files Illumina and nanopore reads
The sample genome was of an adult female from the endogamous Parsi community which was used for the construction of 173X Zoroastrian Parsi Whole Genome Assembly. Illumina HiSeq with a read length of 2 × 150 bp is used for obtaining short reads for the genome. We obtained a total of 2.2 Billion sequences from the Illumina HiSeq platform (160X) and a total of 6.8 Million reads from the Nanopore platform. For the long reads, the library preparation was according to the standard protocol and the sequencing of the genome was performed using the Oxford Nanopore Minion platform
Quality trimming and Quality control of the reads
Quality trimming and adapter removal of the short Illumina platform reads was performed using AdapterRemoval (version 2.2.2)26 with minlength 30, trimwindow size 30 and reads lesser than quality score of Q30 were discarded. For adapter removal of long Oxford nanopore reads, Porechop tool (V0.2.4)27 with default options was used. The long error prone reads from Oxford Nanopore cannot cross the quality score of Q20 hence the cutoff was kept to 8 in this case. All the quality scores are checked using FastQC (version 0.11.5)28 and FastP (V0.20.1)29 tool (Supplementary Figure 1, 2).
Whole Genome assembly
The quality trimmed and adapter removed short and long reads were processed for Hybrid assembly. The choice of hybrid assembly was made using relevant literature study where short read alone, long read alone and short-long read hybrid assemblies were compared for different cases30 and hybrid assemblies outperformed and gave better QC statistics reflecting better quality of the assembled genomes. The raw data was sub sampled to 60X coverage with length cutoff of 60 bp using fastP according to the instruction on the Wengan GitHub repository. The processed reads were assembled with Wengan31 using D mode (uses DiscovarDenovo short-read assembler) with options -l ontraw, -g 3000 (3Gbp). An alternative assembly was generated by using HASLR, Wtdbg232, WenganA and WenganM assemblers.
Removing mis-assemblies at segmental duplications and centromere regions
Centromere regions were downloaded from UCSC web browser for GRCh38 version of human reference genome. Segmental duplications in a BED format flat file was downloaded from the GitHub repository (segDupPlots/ucsc.collapsed.sorted.segdups). A python script20 was used to remove miss-assemblies from Segmental duplications and centromere regions. Identification and annotations of repetitive elements was obtained using REPEATMASKER (V4.1.1)33 by aligning the genome sequences against known library of repeats in humans.
Read mapping and variant calling for Illumina sequencing reads
The variant detection for the AGENOME-ZPGRF female sample was carried out using GATK pipeline (V4.1.5.0)35, Picard (2.21.9) and Samtools (1.3.1)36. The GATK pipeline included read mapping and variant processing. Single-nucleotide variants (SNVs) and indels were called by local reassembly of haplotypes using HaplotypeCaller of GATK V4.1.5.0.
The following workflow was used for variant calling, the raw reads were pre-processed, converted to unaligned BAM and readgroup information were assigned using FastqToSam. The adapters were tagged using Markilluminadapter function and the bam file were converted to interleaved fastq sequences to map to the reference genome. The reference genome (GRCh38) is indexed using BWA index and samtools, further the sequence dictionary for reference was obtained using picard CreateSequenceDictionary function. Mapping the FASTQ reads to reference genome was performed by BWA-MEM (version 0.7.17-r1188). Information from unaligned BAM and the aligned BAM were merged using MergeBaMAlignment to retain the raw read information. The duplicate reads through experimental artefacts are tagged using MarkDuplicates (picard) module. The base quality score recalibration (BQSR) was applied to overcome the errors associated with base quality score due to sequencing errors. The BAM file were further indexed to identify variants by HaplotypeCaller. The variants obtained from HaplotypeCaller were annotated using SnpEff (4.3t), a genetic variant annotation and effect prediction toolbox.
Structural variants
SVs were called using DELLY233 with default parameters on duplicate marked bam file for germline SV calling (https://github.com/dellytools/delly).
Pharmacogenomics relevance
To assess the pharmacogenomics relevance, we obtained common variants in AGENOME-ZPGRF and dbSNP-138 database. These variants were annotated based on PharmGKB (www.pharmgkb.org) database34 to obtain pharmacogenomics association. The variants that were classified as conflicting-interpretation, uncertain significance and benign were removed. The variants that had Pharmacokinetic (PK) and Phamacodynamic (PD) associations were considered to obtain actionable SNPs.
Results
Benchmarking of hybrid assemblers for Whole Genome Assembly using Chr22
De novo assembly was performed on a female Parsi whole-genome sequencing data. The data was in the form of Illumina-paired end 160X coverage and Oxford nanopore data of 13X. To standardize the pipeline, Chromosome 22 data (genome size of 52 Mbp) was extracted from the whole genome data and tested with iterative combination of different assembler strategy. The Illumina paired-end short reads with a read length of 150 bp were assembled using the Abyss35 assembler resulting in a total length of 44,331,422 bp and a contig N50 of 15,753 bp.
We proceeded to use Hybrid assemblers known to perform scaffolding using long reads and polishing using short reads. Wengan assembler outperformed all other hybrid assemblers by producing a total length of 32,346,746 bp with 194 contigs greater than 50,000 bp. Quickmerge36 meta-assembler was applied on Wengan assembly which gave the lowest number of contigs versus length and Abyss (160X) gave the longest length with the higher number of contigs improving the contiguity of the assembly. The results for this exercise produced a total length of 50,216,737 bp which is close to 90% of the total length of Chr22 and 2,675 contigs (Appendix 1).
De novo assembly of the First complete Zoroastrian Parsi Whole Genome Reference Female (AGENOME-ZPGRF)
Following the assembly of Chr22, we extended our meta-assembly strategy to assemble the whole genome reference (Figure 1). Our Zoroastrian-Parsi genome (AGENOME-ZPGRF) is based on high-quality de novo assembly from one female Parsi individual. The assembly was generated from a combination of short and long read data sets: 2×150 bp Illumina paired-end reads (160X), Oxford Nanopore (13X) reads averaging over 5,784 bp in length (Table 1).
We initially created five hybrid assemblies using Illumina short reads and Oxford Nanopore Technology long reads. Three assemblies were based on Synthetic Scaffolding Graph approach using Wengan hybrid genome assembler, one assembly based on synthetic paired end reads using HASLR and one assembly based on fuzzy De-Brujin graph method using wtdbg2. WenganD gave the best results with respect to total length of 2.7 Giga bases (Gbp), N50 of 2 Mega bases (Mbp) and genomic fraction of 93.2%. The annotation based on QUAST-LG identified 14,996 complete and 5,837 partial number of genomic features (Table 2). The Parsi reference genome, AGENOME-ZPGRF, contains 2,778,216,114 nucleotides as compared to 3,096,649,726 in GRCh38. This assembly, designated AGENOME-ZPGRF, was the basis for all subsequent refinements and analysis.
Completeness of the genome
Gene completeness was measured with BUSCO54 v.4.1.437 using the Primates ODB-10 gene set. BUSCO provides intuitive metrics to describe genome, gene set or transcriptome completeness38. We observed 88.3% of genome completeness, 12,165 complete BUSCOs, 12,113 complete and single copy BUSCOs, 52 complete and duplicated BUSCOs, 461 fragmented BUSCOs, 1,154 missing BUSCOs. The total BUSCO group searched was 13,780.
Repeated elements in AGENOME-ZPGRF
When annotating repeats with REPEATMASKER, about 48.34% of the genome (Table 3) was identified as repetitive, with its results similar to those from EGYPTRef, AK1 and YORUBA genome assemblies. Most of the repetitive elements comprised of 21.80% Long Interspersed Nuclear Elements (LINEs), 13.45% of Short Interspersed Nuclear Elements (SINEs) and the rest in ALU elements, Mammalian-wide Interspersed Repeats (MIRs), Long Terminal Repeats (LTR) elements, DNA elements, small RNAs, satellites and simple repeats. Further, we found that about two-thirds of the SNPs identified in the repeat regions were found in long interspersed elements (LINE; 21.80%; majority occurring in LINE1 elements) or short interspersed elements.
Variant identification
We used the GATK variant calling pipeline, performed according to GATK best-practice recommendations and the HaplotypeCaller tool was employed for identifying putative variants, followed by Snpeff (build 2017-11-24) to annotate and make functional predictions. Our analysis revealed 5,426,310 variants of which 79% are SNPs (4,291,601) and 21% are indels, multiple-nucleotide polymorphism (MNPs) and mixed variants (Figure 2). The transitions (297,330), transversions (251,540), Ts/Tv, ratio was 1.18, resembling expected figures in similar studies. Among the identified SNPs, 41.40% were intronic and 40.78% were intergenic while the rest were SNPs in the upstream (8.84%) and downstream region (7.19%). SNPs in the exonic region constituted only 0.5% of the total SNP count and the SNPs in the untranslated 3′ or 5′ region made up the rest (Figure 2, Supplementary Figure 3). Of these SNPs, 14,572 were missense (non-synonymous), 13,321 were silent (synonymous) substitutions and 189 nonsense SNPs. This is consistent with a non-syn:syn (dN/dS) ratio of ∼1 expected of a normal genome39.
Based on the SNPEff40 annotation, we sought to identify the functional impact of the SNP’s in terms of “high”, “low”, “moderate” impact based on their occurrence on the genome. High impact SNPs occur when (i) the variant hits a splice acceptor/donor site, (ii) a start codon is changed into a nonstart codon, or (iii) a stop codon is gained or lost due to the variant. We identified 1652 SNPs with a high-impact effect (Appendix 2), 16128 with low impact and 14856 with moderate impact. The majority of the high impact variants occurred on the gene DPP6 (n=2920, Appendix 4, 6), variants of which have been reported to be associated to familial idiopathic ventricular fibrillation41.
Out of the ∼4.2 million SNPs identified, there were 960,867 potentially novel SNPs that did not exist in dbSNP42 (Appendix 3, 7). Further analysis of the AGENOME-ZPGRF specific novel SNPs showed, 50.9% and 31.2% were in intergenic and intronic regions, respectively. We found 9.2% were upstream, 7.17% downstream of a gene, and 4202 (or 0.49%) of the SNPs were found to be in coding regions (Table 5). Among the 4202 SNPs in coding regions, we could further classify 31 nonsense SNPs and a total of 415 SNPs with a high impact (Appendix 5). Most of the variants (both genomewide and AGENOME-ZPGRF specific variants) occurred on Chr1, while the highest frequency of distribution occurred on Chr22 (Table 4). We identified 1,133,653 indels, which consisted of 546,985 insertions and 586,668 deletions. Of these indels, 503,214 (or 44%) were found to be novel. Majority of the unique variants occurred on LOC105379427 (n=1552, Appendix 4) that code for zinc finger protein 717-like proteins (putative) and DUX4L18/DUX4L19/CNTNAP3B genes which have been implicated in cognitive disorders.
The high impact SNPs across the AGENOME-ZPGRF genome are distributed among 1,014 protein-coding genes in the genome and 311 non-coding regions. We next classified the high impact variants to understand the significance of the coding (sSNPs) and non-coding SNPs (nsSNPs), in terms of their distribution in protein class, pathways, biochemical activity with KEGG43 pathways using DAVID44. The distribution of both high impact sSNPs and nsSNPs was significantly enriched in G protein coupled receptor pathway genes, olfactory transduction. Our finding is consistent with studies that demonstrate higher levels of polymorphism observed in human olfactory gene family45. In addition, we found enrichment for pathways associated with neuroactive ligand-receptor and osteoclast differentiation. The majority of the high impact coding SNPs belonged to transmembrane helices, transmembrane and receptor protein classes. Reactome based pathway analysis46 showed that pathways for antigen presentation: Folding, assembly and peptide loading constituted the major pathway implicated for genes harboring the coding and non-coding SNPs (Figure 4).
Genetic structural variation in AGENOME-ZPGRF
Using short-read sequencing data of AGENOME-ZPGRF, we called 69,148 SVs using DELLY2 structural variant prediction tool33 (Figure 3). We observed that while most of the SVs were deletions (n=40070), we found other SVs categorized as inversions (n=6004), duplications (n=5808), insertions (n=2129) and translocations (n=15137). No mis-assemblies were observed outside centromeric regions and segmental duplication regions.
Pharmacogenomics and drug risk assessment using AGENOME-ZPGRF SNPs
One of the aims of personalized genomics is to assess the individuals SNPs for disease and drug reaction assessment aiding drug dosage regimens. Using the variant-drug risk correlation annotation in the PharmGKB database34 and KEGG database, we sought to understand SNPs of pharmacogenomic relevance in the AGENOME-ZPGRF. We identified 20 unique SNPs (Appendix 8) associated with 12 genes distributed across 9 chromosomes (chr1,2,4,7,8,11,14,15 and 16) with pharmacogenomic relevance based on PharmGKB (Appendix 7). We identified 10 actionable SNPs from literature as it pertains to treatment with various drugs, some of which are also represented in the PharmGKB (Table 3).
Discussion
We present the first, high depth whole Zoroastrian-Parsi Genome Reference Female sequence, AGENOME-ZPGRF for the Zoroastrian-Parsi population of India. The AGENOME-ZPGRF is a high depth whole genome sequence at 173X, combining genomic reads from short read (160X; Illumina) and long read (8X; 5X; Oxford Nanopore Technologies) sequencing technologies. AGENOME-ZPGRF represents the first high depth whole genome sequence from the Indian subcontinent, where there have been previous genome assemblies of male47,22 and female21 at <40X. AGENOME-ZPGRF, contains 2,778,216,114 nucleotides as compared to 3,096,649,726 in GRCh38. AGENOME-ZPGRF is unique as it is derived almost entirely from a single individual unlike GRCh38, which represents a mosaic of multiple individuals, thereby, adding further insight into personal genome and variant-disease association approaches. We have extended our study to sequence the first Zoroastrian-Parsi male genome, presently 2,589,561,354 bp in length with 87.46% genome fraction mapping to GRCh38 (Appendix 9).
The genome completeness is 93.25% which is on par compared to other high resolution whole genomes from the Ashkenazi16, Egypt ref48 and Yoruba genome49 assembly projects. Our assembly quality is further enhanced by BUSCO completeness score of 88.3% validating our benchmarking process for genome completeness. Annotation identified 20,833 genomic features, of which 14,996 are > 99% identical to their counterparts on GRCh38. Most of the remaining genes were partial. This assembly, designated AGENOME-ZPGRF, was the basis for all subsequent refinements. Our analysis revealed 5,426,310 variants of which 79% are SNPs (4,291,601) and 21% are indels, MNPs and mixed variants. AGENOME-ZPGRF had 960,867 novel/personal SNPs not listed on dbSNP. The AGENOME-ZPGRF reference standard of this endogamous, socially, genetically divergent community adds valuable insights into human population genetic diversity as compared to other global populations. Furthermore, our study adds information to the catalogue of genomic variation derived from the 1000 human genome project consortium, which also includes samples of Indian origin (1000 Genomes Project Consortium).
Besides the nuclear genome, we had previously studied AGENOME-ZPGRF mitochondrial genome variants50 as the first de novo Zoroastrian-Parsi mitochondrial genome, AGENOME-ZPMS-HV2a-1 (Genbank accession, MT506314). Our analysis showed that the AGENOME-ZPGRF belongs to haplogroup HV2a and that showed 28 unique variants compared with the revised Cambridge Reference Standard (rCRS). HV2a is an extremely rare haplogroup, and prevalent among the Zoroastrians-Parsis in our study cohort. The haplogroup HV2a is closely associated with Caucasian descent, with its documented prevalence dating back to ancient Scythians who were geographically distinct group of nomads joined by common cultural expressions57. They date back to about the 9th century BCE until the 4th century CE56 tracing their origins to the Caspian Pontic Steppes and the Altai mountains51, indicative of the unique genomic landscape of the contemporary Zoroastrian-Parsi among Indian and European communities.
Variants in personal genomes can be used to assess disease risk, carrier status and drug response/interaction contributing to a pharmacogenomic insights in clinical genetics. We have assessed the AGENOME-ZPGRF genome using OMIM, PharmaGKB and KEGG databases for SNPs with health and disease consequences. We identified high risk for Multiple sclerosis, among other diseases that include cancers and neurodegenerative diseases. We found two CHRNA3, CHRNA5 alleles: rs16969968, rs1051730 gene variants that have been associated with cognition, possibly mediating in part risk for developing Nicotine Dependence52,53. We also found a C>A variant in C11orf65 located near ATM gene regulating metformin response in Type 2 diabetics54. Additionally, we found intronic variant (rs762551) in CYP1A2 associated with leflunomide induced toxicity in treatment for Rheumatoid Arthritis55. In the context of preventive pharmacogenomics association, we found the AGENOME-ZPGRF, harbored a SNP (C>T) in DPYD;DPYD-AS1 implicated in fatal consequences to 5-Fluorouracil (5-FU)-based treatments (4%-5%, early onset-severe to 0.3%, fatal) in patients with dihydropyrimidine dehydrogenase (DPD) deficiency.
In sum, our present study has delivered the first, complete, de novo, high depth genome assembly AGENOME-ZPGRF for Zoroastrian Parsi community of India. Analysis of the variants in AGENOME-ZPGRF indicated 960,867 novel/personal SNPs, some of which were found associated with adverse drug interactions in separate studies. We have also completed the whole genome assembly of a Zoroastrian-Parsi male, whose variant annotation is underway. Further analysis of personal variant-disease-drug response annotations that are gender specific made using clinically validated variants will complement current healthcare practices with personalized pharmacogenomics. This will lead to safe, accurate, drug dosage and treatment regimen by physicians and clinical trials.
Declarations
Ethics approval and consent to participate
We would like to state that this ethics review board is not affiliated with a commercial entity, and we confirm that the Ethics Review was sought from an independent ethics review board not affiliated with the funder or the commercial entity, in line with Declaration of Helsinski that the “committee must be transparent in its functioning, must be independent of the researcher, the sponsor and any other undue influence and must be duly qualified”.
The study was approved by the Institutional BioEthics Committee constituted by the Department of Biotechnology, Government of India (BIAG-CSP-033). The committee constituted is compliant with the scientific, medical, ethical, legal and social requirements of the research proposal and in line with the 1964 Helsinki declaration and its later amendments. All subjects have provided written informed consent for the collection of samples and subsequent analysis.
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Funding
The project was funded by the grant awarded to Dr.Villoo Morawala-Patell, titled “Cancer risk in smoking subjects assessed by next-generation sequencing profile of circulating free DNA and RNA” (GG-0005) by the Foundation for a Smoke-Free World, New York, USA.
Supplementary Figures
Acknowledgements
We would like to thank Dr.Raja Mugasimangalam and Dr. Sudha Rao, Genotypic Technologies, for their valuable inputs, Dr.Paul Morill and Dr.Farah Patell Socha for their excellent project management and guidance regarding GG-0005, Dr.Sanaya Patell McGaw for editorial assistance and Ms.Mahima Kishinani, Ms.Janet Rymound for contributing to literature review and analysis. We thank the Zoroastrian-Parsi community of India for their enthusiastic cooperation and the The Avestagenome Project® project team.