Locating suspicious lethal genes by abnormal distributions of SNP patterns

A gene, a locatable region of genomic sequence, is the basic functional unit of heredity. Differences in genes lead to the various congenital physical conditions of people. One major class of these differences is caused by genetic variations named single nucleotide polymorphisms (SNPs). SNPs may affect splice sites, protein structures and so on, and thereby cause gene abnormalities. Some abnormalities lead to fatal diseases, and people with these diseases have a small probability of having children. Thus the distributions of SNP patterns at these sites will differ from the distributions at other sites. Based on this idea, we present a novel statistical method to detect abnormal distributions of SNP patterns and then locate suspicious lethal genes. We ran the test on HapMap data and found 74 suspicious SNPs. Among them, 10 SNPs map to reviewed genes in the NCBI database. Five of these genes relate to fatal childhood diseases or embryonic development, one gene can cause spermatogenic failure, and the other four genes are also associated with many genetic diseases. The results support our idea. The method is simple and is backed by a statistical test. It is a cheap way to discover suspicious pathogenic genes and their mutation sites. The mined genes deserve further study.

Author summary
Xiaojun Ding received the BS, MS and PhD degrees in computer science from Central South University. He is now an assistant professor at Yulin Normal University. His research interests include computational biology and machine learning.


Introduction
Genes are the most important genetic material and can, to some extent, determine a person's health. The functions of genes may be affected by genetic variations called SNPs, so SNPs are a good starting point for studying disease-related genes. Many defective genes caused by SNPs have been found for human Mendelian diseases (i.e. single gene diseases) [1,2]. For example, Prescott et al. [3] found a nonsynonymous SNP in ATG16L1 related to Crohn's disease. Seki et al. [4] reported that a functional SNP in CILP is associated with susceptibility to lumbar disc disease. These achievements are encouraging.

However, the discovered pathogenic genes caused by SNPs are only a small fraction; most of them are still unknown. At the same time, the number of SNPs is very large and most SNPs have no effect on genes [5,6]. Checking all SNPs by biological experiments is expensive. Narrowing the range of suspicious SNPs will greatly benefit the study of pathogenic genes [7]. For this purpose, SNPs have been analyzed from various angles. Lee et al. [8] built a functional SNP database which integrates information obtained from 16 bioinformatics tools and functional SNP effects for disease research. Cargill et al. [9] studied the different rates of polymorphism within genes and between genes. They concluded that the rates may reflect selection acting against deleterious alleles during evolution, and that the lower allele frequency of missense cSNPs is possibly associated with diseases. Adzhubei et al. [10] developed a tool named PolyPhen which predicts the possible impact of an amino acid substitution on the structure and function of a human protein. Kumar et al. [11] developed a tool named SIFT which predicts whether an amino acid substitution will affect protein function. Their algorithm is applicable to naturally occurring nonsynonymous polymorphisms and laboratory-induced missense mutations.
However, synonymous mutations can also contribute to human diseases [12]. For example, Westerveld et al. [13] reported that an intronic variant, rs1552726, may affect splice site activity.

In this paper, a novel method is proposed from the angle of genetic law. If a defective gene caused by SNPs can lead to fatal diseases, and most of the sick people do not have the chance to breed the next generation, this will affect the distributions of the SNPs within the gene. It provides us with a novel way to distinguish pathogenic SNPs from normal SNPs.

Given a bi-allelic SNP, 'A' and 'a' are used to denote the major and minor allele, respectively. Because chromosomes come in pairs, each individual will take one of the following three SNP patterns: pattern0='AA', pattern1='Aa', pattern2='aa'. In a population, the distribution of individuals taking each pattern can be counted. We are concerned with the abnormal distributions. Next, an example is given to illustrate

According to the bisexual reproduction rule, a child inherits one chromosome from the mother and one from the father. If the mother takes pattern 'AA' and the father takes pattern 'aa', the child will take pattern 'Aa', as shown in Fig 1. If everyone has an equal probability of marrying any other person in the population, the probability of a child with pattern 'Aa' should be 2*0.5*0.5=0.5, which means there should be about 0.5*1000=500 individuals taking the pattern 'Aa'. But none is observed. We think that the distribution at this SNP site is abnormal. The probable reason is that people taking 'Aa' die in childhood, so we cannot observe them. From the analysis, a hypothesis is proposed as follows.

In HapMap data, the SNP data of 11 human populations are sequenced. Since the relationships of individuals in each population are unknown, we make an assumption to simplify the computation.

Assumption 1: In each population, everyone has an equal probability of marrying any other person and of giving birth to a baby.

For population j, the distribution P of individuals over all the patterns can be counted: P = [p_0, p_1, p_2], where p_i is the percentage of the individuals with pattern i.
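As an illustration of how the distribution P could be counted for one population, the following is a minimal sketch. The input format (a list of two-character genotype strings) and the function name `pattern_distribution` are assumptions made for this example; they are not taken from the paper or from the HapMap file format.

```python
from collections import Counter


def pattern_distribution(genotypes, minor):
    """Return [p_0, p_1, p_2] for one SNP in one population.

    genotypes: list of two-character strings such as "AA", "Aa", "aa"
               (a hypothetical input format chosen for this sketch).
    minor:     the character that denotes the minor allele.
    The pattern index equals the number of minor alleles an individual
    carries: 0 for 'AA', 1 for 'Aa', 2 for 'aa'.
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    counts = Counter(g.count(minor) for g in genotypes)
    n = len(genotypes)
    # Counter returns 0 for patterns that never occur.
    return [counts[i] / n for i in range(3)]


# Toy population of four individuals.
P = pattern_distribution(["AA", "Aa", "aa", "AA"], minor="a")
```

Indexing each pattern by its number of minor alleles keeps the result aligned with pattern0, pattern1 and pattern2 as defined in the text.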

Under Assumption 1 and the bisexual reproduction rule, the distribution among the next generation (denoted by P*) can be computed from the distribution P.
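One way to carry out this computation is sketched below. It assumes that random mating under Assumption 1 reduces to the standard Hardy-Weinberg proportions: each parent passes on allele 'A' with probability p_0 + p_1/2, and a child's two alleles are drawn independently. The function names are hypothetical, and this is an illustration consistent with the text's 2*0.5*0.5 example, not necessarily the exact formula used by the authors.

```python
def next_generation(p):
    """Return P* = [p*_0, p*_1, p*_2] from P = [p_0, p_1, p_2].

    Assumes random mating: a child receives one allele from each
    parent, and the two draws are independent.
    """
    p0, p1, p2 = p
    if abs(p0 + p1 + p2 - 1.0) > 1e-9:
        raise ValueError("pattern fractions must sum to 1")
    freq_major = p0 + 0.5 * p1  # probability that a parent passes on 'A'
    freq_minor = p2 + 0.5 * p1  # probability that a parent passes on 'a'
    return [freq_major ** 2, 2 * freq_major * freq_minor, freq_minor ** 2]


def expected_counts(p, n):
    """Expected number of individuals per pattern in a population of size n."""
    return [n * x for x in next_generation(p)]


# Numbers matching the text's example: allele frequencies of 0.5 each,
# no 'Aa' individuals observed, population size 1000.
p_star = next_generation([0.5, 0.0, 0.5])
counts = expected_counts([0.5, 0.0, 0.5], 1000)
```

With these inputs the expected count for pattern 'Aa' is 0.5*1000=500, the same figure the text derives by hand.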
If there is no major disaster, the distribution in a human population will not change radically. Under usual circumstances, P* can be treated as an approximation to the mean distribution of the current population. If p_i is 0 but p*_i is far from 0, the distribution may be abnormal. Suppose the size of population j is n_j. The

Table 1.

SNPs, genes, gene types, alleles, disease patterns and p-values are listed in Table 2.

For the first two SNPs and their patterns, the expectations of the number of individuals in each population are listed in Table 3 and

persistent infections starting in infancy or early childhood [22]. The disease is very serious and most affected individuals die in childhood [23].

2) SNP rs4915931 maps gene ROR1

In the ClinVar database [15], ROR1 is associated with malignant melanoma. Broome et al. [24] reported that ROR1 is a receptor tyrosine kinase expressed during embryogenesis, on chronic lymphocytic leukemia and in other malignancies. Hudecek et al. [25] found that ROR1 is highly expressed during early embryonic development but expressed at very low levels in adult tissues. Many papers reported that ROR1 has a very close relation with chronic lymphocytic leukemia [26,27] and acute lymphoblastic leukemia [28-32]. ROR1 has been suggested as a therapeutic target for human malignancies [33,34].

al. [41] found that DPP6 is potentially associated with many diseases such as cardiovascular diseases, endocrinological diseases, metabolic diseases, gastroenterological diseases, cancer, hematological diseases, inflammation, muscle skeleton diseases, neurological diseases, urological diseases, reproduction disorders and respiratory diseases.

Zhu et al. [42] reported that Inpp5f is a polyphosphoinositide phosphatase that regulates cardiac hypertrophic responsiveness. Kim et al. [43] found that INPP5F inhibits STAT3 activity and suppresses glioma tumorigenicity. Palermo et al. [44] reported that gene expression of INPP5F can serve as an independent prognostic marker in fludarabine-based therapy of chronic lymphocytic leukemia. Bai et al. [45] reported that alteration of Akt signaling plays an important role in diabetic cardiomyopathy.

Inpp5f is a negative regulator of Akt signaling.

CCHCR1 is up-regulated in skin cancer and associated with EGFR expression [46].

The CCHCR1 (HCR) gene is relevant for skin steroidogenesis and downregulated in cultured psoriatic keratinocytes [47].

7) SNP rs1552726 maps gene NLRP14

NLRP14 may play a regulatory role in the innate immune system [48]. According to the GeneCards database [49], mutations in the testis-specific NALP14 gene are reported in men suffering from spermatogenic failure. Westerveld et al. [13] collected the data of 157 patients and identified 25 suspicious variants in total: 1 nonsense mutation, 14 missense mutations, 6 silent mutations and 4 intronic variants. When these variants were checked with ESEfinder and SpliceSiteFinder, only SNP rs1552726 was predicted to affect correct splicing. Abe et al. [50] reported that germ-cell-specific inflammasome component NLRP14

al. [53] reported that Jag2 promotes uveal melanoma dissemination and growth. Vaish et al. [54] reported that JAG2 enhances tumorigenicity and chemoresistance of colorectal cancer cells.

CBFA2T3 are also associated with many genetic diseases. Overall, we think the results are encouraging and support our idea to some extent. The method can give a narrow range of suspicious pathogenic genes which deserve further study. As whole-genome sequencing advances and more data become available, the method can yield more accurate and interesting results. The method is a simple and cheap way to find suspicious pathogenic genes and SNPs.