DeepGenePrior: A deep learning model to prioritize genes affected by copy number variants

The genetic etiology of neurodevelopmental disorders is highly heterogeneous. These disorders are characterized by abnormalities in the development of the central nervous system that lead to diminished physical or intellectual capabilities. Determining which gene is the driver of disease (not just a passenger), termed 'gene prioritization,' remains an open problem. Genome-wide exploration of disease-gene associations is still underdeveloped because of its reliance on previous discoveries when spotting new genes and on auxiliary evidence sources with false positive or false negative relations. This paper introduces DeepGenePrior, a model based on deep neural networks that prioritizes candidate genes in Copy Number Variant (CNV)-mediated diseases. Building on the well-studied Variational AutoEncoder (VAE), we developed a score to measure the impact of genes on the target diseases. Unlike methods that use prior data on gene-disease associations to prioritize candidate genes (using the guilt-by-association principle), the current study relies exclusively on copy number variants. The procedure can therefore identify disease-associated genes regardless of prior knowledge or auxiliary data sources. We identified genes that distinguish cases from controls for three disorders (autism, schizophrenia, and developmental delay). A 12% increase in fold enrichment was observed in brain-expressed genes compared to previous studies, and a 15% increase in fold enrichment was found in genes associated with mouse nervous system phenotypes. We also explored sex dimorphism in these disorders and discovered genes that are overexpressed more in one sex than in the other. Additionally, we investigated the gene ontology of the putative genes with WebGestalt and the associations between the causative genes and the other phenotypes in the DECIPHER dataset.
Furthermore, some genes appeared jointly among the top genes associated with all three disorders in this study (i.e., autism spectrum disorder, schizophrenia, and developmental delay); namely, deletions in ZDHHC8, DGCR5, and CATG00000022283 were common among them. These findings suggest a common etiology for these clinically distinct conditions. With DeepGenePrior, we address the obstacles in existing gene prioritization studies: the model identifies promising candidate genes without prior knowledge of diseases or phenotypes, using deep learning.

Table 7. We used these data to analyze the relation between genes and other phenotypes; in addition, the data can be used for augmentation and pretraining of the system. Some genes that could not be an output of the model were also removed; these include genes that overlap more with controls than with cases and genes that do not overlap with any CNV.

A Formal overview of a gene prioritization system

If we view gene prioritization as a system, the input is a target disease plus the list of all genes. Depending on the nature of the method used to process the genes, different datasets may also be utilized as auxiliary input: for example, mutation data (SNPs or structural variants), protein networks, pathway data, or reliably causal genes related to the target disease (to use with the principle of 'guilt by association'). The output is the list of candidate genes, which may be sorted or unsorted; the result is then either a prioritization or a classification. A score may also indicate the likelihood that a gene is responsible for a phenotype (or a disease). The discriminatory algorithm tries to infer each gene's role in the target's incidence.
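As a minimal structural sketch of the interface just described (the function name, signature, and the uniform placeholder scoring are all hypothetical, not the paper's implementation):

```python
from typing import Optional

def prioritize_genes(target_disease: str,
                     genes: list,
                     auxiliary: Optional[dict] = None) -> list:
    """Return every candidate gene paired with a score estimating how
    likely it is to be responsible for the target disease.

    This placeholder scores all genes uniformly; a real method would
    derive scores from CNV data, networks, or pathways passed in via
    `auxiliary`.
    """
    scores = {g: 0.0 for g in genes}
    # Sorting by descending score makes the output a prioritization
    # rather than a mere classification.
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranked = prioritize_genes("autism", ["ZDHHC8", "DGCR5"])
```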

Each rare CNV is associated with an individual (identified by p_id), who is either healthy or a patient. Optionally, the dataset may provide auxiliary data for the individual, such as gender information, which helps us investigate the discriminatory role of genes in each gender. Our goal is to solve the gene prioritization problem using the set of rare CNVs.
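A sketch of one such record (field names and types are hypothetical; only p_id, the case/control status, and the optional gender field come from the text above):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RareCNV:
    """One rare CNV observed in one individual."""
    p_id: str                     # individual identifier
    chrom: str                    # chromosome, e.g. "chr22"
    start: int                    # CNV start position (bp)
    end: int                      # CNV end position (bp)
    cnv_type: str                 # "deletion" or "duplication"
    is_patient: bool              # True = patient, False = healthy
    gender: Optional[str] = None  # optional auxiliary data

cnv = RareCNV("P001", "chr22", 19_700_000, 21_000_000,
              "deletion", True, "F")
```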

The Method Overview

The main difference between an autoencoder and its variational version is that the former is deterministic, whereas the latter is probabilistic; a variational autoencoder is an autoencoder whose training is regularized to avoid overfitting.
The VAE is grounded in Bayesian inference, with a regularization constraint assuming that the latent representation follows a multivariate Gaussian, N(µ, σ).
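For reference, the standard VAE training objective (the negative evidence lower bound) combines a reconstruction term with a KL regularizer that pulls the approximate posterior q(z|x) = N(µ(x), σ(x)) toward the Gaussian prior p(z):

```latex
\mathcal{L}(\theta, \phi; x) =
  -\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  \;+\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```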

501
It has been shown that the VAE is more stable during training and has less vague output than other generative models, since it optimizes a precise, likelihood-based objective function [52].
The posterior is a Gaussian distribution whose output is a mean and a variance, and it has been shown that arbitrary functions can be approximated this way. The model encodes the inputs into a Gaussian distribution with estimated mean and covariance.

A deep learning model is proposed for this task. For each disease, the dataset contains copy number variants for patients and for healthy individuals. The set of copy number variants of an individual has some overlaps with genes, and these overlaps are the features fed into our deep learning model, as shown in Fig. 6. We have a list of genes for which we want to determine whether their expression affects disease incidence; on the other hand, we have a list of cases and controls with CNVs for a target disease. We want to cast this as a supervised learning problem.
For both healthy and patient individuals, we convert CNVs to genes by computing overlaps: for the set of genes, preprocessed as discussed before, we measure the length of overlap (in kbp) with the CNVs of each individual. The training label is whether the person is sick or healthy (one or zero).
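The overlap computation above can be sketched as follows (a minimal version assuming plain (start, end) intervals on one chromosome; function names are hypothetical):

```python
def overlap_kbp(cnv_start, cnv_end, gene_start, gene_end):
    """Length of the overlap between a CNV and a gene, in kilobase pairs."""
    return max(0, min(cnv_end, gene_end) - max(cnv_start, gene_start)) / 1000.0

def feature_vector(cnvs, genes):
    """One training row: for each gene, the total kbp of overlap
    with the individual's CNVs."""
    return [sum(overlap_kbp(cs, ce, gs, ge) for cs, ce in cnvs)
            for gs, ge in genes]

genes = [(1_000, 5_000), (10_000, 20_000)]   # (start, end) in bp
cnvs = [(2_000, 12_000)]                     # one CNV of the individual
row = feature_vector(cnvs, genes)            # [3.0, 2.0] kbp of overlap
label = 1                                    # patient = 1, healthy = 0
```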
In the pretraining phase of the model, we used the CNVs of all the neurodevelopmental disorders together (autism + schizophrenia + developmental delay). The CNVs of a specific disease are then used for fine-tuning. This can be considered a form of semi-supervised learning.
After our VAE has been fully trained, we use only its encoder for the next step:
1. Train a VAE using all our data points and transform the data (X) into the latent space (Z variables); all data are used in this step.
2. Solve a standard supervised learning problem on (Z, Y) pairs, where Y is the label set.
A graphical model and the learning algorithm for the whole process are shown in Fig. 9.
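The two steps above can be sketched in PyTorch. This is a structural illustration only: the layer sizes, epoch counts, and synthetic stand-in data are hypothetical and do not reflect the paper's actual architecture or dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_GENES, LATENT = 100, 8  # illustrative sizes; real input is one feature per gene

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(N_GENES, 32)
        self.mu = nn.Linear(32, LATENT)
        self.logvar = nn.Linear(32, LATENT)
        self.dec = nn.Sequential(nn.Linear(LATENT, 32), nn.ReLU(),
                                 nn.Linear(32, N_GENES))

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.mu(h), self.logvar(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

# Step 1: train the VAE on ALL individuals (synthetic stand-in data here).
X = torch.rand(64, N_GENES)               # gene-overlap features
Y = torch.randint(0, 2, (64, 1)).float()  # 1 = patient, 0 = healthy
vae = VAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
for _ in range(20):
    xhat, mu, logvar = vae(X)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    loss = F.mse_loss(xhat, X, reduction="sum") + kl
    opt.zero_grad(); loss.backward(); opt.step()

# Step 2: transform X into latent Z, then fit a supervised head on (Z, Y)
# with sigmoid output and binary cross-entropy, optimized with Adam.
with torch.no_grad():
    Z, _ = vae.encode(X)
clf = nn.Linear(LATENT, 1)
copt = torch.optim.Adam(clf.parameters(), lr=1e-2)
for _ in range(50):
    p = torch.sigmoid(clf(Z))
    bce = F.binary_cross_entropy(p, Y)
    copt.zero_grad(); bce.backward(); copt.step()
```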
Let us suppose that the encoder weights are represented by W^(m)_ij, where m is the layer number, i is the output size of the previous layer, and j is the input size of the current layer (no connection is indicated by zero). As we know, the final layer attached to the encoder is the label, and its size is one (the individual is a patient (= one) or healthy (= zero)).
If we multiply all the weight matrices together, the result has size input size × 1 (the matrices are multiplicable, since the output size of each layer equals the input size of the next). The resulting matrix (specifically, a column vector) can rank genes according to the label (the status of the disease), and this is exactly what we want to model. The formulation is as follows:

score = W^(1) · W^(2) · … · W^(M)

The deep learning model is specified so that a binary classification task is accomplished: the final layer has a binary outcome, the last activation function is sigmoid, the loss function is binary cross-entropy, and the optimization algorithm is Adam.
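The weight-multiplication ranking above can be sketched with NumPy. The toy shapes (a three-matrix chain from genes to the single label output) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 6

# Toy weight matrices: genes -> hidden (6x4) -> latent (4x3) -> label (3x1).
W1 = rng.normal(size=(n_genes, 4))
W2 = rng.normal(size=(4, 3))
w_label = rng.normal(size=(3, 1))

# Product of all weight matrices: a column vector of size (input size x 1),
# one entry per gene, measuring its aggregate linear influence on the label.
scores = W1 @ W2 @ w_label

# Rank genes by descending score.
ranking = np.argsort(-scores.ravel())
```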

585
The Implementation Details