Computational and experimental hunt for expansion prone tandem CNG repeats in 2 human genomes

bioRxiv preprint doi: https://doi.org/10.1101/2023.05.03.539188; this version posted May 3, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Introduction:
Spinocerebellar ataxias (SCAs) are a group of neurodegenerative disorders. The most common genetic mutation in SCAs is a repeat expansion in a coding or noncoding gene region. To date, more than 40 repeat loci have been reported in association with ataxic dysfunction, yet 20-60% of cases worldwide remain genetically unclassified (1) (Fig 1). The available literature, various databases, and our in-house data point to CNG repeats as the most prevalent cause of SCA in patients with autosomal dominant inheritance. Prioritization of such loci is therefore essential for successful candidate gene identification. Most SCA subtypes are specific to a geographical region, and different countries face different subtypes. In the north Indian SCA cohort, SCA 2 and SCA 12 are the most common autosomal dominant cerebellar ataxias (ADCAs) and FRDA is the most common autosomal recessive cerebellar ataxia (ARCA) (2), whereas SCA 3 is the most common subtype worldwide (3). About 60% of Indian spinocerebellar ataxia patients are still genetically uncharacterized. This raises the possibility that other trinucleotide repeat expansions are responsible for these cases.
Identifying novel repeat loci is very difficult because of the rarity of the disease, the clinical heterogeneity of symptoms, the difficulty of amplifying repeat regions or even detecting them in short-read sequencing data, the high cost of long-read sequencing, and other factors.
In 2004, Pandey et al. took a different approach: they computationally reviewed all CAG repeats in the genome and identified two CAG loci as putative candidates for SCA (4). In this study, we combined computational and genetic approaches to identify possible disease-causing unstable repeat loci in our population.
Control - The control samples (N=100) were obtained from the DNA repository of the Indian Genome Variation Consortium project (5). We divided the analysis into two stages (Fig 1), with the first focusing on finding the genome's unstable CNG sites and the second trying to find out .

A. In-silico identification of repeat loci in the human reference genome
The repeat sequences CAG, CTG, CCG, CGG, and GCC, which have been reported for various ADCA sub-types, were considered for this study. The Python modules "Seq" and "SeqIO" from the "Bio" package were used to read chromosome-wise FASTA sequences, and regular expressions (RegEx) were used to find uninterrupted runs of each repeat unit.
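The pattern search described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script. It uses only Python's standard `re` module on an in-memory string, whereas the study read chromosome-wise FASTA files with Biopython. The function name `find_repeats` and the default minimum of four units are assumptions made for the example.

```python
import re

# Repeat units named in the text.
UNITS = ("CAG", "CTG", "CCG", "CGG", "GCC")


def find_repeats(seq, units=UNITS, min_units=4):
    """Return (start, end, unit, n_units) for each uninterrupted run.

    Coordinates are 0-based and half-open. min_units is an assumed
    cut-off, not a value confirmed by the source.
    """
    seq = seq.upper()
    hits = []
    for unit in units:
        # Non-capturing group repeated at least min_units times.
        pattern = re.compile(r"(?:%s){%d,}" % (unit, min_units))
        for match in pattern.finditer(seq):
            n_units = (match.end() - match.start()) // len(unit)
            hits.append((match.start(), match.end(), unit, n_units))
    return sorted(hits)


# Toy sequence: five CAG units, a spacer, then four CGG units.
toy = "TTT" + "CAG" * 5 + "AAA" + "CGG" * 4 + "T"
toy_hits = find_repeats(toy)
```

Because CCG, CGG, and GCC are rotations or complements of one another, a single GC-rich tract can be reported under more than one unit, so overlapping hits would need to be merged before counting loci.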

Sub-grouping of repeats on the basis of repeat number - The program yielded 15069 repeats with more than 3 continuous repeat units. For downstream analysis, the repeats were categorized into three sub-types based on repeat length (Table 1). After classification, the chromosomal location of each repeat was annotated with gene information using ANNOVAR (6) to determine its genomic feature. The annotated repeat groups were then further classified based on their functional significance.

B. Identification of unstable CNG repeat loci
To determine the polymorphic status of the repeats in our population, we screened all 52 selected CNG repeat loci in 100 control samples. To keep the study cost-effective, we added an M13-specific nucleotide tag sequence to the 5' end of every forward primer. Each PCR reaction therefore used three primers: the forward primer (FP), the reverse primer (RP), and a fluorescently labelled M13 tag primer.
For PCR amplification of the 52 selected loci, we used different master mixes (Epicentre's FailSafe mixes or Promega master mix) along with 25 ng of DNA, 0.1 µl of forward primer, 0.4 µl of reverse primer, and 0.4 µl of M13 tag primer, each at a working concentration of 10 pM/µl, in a 10 µl reaction volume. The PCR conditions were 95 °C for 3 min, followed by 35 cycles of denaturation at 95 °C for 30 sec, annealing at 60 °C for 30 sec, and extension at 72 °C for 30 sec, followed by a final extension at 72 °C for 5 min. The samples were analysed on a fragment analyser and visualized in GeneMapper software (version 4, Applied Biosystems). After calculating the repeat number at every locus, we found two types of repeat status: (i) stable loci and (ii) unstable loci (more than ±3 repeats of variability). The highly polymorphic loci (Supplementary Table 1) were then selected for screening in patient samples.
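The repeat-number calculation and the stable/unstable call can be illustrated with a short sketch. It is an interpretation under stated assumptions, not the authors' procedure. It assumes the repeat number is obtained by subtracting the non-repeat flanking length and the M13 tag length from the fragment size, and that "unstable" means at least one allele differs from the modal repeat number by more than 3 units. The function names, the 18 bp tag length, and the example sizes are hypothetical.

```python
from collections import Counter


def repeat_count(fragment_bp, flank_bp, tag_bp=18, unit_bp=3):
    """Estimate the number of repeat units from an amplicon size.

    fragment_bp: fragment size called by the fragment analyser.
    flank_bp:    amplicon length excluding the repeat tract and the tag.
    tag_bp:      length of the M13 tag (18 bp is an assumption).
    """
    return round((fragment_bp - flank_bp - tag_bp) / unit_bp)


def is_unstable(allele_repeats, max_shift=3):
    """True if any allele deviates from the modal repeat number
    by more than max_shift units (one reading of '> ±3 repeats')."""
    mode = Counter(allele_repeats).most_common(1)[0][0]
    return any(abs(n - mode) > max_shift for n in allele_repeats)


# Hypothetical locus with 140 bp of flanking sequence.
n_units = repeat_count(fragment_bp=200, flank_bp=140)
stable_call = is_unstable([10, 10, 11, 9, 10])
unstable_call = is_unstable([10, 10, 10, 15])
```

Other readings of the threshold are possible, for example comparing the full allele range rather than the distance from the mode, and they would classify some loci differently.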

Results
Genome-wide CNG repeat selection yielded a total of 15069 loci. CNG repeats were abundant in coding and UTR regions. The CGG and CCG repeats were present mostly in the 5' UTR of genes (Fig 2).
We screened all 52 loci in our control samples to determine the repeat status in our population. Of these, 33 loci were stable and 19 loci were polymorphic (Table 1). The unstable targets RAI1, UMAD1, GLS, HTR7P1, CNKSR2, MAML3, MED15, MLLT3, USF3, MEF2A, MIR205HG, NCOR2, RPL14, JPH3, MAB21L1, ANKUB1, ERF, GIPC1, and EP400 were further screened in the patient cohort to look for any large repeat variability (Fig 3). The heterozygosity indexes (HI) of UMAD1, MAB21L1, ANKUB1, GLS, and RPL14 were greater than 0.7 in both cases and controls, showing that the repeat loci in these genes are highly polymorphic. In contrast, MLLT3 and CNKSR2 were barely polymorphic and mostly homozygous (HI ≤ 0.1) in both groups. Most of the remaining target loci had an HI between 0.3 and 0.7, except ERF, which also had a low HI of less than 0.25 in all samples.
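The text does not state how the heterozygosity index was computed. The sketch below shows two common definitions as an illustration only: observed heterozygosity (the fraction of individuals carrying two different repeat lengths) and expected heterozygosity (one minus the sum of squared allele frequencies). The function names and the genotype data are hypothetical.

```python
from collections import Counter


def observed_heterozygosity(genotypes):
    """Fraction of individuals whose two alleles differ in repeat number."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    heterozygous = sum(1 for a, b in genotypes if a != b)
    return heterozygous / len(genotypes)


def expected_heterozygosity(genotypes):
    """1 - sum(p_i^2), where p_i is the frequency of repeat length i."""
    alleles = [allele for pair in genotypes for allele in pair]
    if not alleles:
        raise ValueError("no genotypes supplied")
    total = len(alleles)
    counts = Counter(alleles)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Hypothetical genotypes: (repeat number allele 1, repeat number allele 2).
example = [(10, 10), (10, 12), (11, 12), (10, 10)]
h_obs = observed_heterozygosity(example)
h_exp = expected_heterozygosity(example)
```

Under either definition, a locus at which nearly every individual is homozygous for the same repeat length gives a value near 0, which matches the description of MLLT3 and CNKSR2 above.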
The GTEx expression database (16) showed that CNKSR2, MAB21L1, USF3, RAI1, NCOR2, JPH3, MAML3, EP400, and GLS had significantly high expression in the brain, particularly the cerebellum. All the other genes also showed significant expression in the brain except MIR205HG (Table 1). Because SCA is a disorder of the brain, we excluded MIR205HG from the shortlist; we therefore propose the remaining 18 genes as candidates whose repeat loci might be pathogenic and produce an ataxia phenotype.

Conclusion
Repeat instability is an underlying mutational mechanism for several neurodegenerative disorders in humans. Understanding how repeat instability leads to disease manifestation has always been challenging. Several distinct hypotheses on repeat expansion have been proposed over the years, but its mechanism is still not fully understood. Repeat expansions are the most prevalent genetic cause of spinocerebellar ataxia worldwide.
The initial, computational phase of our study yielded 52 suspected CNG repeat loci from various genes for further investigation in the Indian control population. Screening the control samples with a cost-effective fluorescent PCR-based fragment analysis approach identified 19 highly polymorphic repeat targets.
Genetic markers for the same disorder have been shown to vary among populations. Hence, some diseases and genetic markers are population specific.
In various populations, it has been reported that CAG repeat variation in MEF2A is a risk factor for coronary artery disease (CAD) (13). A study published in 2020 suggested that a CGG repeat expansion in the 5' UTR of GIPC1 causes oculopharyngodistal myopathy (OPDM), an adult-onset inherited neuromuscular disorder (13). A large GCA tandem expansion in the 5' UTR of the GLS gene causes overall developmental delay, progressive ataxia, and elevated levels of glutamine (14). The reported studies of GLS, GIPC1, MED15, RAI1, and MEF2A involve the same candidate loci that we identified in our study. This evidence supports our study: although we did not find any large repeat expansion, the approach is a step in the right direction for the discovery of novel targets.

Limitations of the study:
1. Repeat units: Because hundreds of samples had to be screened for a large number of target loci, we considered only CNG repeats in coding and UTR regions with at least 10 continuous repeats. Considering other tri-, tetra-, penta-, and hexanucleotide repeat units and loci with lower repeat numbers would increase the chances of finding causal mutations.
2. Lower sample size: We took 100 patient samples for the study. SCA is a rare disorder and its subtypes are rarer still, so a larger sample size would give more confidence in our hypothesis.
3. Unavailability of different population samples: It is well known that most SCA subtypes are geographic and population specific. In this study we considered only north Indian SCA patient samples; a multi-population study could improve the chances of finding a causal mutation among the studied genes.

This study highlights the importance of the population polymorphism approach for understanding the genetic background and mechanism of tandem repeat instability in ataxia-like neurological disorders. The role of other repetitive sequences in both coding and non-coding regions in neurological disorders can be explored with the computational and polymorphism approaches taken in this work.

Fig 1: Illustration of short tandem repeat [STR] expansion disorders on the basis of their

Fig 2: Classification of CNG repeat groups on the basis of functional regions in the genome

Table 1: Classification of different CNG repeats on the basis of their size and functional domains

Selection of the repeat loci - The literature and our in-house data suggested that most ADCAs and late-onset SCAs are associated with polymorphic repeat expansions in coding and UTR regions. Because the number of target loci is too high to screen with conventional methods, we focused on large CNG repeat loci (≥10 continuous repeats) in these regions and finally selected 52 loci.

Fig 3: Repeat distribution of target repeats among control and patient samples

Table 1: Polymorphic status of selected loci in controls and patients, heterozygosity index, 1000 Genomes database, and GTEx brain expression

Table 2: Repeat distribution of targeted repeat loci among different populations, our control and patient samples. *Repeat Mode (Minimum repeats - maximum repeats)