Machine learning predicts translation initiation sites in neurologic diseases with expanded repeats

A number of neurologic diseases, including a form of amyotrophic lateral sclerosis and others associated with expanded nucleotide repeats, exhibit an unconventional form of translation called repeat-associated non-AUG (RAN) translation. Repeat protein products accumulate and are hypothesized to contribute to disease pathogenesis. It has been speculated that the repeat regions in the RNA fold into secondary structures in a length-dependent manner, promoting RAN translation. Additionally, nucleotides that flank the repeat region, especially the ones closest to the initiation site, are believed to enhance translation initiation. Recently, a machine learning model based on a large number of flanking nucleotides has been proposed for identifying translation initiation sites. However, most likely due to its extensive feature selection and limited training data, the model has diminished predictive power. Here, we overcome this limitation and increase prediction accuracy by a) capturing the effect of nucleotides most critical for translation initiation via feature reduction, b) implementing an alternative machine learning algorithm better suited for limited data, c) building comprehensive and balanced training data (via sampling without replacement) that includes previously unavailable sequences, and d) splitting ATG and near-cognate translation initiation codon data to train two separate models. We also design a supplementary scoring system to provide an additional prognostic assessment of model predictions. The resultant models have high performance, with 85.00-87.79% accuracy, exceeding that of the previously published model by >18%. The models presented here are then used to identify translation initiation sites in genes associated with a number of neurologic repeat expansion disorders. The results confirm a number of experimentally discovered sites of translation initiation upstream of the expanded repeats and predict many sites that are not yet established.


Introduction
predictive models described in our investigations have 85.00-87.79% accuracy that exceeds that

this motif is typically accepted as the conserved pattern of the following underlined nucleotides bordering the AUG codon: CCRCCAUGG. The nucleotide designated by R is a purine, most typically adenine [13].

The sequence logo of the KCS (Fig 1) has been used to produce weighted scorings of identified

We designed a weighted scoring algorithm based on the KCS sequence logo and the ten bases analyzed, adenine is conserved in about 47% of cases at the position versus that of guanine, with about 37% conservation. Both the bioinformatics study and a publication analyzing peptide translation from CUG-initiating mRNA constructs show enhanced translation when guanine is at the +4 position (1 base downstream of the initiation codon) [18,25]. In the KCS, guanine is most conserved at the +4 position as well.
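The idea of a weighted score built from a sequence logo can be illustrated with a minimal sketch. This is not the published scoring algorithm: the table `HYPOTHETICAL_WEIGHTS` and the function `similarity_score` are invented for illustration, the weight values are made up, and only the consensus positions of CCRCC(AUG)G are encoded. The sketch assumes that each flanking base contributes its weight and that the sum is normalized by the best achievable sum.

```python
# Hypothetical per-position weights standing in for sequence-logo heights.
# Keys are positions relative to the first base of the codon (+1); there is
# no position 0. Only the consensus CCRCC(AUG)G is encoded, values are made up.
HYPOTHETICAL_WEIGHTS = {
    -5: {"C": 0.2},
    -4: {"C": 0.3},
    -3: {"A": 1.0, "G": 0.8},  # R = purine, adenine most typical
    -2: {"C": 0.4},
    -1: {"C": 0.5},
    4: {"G": 0.9},             # +4 = first base after the codon
}


def similarity_score(upstream: str, downstream: str) -> float:
    """Return a 0..1 score for the bases flanking a candidate codon.

    upstream   -- bases immediately 5' of the codon (last char is position -1)
    downstream -- bases immediately 3' of the codon (first char is position +4)
    """
    upstream = upstream.upper().replace("U", "T")
    downstream = downstream.upper().replace("U", "T")
    total = 0.0
    best = 0.0
    for pos, weights in HYPOTHETICAL_WEIGHTS.items():
        best += max(weights.values())
        if pos < 0:
            idx = len(upstream) + pos   # -1 maps to the last character
            base = upstream[idx] if 0 <= idx < len(upstream) else None
        else:
            idx = pos - 4               # +4 maps to the first character
            base = downstream[idx] if 0 <= idx < len(downstream) else None
        total += weights.get(base, 0.0)  # absent or non-consensus base adds 0
    return round(total / best, 2)        # two decimals


print(similarity_score("CCACC", "G"))
print(similarity_score("CCGCC", "G"))
```

With these made-up weights, a full consensus context scores 1.0, and replacing adenine with guanine at position -3 lowers the score slightly.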

Because of these similarities, we decided to apply the algorithm to score known near-cognate codons that have been shown to initiate translation (Fig 3).

to process by a human, and learn from mistakes to improve over time [29]. Although biological pathways are often sophisticated and produce remarkably diverse data, machine learning models can provide direction for such processes that are not completely understood.

We decided to implement a random forest classifier (RFC)

In addition to feature reduction, our implementation of the random forest classifier, which is more robust to outliers and erroneous instances (especially when data is limited), creation of two models to account for properties of different data types (i.e., ATG codons versus near-cognate codons), and use of sampling without replacement, which preserves natural variations found in data (in place of bootstrapping), could explain our improved model performance.

other models designed for the same function, including TITER, which they exceed by more than 18% in accuracy. As a next step, we decided to use the RFCs to identify repeat-length-independent (RLI) translation initiation associated with neurologic diseases.

To do this, the RFC models were implemented in software. Developed in Python, the program could be used to evaluate a total sequence consisting of the upstream region, followed by ten nucleotide sequence repeats to represent the repeat expansion. Ten sequence repeats may be

instances. Thus, these six near-cognate codons are designated only with color-coding without bolding to denote that they should be acknowledged with less confidence. If there is an overlap between predicted initiation codons (i.e., one or two nucleotides overlap between predicted codons), the color of the overlapped region is the same as that of the next predicted codon to prevent confusion. The overlapped region may or may not be bolded depending on whether the software was trained on this next codon. We also output the KSSs of each predicted codon to two decimal places, as the score could be a useful metric to evaluate translation initiation likelihood.

This may be approximated by comparing the KSS of a codon with the reference table and graph (Fig 6). (Table 1).
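The input described above, an upstream region followed by ten copies of the repeat unit, can be assembled and scanned for candidate codons as in the sketch below. The function names `build_query` and `candidate_codons` are hypothetical and do not come from the published software. The near-cognate set is simply the nine codons that differ from ATG at one position, whether or not the models were trained on each of them.

```python
START = "ATG"
# The nine codons that differ from ATG at exactly one position.
NEAR_COGNATE = {"CTG", "GTG", "TTG", "AAG", "ACG", "AGG", "ATA", "ATC", "ATT"}


def build_query(upstream: str, repeat_unit: str, n_repeats: int = 10) -> str:
    """Concatenate the upstream region with n_repeats copies of the repeat."""
    return upstream.upper() + repeat_unit.upper() * n_repeats


def candidate_codons(sequence: str):
    """Yield (index, codon, kind) for every ATG or near-cognate codon."""
    for i in range(len(sequence) - 2):
        codon = sequence[i:i + 3]
        if codon == START:
            yield i, codon, "ATG"
        elif codon in NEAR_COGNATE:
            yield i, codon, "near-cognate"


query = build_query("ccatgg", "CTG")
for index, codon, kind in candidate_codons(query):
    print(index, codon, kind)
```

Each candidate found this way would then be passed, with its flanking bases, to the ATG model or the near-cognate model as appropriate.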

Comparison between the predictions and experimentally identified translation initiation codons demonstrated high performance of the software. In fact, all translation initiation sites previously

confidence in predictions, out of concern they may be false positives.

to make predictions for translation initiation codons for other genes with repeats associated with neurologic repeat diseases, HTT, and DM2 (Fig 12). Predicted translation initiation codons with relatively high KSSs were noted for all analyzed genes (Table 2). In all cases, predicted translation initiation sites are not shown if they have a downstream stop codon located in the same reading frame before the repeat.
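The stop-codon filter in the last sentence can be expressed as a short check. The sketch below is an illustration under stated assumptions, not the published implementation: the function name `has_inframe_stop_before_repeat` is hypothetical, indices are 0-based, and only stop codons lying entirely upstream of the repeat are counted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_inframe_stop_before_repeat(sequence: str, codon_index: int,
                                   repeat_start: int) -> bool:
    """True if a stop codon lies in the candidate's frame before the repeat.

    codon_index  -- 0-based index of the first base of the candidate codon
    repeat_start -- 0-based index of the first base of the repeat
    """
    seq = sequence.upper()
    # Step through codons in the candidate's reading frame, starting right
    # after the candidate and stopping before any codon that enters the repeat.
    for i in range(codon_index + 3, repeat_start - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return True
    return False


example = "ATGAAATAGCCC" + "CTG" * 3
print(has_inframe_stop_before_repeat(example, 0, 12))
```

A candidate for which this check returns True would be hidden, since translation starting there would terminate before reaching the repeat.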

* Every CTG within the repeat is predicted to possibly initiate translation.

† Every CTG within the repeat, aside from the first one, is predicted to possibly initiate translation.

* Every CTG within the repeat, aside from the first one, is predicted to possibly initiate translation.

† Two predicted translation initiation codons are within the repeat.

(Table 1). Namely, there is no predicted ATG located 17 bases upstream of the repeat expansion, nor a predicted ATT

As shown here, RFCs were able to successfully predict most of the experimentally identified translation initiation codons associated with neurologic repeat expansion diseases. The same models also predicted other codons, not yet identified experimentally, to initiate translation of repeat expansions in neurologic diseases. Of note, this software predicted translation initiation sites with more than 18% higher accuracy than the TITER neural network.

Regardless of the quality of a model, its predictions should not be interpreted as evidence. Instead, predictions should be recognized as likely possibilities that warrant further investigation. The significance of the algorithm's identification of translation initiation codons, however, should not be understated. For example, these data may prove important for guiding treatment of these repeat diseases.

Although the machine learning models show promise for understanding the pathogenesis of repeat expansion neurologic disorders, their use may be extended to other applications as well. For example, they may be used to predict translation initiation sites that are not involved in repeat expansion disorders. One benefit of this implementation is the ability to infer likely protein products from a nucleotide sequence quickly, easily, and without laboratory procedures.

Using the open-source package imbalanced-learn in Python, we created the RFC models [33].

The ATG RFC was trained on an imbalanced set of 12,432 ATG codons known to initiate translation (positives) and 3,261 ATG codons that are believed not to initiate translation (negatives). The set of 3,261 negatives consisted of 1,716 sequences that were not missing nucleotides, and 1,545 (ten percent fewer) randomly sampled negatives of the remaining 31,697 that were missing nucleotides. To clarify, missing nucleotides are registered when a recorded codon is located very close to the 5' or 3' terminus of an mRNA construct. In such a circumstance, there may not be a full ten bases both preceding and following the codon.
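To make the notion of missing nucleotides concrete, the sketch below extracts the ten bases on each side of a codon and reports how many are absent. It is only an illustration: the function name `flank_window` and the `N` placeholder are assumptions, and how the published models actually encode an absent base is not stated here.

```python
FLANK = 10      # bases taken on each side of the codon
MISSING = "N"   # hypothetical placeholder for an absent base


def flank_window(sequence: str, codon_index: int, flank: int = FLANK):
    """Return (upstream, downstream, n_missing) for the codon at codon_index.

    codon_index is the 0-based index of the first base of the codon.
    """
    seq = sequence.upper()
    up = seq[max(0, codon_index - flank):codon_index]
    down = seq[codon_index + 3:codon_index + 3 + flank]
    n_missing = (flank - len(up)) + (flank - len(down))
    up = MISSING * (flank - len(up)) + up          # pad on the 5' side
    down = down + MISSING * (flank - len(down))    # pad on the 3' side
    return up, down, n_missing


print(flank_window("CCATGGCC", 2))
```

A codon with `n_missing` greater than zero corresponds to what the text calls a sequence that is missing nucleotides.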

The sampling technique was performed to slightly offset the proportion of negatives with and without missing bases in the opposite direction. In this way, more negatives without missing bases would be used for model training. Using the original imbalanced set of negatives, with the majority missing bases, would cause the model to inaccurately assess the effect of missing nucleotides on a codon's ability to initiate translation. Furthermore, using a slightly larger proportion of negatives that had a complete sequence profile resulted in improved accuracy for distinguishing codons that were not missing nucleotides. This is useful, as sequences with missing nucleotides are encountered less often in real-world applications.
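The construction of the negative set can be sketched with the standard library alone. The function `sample_negatives` and its parameters are hypothetical, and the integer rounding is an assumption chosen so that 1,716 complete negatives yield the 1,545 incomplete ones reported in the text.

```python
import random


def sample_negatives(complete, incomplete, percent_fewer=10, seed=0):
    """Keep every complete negative and draw slightly fewer incomplete ones.

    Sampling is without replacement, so no incomplete negative is used twice.
    """
    rng = random.Random(seed)
    complete, incomplete = list(complete), list(incomplete)
    # Assumed rounding: subtract the whole-number ten percent of the count.
    n_draw = len(complete) - (len(complete) * percent_fewer) // 100
    n_draw = min(n_draw, len(incomplete))
    return complete + rng.sample(incomplete, n_draw)


# Integer IDs stand in for codon records.
complete_ids = range(1716)
incomplete_ids = range(100000, 131697)
negatives = sample_negatives(complete_ids, incomplete_ids)
print(len(negatives))
```

Fixing the seed makes the draw reproducible; a different seed would select a different subset of the incomplete negatives.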

To account for the imbalance of positives and negatives, the RFC had decision trees generated from 3,576 negatives, and the same number of randomly sampled positives. One thousand such trees were used, since this number is generally recommended as a starting point for the

Each decision tree also had the requirement of grouping at least two codon instances to a certain classification. This constraint reduced the risk of overfitting, yet still allowed tree capacity to differentiate between subtly differing codons. Thus, the trees could better identify precise feature patterns to associate with a particular classification, and remain reliable in the face of new, unencountered data.

We evaluated the accuracy of the RFC model with the above configurations. Parameters such as the minimum number of codons to group for classification could then be adjusted to improve predictive power, as necessary. However, parameters were best left unchanged for optimal predictions. To create a separate classifier for near-cognate codons, we repeated the procedure used for ATG codons, this time using data available for all near-cognate codons.
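The per-tree balancing described above can be sketched without any machine learning library. This is an illustration only: `balanced_tree_samples` is a hypothetical helper, it assumes the negatives are the smaller class, and it covers the sampling step rather than tree fitting. In imbalanced-learn, `BalancedRandomForestClassifier` exposes `n_estimators`, `min_samples_leaf` and `replacement` parameters that map onto the settings in the text, though whether that exact class was used is not stated here.

```python
import random


def balanced_tree_samples(positives, negatives, n_trees=1000, seed=0):
    """Yield one balanced (positives, negatives) training set per tree.

    Each set holds every instance of the smaller class and an equal number
    from the larger class, drawn without replacement, so no instance is
    duplicated within a tree.
    """
    rng = random.Random(seed)
    positives, negatives = list(positives), list(negatives)
    k = min(len(positives), len(negatives))
    for _ in range(n_trees):
        yield rng.sample(positives, k), rng.sample(negatives, k)


# Integer IDs stand in for codon records; three trees keep the demo short.
for pos, neg in balanced_tree_samples(range(12432), range(3576), n_trees=3):
    print(len(pos), len(neg))
```

Because each tree sees a different random subset of positives, the forest as a whole can use far more positives than any single tree does.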

Accessibility and implementation
The software is publicly accessible as an interactive website at www.tispredictor.com.