Tools that infer an organism’s optimal growth temperature (OGT) from sequence data have both biological and economic implications. Predicting growth temperature can improve our understanding of how individual proteins and whole organisms evolve and adapt to their environment, and can provide insight into how the proteome affects organism fitness. Ideally, protein structure would be used to understand how thermodynamic stability across the whole proteome affects an organism’s fitness. However, determining protein structure is difficult both computationally and experimentally, so it is useful to predict stability from protein and sequence characteristics instead. As sequence data become more readily available and computational power increases, statistical and computational methods have been developed to identify protein-specific features that affect thermostability. These include linear regression and Bayesian approaches as well as machine learning models such as random forests and neural networks (Jensen et al., 2012; Sauer & Wang, 2018).
Many models developed to predict OGT use protein or whole-proteome features as predictors. Since our goal is to identify protein features related to OGT, we need an independent method to predict OGT for new organisms. For this reason, we are developing a model that predicts OGT using only tRNA features. tRNAs are protein-independent, temperature-sensitive genomic elements shared across all domains of life. Most mutations in tRNAs increase temperature sensitivity and result in partial or complete loss of tRNA function, suggesting that tRNAs are highly adapted to the optimal temperature of their organism (Payea et al., 2018). Our ‘tRNA thermometer’ model uses a convolutional neural network (CNN) and tRNA DNA sequences to predict OGT. The model achieves an r^2 of 0.86 (Figure 1), which is comparable to models in the literature despite a substantial reduction in the quantity of input data: it requires only about 4,000 nucleotides of sequence, roughly 0.1% of the genome.
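To illustrate how a tRNA DNA sequence might be presented to a CNN, the following is a minimal sketch of one-hot encoding. The channel order (A, C, G, T), the fixed length of 100 positions, the zero-padding, and the function name `one_hot_encode` are assumptions made for illustration; the text above does not specify the encoding actually used.

```python
from typing import List

# Assumed channel order; the actual encoding used by the model is not specified.
ALPHABET = "ACGT"


def one_hot_encode(seq: str, length: int = 100) -> List[List[int]]:
    """Encode a DNA sequence as a `length` x 4 one-hot matrix.

    Sequences longer than `length` are truncated and shorter ones are
    zero-padded at the end. Both choices are illustrative assumptions.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    matrix: List[List[int]] = []
    for base in seq.upper()[:length]:
        row = [0] * len(ALPHABET)
        if base in index:  # ambiguous bases such as N stay all-zero
            row[index[base]] = 1
        matrix.append(row)
    # Pad with all-zero rows so that every tRNA has the same shape.
    matrix.extend([0] * len(ALPHABET) for _ in range(length - len(matrix)))
    return matrix


if __name__ == "__main__":
    # Toy sequence, not a real tRNA gene.
    encoded = one_hot_encode("GCNA", length=6)
    for row in encoded:
        print(row)
```

A fixed-shape input of this kind lets the convolutional filters scan every tRNA in the same way, so the network can extract sequence features without hand-crafted predictors.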
The tRNA thermometer model also yielded insights into which parts of a tRNA matter most for OGT prediction. Predictions were highly correlated with tRNA GC content and with the minimum free energy of folding (MFE), which suggests that the CNN captures secondary structure even though we do not explicitly provide it with secondary structure information. The model also focuses most of its attention on the tRNA T arm and anticodon arm (Figure 2). Mutating the T arm led to significantly higher variance in the model’s OGT predictions, further suggesting that the CNN learned secondary and even tertiary structure characteristics (interactions between the T arm and the D arm are important for folding the tRNA into its proper three-dimensional shape).
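The relationship between GC content and predicted OGT described above can be checked with a few lines of code. The following is a minimal sketch, assuming that GC content is the fraction of G and C bases in a sequence and that the correlation is a Pearson coefficient; the text does not state which correlation measure was used. The function names and the toy values in the usage example are hypothetical and are not data from this study.

```python
from math import sqrt
from typing import Sequence


def gc_content(seq: str) -> float:
    """Fraction of bases in `seq` that are G or C."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Invented toy inputs for illustration only; not results from the model.
    toy_sequences = ["ATATGCAT", "GCGCATAT", "GCGCGCAT", "GCGCGCGC"]
    toy_predicted_ogt = [25.0, 37.0, 60.0, 85.0]
    gc_values = [gc_content(s) for s in toy_sequences]
    print(gc_values)
    print(pearson_r(gc_values, toy_predicted_ogt))
```

The same `pearson_r` helper could be applied to MFE values computed with an RNA folding tool, which is outside the scope of this sketch.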
The benefits of these models are threefold. First, the number of sequenced prokaryotes is growing, but additional information is often unavailable for these species, and developing culture protocols for new species can be challenging. An estimate of the likely OGT gives labs a starting point for developing culture protocols and studying these species further. Knowing OGT may also be useful in industrial processes that require thermostable proteins, as it can indicate which species’ proteins are likely to be suitable. Second, by using only tRNA sequences, we created a highly focused model whose OGT predictions are independent of other cellular components. This is beneficial for downstream comparisons of temperature effects on protein, DNA, or other RNA features of the cell. Third, by using sequence data as direct inputs to the CNN, we took advantage of automatic feature extraction and allowed the model to determine which tRNA features were most relevant. These models will be used to predict OGT in further investigations of protein thermal stability.
Figure 1: CNN model performance when the data are split randomly (A-C) and by phylogenetic distance (D-F). Purple = bacterial species, orange = archaeal species.
Figure 2: Average model attention across the tRNA for the model trained on Archaea data with a phylogenetic data split. (A) Mean and standard error for the relative proportion of attention paid to each nucleotide in the tRNA, averaged across all tRNAs. Colors indicate the average positions of the stem-loop structure across all tRNAs: purple=acceptor stem, green=D arm, orange=anticodon arm, blue=T arm. The dark orange bar within the anticodon arm shows the position of the anticodon. (B) Average percent attention paid to each arm as a whole; Acc = acceptor, D = D arm, AC = anticodon arm, T = T arm.
Jensen, D. B., Vesth, T. C., Hallin, P. F., Pedersen, A. G., & Ussery, D. W. (2012). Bayesian prediction of bacterial growth temperature range based on genome sequences. BMC Genomics, 13(Suppl 7), S3.
Payea, M. J., Sloma, M. F., Kon, Y., Young, D. L., Guy, M. P., Zhang, X., … Phizicky, E. M. (2018). Widespread temperature sensitivity and tRNA decay due to mutations in a yeast tRNA. RNA, 24(3), 410–422.
Sauer, D., & Wang, D.-N. (2018). Prediction of optimal growth temperature using only genome derived features. bioRxiv, 1–27.