Student Rutgers, The State University of New Jersey Edison, New Jersey, United States
Vijay Anand (Rutgers, The State University of New Jersey) | Ziping Liu (Rutgers, The State University of New Jersey) | Andrew Gow (Rutgers, The State University of New Jersey)
S-nitrosylation is a post-translational modification (PTM) of proteins that is critical for many biochemical processes in cells. Though experimental procedures exist for detecting S-nitrosylation, computer modeling offers a more convenient and cost-effective way to identify potential sites of nitrosylation. Previous studies have explored different machine learning models for predicting S-nitrosylation in proteins; however, which machine learning methods and which input features yield the best predictive performance are still open questions. The goal of this project was to optimize the performance of a 3-layer Artificial Neural Network (ANN) using primary protein structure data for S-nitrosylation. Primary structure data were taken from the dbSNO database, and each protein entry was processed into instances of cysteines that were and were not S-nitrosylated. The full dataset contained 4150 S-nitrosylation instances and 18506 non-S-nitrosylation instances and was split into a training set and a testing set at a ratio of 70:30, respectively. The performance of the 3-layer ANN was assessed on the testing set using the Matthews Correlation Coefficient (MCC), the Area Under the Receiver Operating Characteristic curve (AROC), specificity, recall, accuracy, and precision. Mini-batch size, iterations per run, number of hidden neurons per layer, and other hyper-parameters were modified between runs to gauge improvements in ANN performance. The 3-layer ANN achieved an average MCC, AROC, specificity, recall, accuracy, and precision of 0.265, 0.685, 0.784, 0.467, 0.625, and 0.687, respectively, with 50 neurons in each hidden layer, 150 iterations, a learning rate of 0.05, a mini-batch size of 64, and a window size of 40.
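The step of processing each protein entry into cysteine instances can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `cysteine_windows`, the 1-based `modified_positions` set, the padding character `X`, and the reading of "window size of 40" as 20 flanking residues on each side of the cysteine are all assumptions made for the example.

```python
from typing import List, Set, Tuple


def cysteine_windows(
    sequence: str,
    modified_positions: Set[int],
    flank: int = 20,   # assumed: residues kept on each side of the cysteine
    pad: str = "X",    # assumed: placeholder for positions beyond the termini
) -> List[Tuple[str, int]]:
    """Return (window, label) pairs, one per cysteine in the sequence.

    label is 1 when the cysteine's 1-based position is listed in
    modified_positions (S-nitrosylated) and 0 otherwise.
    """
    seq = sequence.upper()
    # Pad both ends so every window has the same length, 2 * flank + 1.
    padded = pad * flank + seq + pad * flank
    instances = []
    for i, residue in enumerate(seq):
        if residue != "C":
            continue
        # Index i in seq corresponds to index i + flank in padded, so this
        # slice is centered on the cysteine.
        window = padded[i : i + 2 * flank + 1]
        label = 1 if (i + 1) in modified_positions else 0
        instances.append((window, label))
    return instances


if __name__ == "__main__":
    # Toy sequence with a hypothetical modified cysteine at position 3.
    for window, label in cysteine_windows("MKCAAC", {3}, flank=2):
        print(window, label)
```

Each fixed-length window would still need to be encoded numerically (for example, one-hot per residue) before being passed to a network; that encoding is not specified in the abstract.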
Because hyper-parameter tuning did not significantly improve the ANN’s performance beyond the metrics presented above, secondary structure data from Protein Data Bank (PDB) files will be added as an input and tested to determine whether it improves the predictive performance of the 3-layer ANN. The results presented above characterize the 3-layer ANN’s predictive performance and suggest that additional data, such as secondary structure, could allow for improvement.
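For readers unfamiliar with the evaluation metrics named in the abstract, the sketch below shows how MCC, specificity, recall, accuracy, and precision follow from the four confusion-matrix counts of a binary classifier. The function name `binary_metrics` and the example counts are illustrative only and are not the study's data. AROC is omitted because it requires the classifier's continuous output scores rather than thresholded counts.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute threshold-based metrics from confusion-matrix counts.

    tp/fn: S-nitrosylated cysteines predicted positive/negative.
    tn/fp: unmodified cysteines predicted negative/positive.
    Ratios with a zero denominator are reported as 0.0.
    """
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "recall": ratio(tp, tp + fn),        # sensitivity
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Illustrative counts only.
    for name, value in binary_metrics(tp=40, fp=20, tn=80, fn=60).items():
        print(f"{name}: {value:.3f}")
```

MCC is often preferred over accuracy for imbalanced data such as this, since it uses all four cells of the confusion matrix.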