Journal of Nanjing University (Natural Science), 2023, Vol. 59, Issue (6): 919–927. doi: 10.13232/j.cnki.jnju.2023.06.002


Application of machine learning in the study of the hydrophobic interaction model of proteins

Chenbo Feng, Weiqiang Ma, Run Cheng, Jun Wang

  1. School of Physics, Nanjing University, Nanjing, 210093, China
  • Received: 2023-09-27 Online: 2023-11-30 Published: 2023-12-06
  • Contact: Jun Wang E-mail: wangj@nju.edu.cn
  • Funding:
    National Natural Science Foundation of China (11774157)

Abstract:

The hydrophobic interaction is a highly complex, nonlinear, many-body effective interaction that plays a dominant role in protein folding. Analysis of the solvent-accessible surface area (SASA) of proteins is an important means of characterizing this interaction. Analytical and numerical SASA methods struggle to balance computational cost against accuracy. To address this, we apply machine learning to the prediction of protein SASA. Compared with typical traditional methods, the error is roughly one order of magnitude smaller, and the calculation is nearly two orders of magnitude faster than the analytical method. We also extend the method to predict SASA from coarse-grained protein structures, again with good results. The method provides a new, efficient computational tool for the study of protein physics.

Key words: protein folding, hydrophobic interaction, solvent-accessible surface area (SASA), machine learning
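For orientation, the kind of numerical SASA calculation that the abstract contrasts with the learned predictor can be sketched with the classic Shrake-Rupley point-sampling scheme. This is not the method of the paper, whose details are not given on this page. The probe radius of 1.4, the golden-spiral point placement, and all function names below are assumptions made for illustration only.

```python
import math


def sphere_points(n):
    """Return n roughly uniform points on the unit sphere (golden-spiral lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n          # z runs from near +1 to near -1
        r = math.sqrt(max(0.0, 1.0 - z * z))   # radius of the circle at height z
        phi = k * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points


def shrake_rupley(coords, radii, probe=1.4, n_points=500):
    """Per-atom SASA estimate (hypothetical helper, not the paper's code).

    For each atom, test points are placed on its probe-expanded sphere; the
    fraction not inside any other expanded sphere, times the sphere area,
    approximates that atom's solvent-accessible surface area.
    """
    unit = sphere_points(n_points)
    expanded = [r + probe for r in radii]
    areas = []
    for i, (ci, ri) in enumerate(zip(coords, expanded)):
        # Only atoms whose expanded spheres overlap sphere i can bury its points.
        neighbors = [
            (cj, rj)
            for j, (cj, rj) in enumerate(zip(coords, expanded))
            if j != i and math.dist(ci, cj) < ri + rj
        ]
        exposed = 0
        for ux, uy, uz in unit:
            p = (ci[0] + ri * ux, ci[1] + ri * uy, ci[2] + ri * uz)
            if all(math.dist(p, cj) >= rj for cj, rj in neighbors):
                exposed += 1
        areas.append(4.0 * math.pi * ri * ri * exposed / n_points)
    return areas


# Two overlapping atoms (assumed radii and coordinates, same length unit).
per_atom = shrake_rupley([(0.0, 0.0, 0.0), (0.0, 0.0, 3.0)], [1.6, 1.6])
print(per_atom, sum(per_atom))
```

The cost-accuracy trade-off is visible in `n_points`: more test points reduce the sampling error but increase the work per atom, and the naive neighbor search above scales quadratically with the number of atoms.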

CLC number:

  • Q615

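Figure 1

Schematic of the local coordinate frame of atom i

Figure 2

Fraction of the total SASA contributed by atoms within a given number of neighbors, as a function of the neighbor number M

Figure 3

Schematic of the SASA prediction method and the neural network architecture

Figure 4

Histogram of the SASA of 200,000 atoms in proteins

Figure 5

Learning curves of the model trained on all-atom protein structures

Figure 6

(a) Predicted versus analytical single-atom SASA; (b) mean absolute error of the predicted single-atom SASA as a function of SASA

Figure 7

(a) Predicted total protein SASA (SASApre) versus the analytical value (SASAact); (b) relative prediction error δ versus the number of atoms N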

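Table 1

Comparison of prediction errors for folded and unfolded proteins

Protein ID   Folded, analytical   Folded, predicted   Relative error   Unfolded, analytical   Unfolded, predicted   Relative error
dlsr4b_      11837                11892               0.459%           14070                  14241                 1.21%
dlssxa_      7844                 7885                0.523%           12938                  13037                 0.765%
dlu8fo2      8973                 8944                0.323%           10577                  10679                 0.965%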
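The relative errors in Table 1 can be rechecked from the tabulated SASA values. The sketch below assumes the relative error is defined as |predicted − analytical| / analytical, which this page does not state explicitly. Because the tabulated values are rounded, the recomputed percentages may differ from the reported ones in the last digits.

```python
def relative_error(predicted, reference):
    """Relative error |predicted - reference| / reference, as a fraction.

    The definition is an assumption; the page does not spell it out.
    """
    if reference == 0:
        raise ValueError("reference value must be non-zero")
    return abs(predicted - reference) / reference


# (protein ID, folded analytical, folded predicted,
#  unfolded analytical, unfolded predicted), copied from Table 1.
rows = [
    ("dlsr4b_", 11837, 11892, 14070, 14241),
    ("dlssxa_", 7844, 7885, 12938, 13037),
    ("dlu8fo2", 8973, 8944, 10577, 10679),
]

for pid, f_act, f_pre, u_act, u_pre in rows:
    folded = 100.0 * relative_error(f_pre, f_act)
    unfolded = 100.0 * relative_error(u_pre, u_act)
    print(f"{pid}: folded {folded:.3f}%, unfolded {unfolded:.3f}%")
```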

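Figure 8

Prediction accuracy of the neural network versus training-set size Nsample

Figure 9

Time taken by different methods to predict the SASA of proteins of different sizes

Figure 10

Histograms of SASAres for hydrophilic residues (a) and hydrophobic residues (b)

Figure 11

(a) Predicted versus analytical total protein SASA based on Cα structures; (b) relative prediction error δres versus protein chain length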
