Journal of Nanjing University (Natural Science), 2021, Vol. 57, Issue 4: 627-640. doi: 10.13232/j.cnki.jnju.2021.04.011

A hybrid deep neural network-based method for predicting lysine acetylation sites

Zhiliang Yan, Zhipeng Feng, Dan Liu, Huiqing Wang

  1. College of Information and Computer, Taiyuan University of Technology, Taiyuan, 030024, China
  • Received: 2021-04-13 Online: 2021-07-30 Published: 2021-07-30
  • Contact: Huiqing Wang E-mail: 1013208257@qq.com
  • Funding:
    National Natural Science Foundation of China (61976150); Key Research and Development Program of Shanxi Province (High-Tech Field) (201903D121151)

Abstract:

Lysine acetylation (Kace) is common in human metabolic enzymes and is closely related to a variety of metabolic diseases, so accurately identifying Kace sites is important for research on the treatment of these diseases. Most existing Kace site prediction methods take sequence-level protein information as input and do not fully consider protein structural properties. Their feature extraction also ignores the sequential correlation between amino acid residues, which loses information and reduces prediction accuracy. We therefore propose CL-Kace (Kace site Prediction based on Convolutional and Long Short-Term Memory Networks), a new deep learning model for Kace site prediction. CL-Kace introduces protein structural properties and combines them with the original protein sequence and amino acid physicochemical properties to construct the feature space of a site. A Convolutional Neural Network (CNN) extracts features, and a Bidirectional Long Short-Term Memory (BiLSTM) network captures the sequential dependencies between residues, which improves the network's abstraction ability for identifying potential Kace sites. Experimental results show that CL-Kace outperforms existing Kace site predictors and can effectively predict potential sites.

Key words: lysine acetylation, protein structural properties, Convolutional Neural Network, Bidirectional Long Short-Term Memory, feature learning

CLC number: TP391

Figure 1. Model framework of CL-Kace

Figure 2. Structure of the LSTM unit
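
As a reference for the unit named in Figure 2, the standard LSTM update equations are sketched here. The notation (weight matrices W and U, biases b, sigmoid σ, element-wise product ⊙) is the common textbook form and is an assumption; the symbols used in the paper itself may differ.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

In a bidirectional LSTM, one such recurrence runs over the residues from left to right and a second runs from right to left, and the two hidden states at each position are typically concatenated.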

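Table 1. Hyperparameter configuration of the CL-Kace model

| Hyperparameter | Value |
|---|---|
| Number of convolutional layers | 1 |
| Convolution kernel size | 3 |
| Number of convolution kernels | 96 |
| Convolution stride | 1 |
| BiLSTM hidden units | 128 |
| Dropout rate | 0.6 |
| L2 regularization coefficient | 0.0001 |
| Learning rate | 0.001 |
| Batch size | 512 |
| Maximum number of epochs | 2000 |
| Early stopping patience | 20 |
| Class weight ratio (positive:negative) | 9.0:1.0 |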
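
The settings in Table 1 can be collected into a single configuration object, as in the minimal sketch below. The class name `CLKaceConfig`, its field names, and the `weighted_bce` helper are hypothetical and are not taken from the paper; the loss shown is one common way to apply a 9.0:1.0 class weight and is an assumption about how the weights are used.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class CLKaceConfig:
    """Values copied from Table 1; the field names are illustrative only."""
    conv_layers: int = 1
    kernel_size: int = 3
    num_kernels: int = 96
    conv_stride: int = 1
    bilstm_hidden_units: int = 128
    dropout_rate: float = 0.6
    l2_coefficient: float = 1e-4
    learning_rate: float = 1e-3
    batch_size: int = 512
    max_epochs: int = 2000
    early_stopping_patience: int = 20
    positive_class_weight: float = 9.0
    negative_class_weight: float = 1.0


def weighted_bce(y_true, y_prob, cfg, eps=1e-12):
    """Class-weighted binary cross-entropy, averaged over samples.

    This is an assumed loss form, not one confirmed by the source.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -cfg.positive_class_weight * math.log(p)
        else:
            total += -cfg.negative_class_weight * math.log(1.0 - p)
    return total / len(y_true)
```

With these weights, a misclassified positive sample contributes nine times as much to the loss as a misclassified negative sample at the same predicted probability.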

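Table 2. Statistics of the datasets used in this paper

| Species | Dataset | Proteins | Positive samples | Negative samples |
|---|---|---|---|---|
| Human | Training set | 4212 | 20354 | 184097 |
| Human | Independent test set | 469 | 2540 | 19374 |
| E. coli | Training set | 1488 | 7242 | 17162 |
| E. coli | Independent test set | 166 | 905 | 1759 |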
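
The class imbalance in Table 2 can be quantified with a short calculation. The sketch below uses the sample counts as read from Table 2; the dictionary layout and function name are hypothetical. Relating the human training ratio to the 9.0:1.0 class weight in Table 1 is an inference, not something the source states.

```python
# (positive, negative) sample counts as read from Table 2
DATASETS = {
    ("human", "training"): (20354, 184097),
    ("human", "independent test"): (2540, 19374),
    ("E. coli", "training"): (7242, 17162),
    ("E. coli", "independent test"): (905, 1759),
}


def negatives_per_positive(positive, negative):
    """Number of negative samples for every positive sample."""
    return negative / positive


ratios = {
    key: round(negatives_per_positive(pos, neg), 2)
    for key, (pos, neg) in DATASETS.items()
}

for (species, split), ratio in ratios.items():
    print(f"{species} {split}: {ratio} negatives per positive")
```

The human training set has roughly nine negatives per positive, which is close to the class weight ratio listed in Table 1.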

Figure 3. Results of the parameter experiments for the CL-Kace model

Figure 4. AUC of the CL-Kace model as a function of the Kace site window size L
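
To make the window size L in Figure 4 concrete, the sketch below cuts a fixed-length window around each lysine (K) in a sequence. The function name, the odd-length requirement, and the use of "-" as a padding symbol at the sequence ends are assumptions for illustration; the source does not describe how the paper builds its windows.

```python
def extract_lysine_windows(sequence, window_size, pad="-"):
    """Return (1-based position, window) for every lysine in the sequence.

    The window is centered on the lysine; positions beyond the sequence
    ends are filled with the padding symbol.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so that K sits in the center")
    half = window_size // 2
    padded = pad * half + sequence + pad * half
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # In the padded string the residue sits at index i + half,
            # so the window starts at index i.
            windows.append((i + 1, padded[i:i + window_size]))
    return windows


print(extract_lysine_windows("MKTAYK", 5))
```

Each returned window has exactly `window_size` characters, so all candidate sites share one input shape regardless of where the lysine lies in the protein.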

Figure 5. Ten-fold cross-validation results of the CL-Kace model on the human training set

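Table 3. Results of five runs of ten-fold cross-validation of the CL-Kace model on the human training set

| Run | Sn (%) | Sp (%) | ACC (%) | MCC (%) | G-mean (%) | AUC (%) | AUPR (%) |
|---|---|---|---|---|---|---|---|
| 1 | 74.11±2.97 | 73.36±2.67 | 73.44±2.12 | 30.70±0.72 | 73.68±0.38 | 81.31±0.28 | 32.93±0.76 |
| 2 | 70.54±4.17 | 76.21±3.16 | 75.64±2.45 | 31.13±0.73 | 73.23±0.76 | 81.38±0.29 | 33.33±0.88 |
| 3 | 72.49±4.11 | 74.57±3.44 | 74.37±2.70 | 30.85±0.92 | 73.43±0.56 | 81.32±0.47 | 33.15±0.82 |
| 4 | 70.18±4.85 | 76.04±3.52 | 75.46±2.70 | 30.77±0.66 | 72.93±0.93 | 81.18±0.42 | 32.82±0.62 |
| 5 | 74.24±2.90 | 73.30±2.61 | 73.45±2.10 | 30.67±0.70 | 73.29±0.37 | 81.25±0.31 | 32.89±0.73 |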
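
The threshold-based metrics reported in Table 3 (Sn, Sp, ACC, MCC, G-mean) all derive from the confusion matrix. The sketch below uses their standard definitions; it is assumed, not confirmed by the source, that the paper uses the same formulas. AUC and AUPR need ranked prediction scores and are not covered here. The function name and the example counts are hypothetical.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fp + tn + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    g_mean = math.sqrt(sn * sp)              # geometric mean of Sn and Sp
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc, "G-mean": g_mean}


# Hypothetical counts, for illustration only
metrics = confusion_metrics(tp=90, fp=20, tn=80, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```

Because MCC and G-mean combine performance on both classes, they are less flattered by a large negative class than ACC is, which matters for datasets as imbalanced as the human training set.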

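Figure 6. t-SNE visualization of the initial encoding vectors of the human independent test set samples and of the high-level abstract features extracted by CL-Kace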

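Table 4. Results of the CL-Kace ablation experiments

| Metric | Baseline model | DNN→CNN | +BiLSTM | +Protein structural properties |
|---|---|---|---|---|
| MCC | 17.47% | 25.90% | 27.98% | 30.70% |
| G-mean | 58.31% | 70.41% | 72.31% | 73.68% |
| AUC | 70.07% | 78.53% | 79.72% | 81.31% |
| AUPR | 20.03% | 28.10% | 29.36% | 32.93% |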

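Table 5. Main information about the compared methods

| Model | Information encoding | Classification algorithm |
|---|---|---|
| MusiteDeep | One-of-21 encoding of the original protein sequence | CNN + attention network |
| CapsNet | Five-dimensional comprehensive amino acid physicochemical properties and one-dimensional gap position information | CNN + capsule network |
| DeepAcet | One-hot encoding, BLOSUM62, composition of K-spaced amino acid pairs, information gain, AAIndex, and position-specific scoring matrix | DNN |
| PSKAcePred | Amino acid composition, protein evolutionary similarity, and amino acid physicochemical properties | SVM |
| EnsemblePail | Sequence features encoded with an improved position weight matrix | Ensemble SVM |
| GPS-PAIL 2.0 | BLOSUM62 | GPS 2.2 algorithm |
| PHOSIDA | Original protein sequence information | SVM |
| ProAcePred | Amino acid composition, binary amino acid encoding, position weight amino acid composition, composition of K-spaced amino acid pairs, average accessible surface area, grouping-based weight encoding, and KNN evolutionary features | SVM |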
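
The one-of-21 encoding listed for MusiteDeep in Table 5 can be illustrated as follows. This is a minimal sketch: the residue ordering, the choice of "-" as the 21st symbol, and the mapping of any non-standard residue to that symbol are assumptions and are not taken from MusiteDeep or from this paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues
ALPHABET = AMINO_ACIDS + "-"           # assumed 21st symbol: gap or padding
INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}


def one_of_21(window):
    """Encode each residue of a window as a 21-dimensional one-hot vector."""
    encoded = []
    for residue in window:
        vector = [0] * len(ALPHABET)
        # Assumption: unknown or non-standard residues fall back to "-"
        vector[INDEX.get(residue, INDEX["-"])] = 1
        encoded.append(vector)
    return encoded


matrix = one_of_21("-MKTA")
print(len(matrix), len(matrix[0]))
```

A window of length L therefore becomes an L × 21 binary matrix, a shape that a one-dimensional convolution can slide over along the residue axis.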

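Table 6. Ten-fold cross-validation performance of different methods on the human training set

| Model | Sn (%) | Sp (%) | ACC (%) | MCC (%) | G-mean (%) | AUC (%) | AUPR (%) |
|---|---|---|---|---|---|---|---|
| MusiteDeep | 73.39±1.47 | 69.48±1.52 | 69.87±1.24 | 26.97±0.54 | 71.39±0.33 | 78.46±0.58 | 28.05±0.96 |
| CapsNet | 72.16±1.59 | 70.93±1.22 | 71.05±0.96 | 27.37±0.41 | 71.53±0.35 | 78.40±0.39 | 28.10±0.71 |
| DeepAcet | 69.51±1.55 | 61.40±1.61 | 65.45±0.64 | 31.02±1.28 | 65.31±0.65 | 70.92±0.81 | 68.50±0.97 |
| PSKAcePred | 71.25±1.06 | 67.61±0.79 | 69.43±0.47 | 38.89±0.95 | 69.40±0.46 | 76.46±0.56 | 75.34±0.65 |
| CL-Kace | 74.11±2.97 | 73.36±2.67 | 73.44±2.12 | 30.70±0.72 | 73.68±0.38 | 81.31±0.28 | 32.93±0.76 |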

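Figure 7. Results of CL-Kace and other methods on the human independent test set

Figure 8. Analysis of variance of the ten-fold cross-validation AUC values of different methods on the human training set

Figure 9. ROC curves, ROC(01) curves, and PR curves of CL-Kace and other predictors on the human independent test set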

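Table 7. Results of CL-Kace and other methods on the E. coli independent test set

| Model | Sn | Sp | ACC | MCC | G-mean | AUC | AUPR |
|---|---|---|---|---|---|---|---|
| MusiteDeep | 75.36% | 66.91% | 69.78% | 40.09% | 71.01% | 78.15% | 61.49% |
| CapsNet | 71.82% | 70.15% | 70.72% | 40.04% | 70.98% | 78.48% | 62.14% |
| DeepAcet | 46.90% | 75.67% | 64.28% | 23.43% | 59.57% | 63.32% | 44.82% |
| PSKAcePred | 62.10% | 53.44% | 56.38% | 14.73% | 57.61% | 60.87% | 41.06% |
| EnsemblePail | 47.29% | 54.63% | 52.14% | 1.83% | 50.83% | | |
| GPS-PAIL 2.0 | 6.08% | 90.45% | 61.79% | -5.49% | 23.45% | | |
| ProAcePred | 8.40% | 95.28% | 65.77% | 7.36% | 28.29% | | |
| CL-Kace | 76.02% | 69.19% | 71.51% | 42.95% | 72.52% | 79.44% | 62.15% |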

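Table 8. Prediction results for the top 20 candidate sites on the human independent test set

| Rank | Protein | Position | Prediction score |
|---|---|---|---|
| 1 | P53350 | 9 | 0.9878 |
| 2 | Q8N8A6 | 90 | 0.9840 |
| 3 | Q9NQZ2 | 12 | 0.9819 |
| 4 | Q92522 | 143 | 0.9771 |
| 5 | Q9BQ61 | 118 | 0.9736 |
| 6 | Q9BQ15 | 203 | 0.9713 |
| 7 | O43719 | 190 | 0.9703 |
| 8 | Q96TA2 | 691 | 0.9693 |
| 9 | P53007 | 19 | 0.9692 |
| 10 | Q9H7N4 | 943 | 0.9684 |
| 11 | Q8IXK0 | 280 | 0.9680 |
| 12 | Q13126 | 40 | 0.9678 |
| 13 | O14810 | 14 | 0.9672 |
| 14 | Q9P270 | 459 | 0.9655 |
| 15 | Q5T5U3 | 15 | 0.9651 |
| 16 | Q9UGM6 | 354 | 0.9649 |
| 17 | Q6ZRV2 | 740 | 0.9646 |
| 18 | Q8WUM4 | 19 | 0.9643 |
| 19 | Q99436 | 72 | 0.9638 |
| 20 | Q8WUR7 | 109 | 0.9635 |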