Journal of Nanjing University (Natural Science) ›› 2020, Vol. 56 ›› Issue (5): 744–753. doi: 10.13232/j.cnki.jnju.2020.05.014


Extraction of minimal representation of protein folds based on convolutional neural network

Yue Pan1, Jun Wang1,2,3, Wenfei Li1,2, Jian Zhang1,2, Wei Wang1,2

  1. School of Physics, Nanjing University, Nanjing, 210023, China
  2. Institute for Brain Sciences, Nanjing University, Nanjing, 210023, China
  3. State Key Laboratory for Novel Software Technology of Nanjing University, Department of Computer Science and Technology, Nanjing University, Nanjing, 210023, China
  • Received: 2020-08-10 Online: 2020-09-30 Published: 2020-09-29
  • Contact: Jun Wang, Wei Wang E-mail: wangj@nju.edu.cn; wangwei@nju.edu.cn
  • Funding:
    National Natural Science Foundation of China (11774157)

Abstract:

Establishing a complete protein universe from sequence and structural information is a key problem in bioinformatics, and is of great importance for protein structure prediction, the analysis of protein evolutionary pathways, and protein structure design. In this paper, starting from a simplified representation of protein structure, the contact map, we trained a deep convolutional neural network (DCNN) for feature extraction and identified the shortest feature vectors that were able to recognize the fold types of structural domains correctly. We constructed a space of protein folds spanned by these shortest feature vectors, and analyzed the high-dimensional distribution of the different folds with spectral clustering and other methods. These shortest feature vectors balance information completeness against redundancy, and represent well the spatial relationships among all seven common protein classes. Our results fill a gap left by previous studies of the protein universe, which did not describe the spatial positions and mutual relationships of the less common classes, and deepen the understanding of structural similarity between proteins.

Key words: protein universe, deep learning, convolutional neural network, protein fold recognition

CLC number:

  • Q61

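Table 1. Number of domains of each class in the training and validation sets

                 All α   All β   α/β    α+β    Multi-domain proteins   Membrane proteins   Small proteins
Training set     4491    6770    7390   6033   466                     441                 1409
Validation set   476     850     859    661    64                      46                  229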

Figure 1. Protein contact map of domain d12asa_. The white pixels indicate contacts (namely, the corresponding distances between residues are shorter than the preset threshold), and the black ones indicate that there is no contact between the residues concerned.
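
As an illustration of the contact-map representation described in this caption, the following is a minimal sketch that thresholds pairwise residue distances. The function name `contact_map`, the use of one coordinate per residue, and the 8 Å default threshold are assumptions made for this example; the page does not state the threshold or the atom choice used in the paper.

```python
import math


def contact_map(coords, threshold=8.0):
    """Build a binary contact map from one 3D coordinate per residue.

    Entry [i][j] is 1 when residues i and j are closer than `threshold`
    (in the same units as the coordinates), and 0 otherwise.
    The 8.0 default is an illustrative assumption, not the paper's value.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if math.dist(coords[i], coords[j]) < threshold:
                # The map is symmetric, so fill both triangles at once.
                cmap[i][j] = cmap[j][i] = 1
    return cmap


# Hypothetical toy chain: four residues on a line, 5 units apart.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)

# Render contacts as '#' (white pixels in the figure) and non-contacts as '.'.
for row in toy_map:
    print("".join("#" if v else "." for v in row))
```

In this toy chain only the diagonal and nearest neighbours fall under the threshold, so the map is a band around the diagonal. A real domain would add off-diagonal patches where distant parts of the sequence pack together.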

Figure 2. The DCNN model used in this paper.

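Figure 3. (A) Fold recognition accuracy on the validation set for DCNNs that output feature vectors of different lengths; (B) Adjusted Rand index versus the number of clusters for different lengths of the fold feature vector.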
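
The adjusted Rand index named in panel (B) scores the agreement between two partitions of the same items, for example known class labels against cluster assignments. The sketch below uses the standard pair-counting definition. The function name and the toy labels are hypothetical, and this is not claimed to be the code used in the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items.

    Returns 1.0 for identical partitions (up to relabeling) and values
    near 0.0 for agreement at the level expected by chance.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the partitions trivially agree

    # Contingency counts and their row/column marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs placed together in both, in the first, in the second.
    together_both = sum(comb(c, 2) for c in cells.values())
    together_true = sum(comb(c, 2) for c in rows.values())
    together_pred = sum(comb(c, 2) for c in cols.values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in a single cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical toy labels: clusters match the classes up to renaming.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because the index depends only on which pairs of items are grouped together, cluster labels need not match class names, which makes it convenient for comparing clusterings with different numbers of clusters.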

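Figure 4. (A) Diagonalized confusion matrix when the number of clusters is 7; (B) Hierarchy diagram of the clustering of class labels as the number of clusters goes from 2 to 8. The numbers of folds for the corresponding cases are given.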

Figure 5. Three-dimensional map of the fold space.

Figure 6. Projections of protein folds onto different subspaces.

Figure 7. Distribution of the helix-bundle coefficient in the fold map. The darker dots have larger bundle values and the lighter dots have smaller bundle values. The region where membrane proteins and all-α proteins mix is marked in red.
