Journal of Nanjing University (Natural Science), 2022, Vol. 58, Issue (1): 103–114. doi: 10.13232/j.cnki.jnju.2022.01.011


Simplification of amino acid alphabet based on protein sequences

Xinpeng Zhang, Jun Wang, Wei Wang

  1. School of Physics, Nanjing University, Nanjing, 210093, China
  • Received: 2021-07-12 Online: 2022-01-30 Published: 2022-02-22
  • Contact: Jun Wang, Wei Wang E-mail: wangj@nju.edu.cn, wangwei@nju.edu.cn
  • Supported by:
    National Natural Science Foundation of China (11774157)

Abstract:

The structural and functional properties of proteins are encoded in their amino acid sequences. The rules governing the sequence-structure mapping are regarded as the secondary genetic code. Simplifying the amino acid alphabet reduces the redundancy in protein sequences and helps to reveal these coding rules. The alphabet can be simplified based on the monomeric features, pairwise interactions, and similarity of the amino acids. Here, using only the sequence information of proteins, we construct an embedded representation of amino acids from the occurrence frequency of nearest-neighbor amino acids. On this basis, we propose a model that compresses the embedded representation by reconstructing the occurrence frequency of the nearest-neighbor amino acids, and we name this method AA2Vec. The experimental results show that one specific representation dimension (three dimensions) is significantly more robust than the others. The extracted information captures the physicochemical and biochemical properties of amino acids and the interactions between nearest-neighbor amino acids. Notably, the method is stable across sequence datasets (SCOPe) with different sequence identities. The method gives a minimal representation of amino acids, which may help to generate simplified representations of protein sequences and to build simplified models of proteins.

Key words: sequence information, amino acid interaction, AA2Vec, amino acid alphabet

CLC Number: 

  • Q61

Figure 1

Schematic of the AA2Vec model. The figure illustrates the training process of the model. The orange and green blocks represent the amino acid matrix U and the neighboring matrix V. The table at the top right is the normalized co-occurrence matrix M. Each entry of M gives the probability that one amino acid appears as a neighbor of another.
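The normalized co-occurrence matrix M described in this caption can be sketched as follows. This is only a minimal illustration: it assumes that the neighbors of a residue are the residues immediately before and after it, and that each row of M is normalized to sum to 1. The function name `cooccurrence_matrix` and the toy sequence are hypothetical and are not taken from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def cooccurrence_matrix(sequences):
    """Row-normalized nearest-neighbor co-occurrence frequencies.

    Assumption: the neighbors of position i are positions i-1 and i+1.
    Residues outside the 20-letter alphabet are skipped.
    """
    n = len(AMINO_ACIDS)
    counts = [[0.0] * n for _ in range(n)]
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            if left in INDEX and right in INDEX:
                i, j = INDEX[left], INDEX[right]
                counts[i][j] += 1.0  # right is a neighbor of left
                counts[j][i] += 1.0  # left is a neighbor of right
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for amino acids that never occur stay all-zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy input for illustration only (not real SCOPe data).
M = cooccurrence_matrix(["ACD"])
```

In this toy case C has the two neighbors A and D, so the row for C assigns probability 0.5 to each of them.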

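Figure 2

Training results of the model: (a) training curve for dimension 3 on the SCOPe40 dataset, with an inset showing smaller x- and y-axis ranges; (b) mean deviation σP for different dimensions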

Table 1

Eigenvalues of the W matrix and their standard deviations

Eigenvalue 1        Eigenvalue 2            Eigenvalue 3
λ1 = 0.90 ± 0.01    λ2 = 0.0569 ± 0.0002    λ3 = -0.93 ± 0.01

Figure 3

Standard deviation of the PIP loss versus embedding dimension on different datasets: (a) SCOPe40 dataset; (b) SCOPe dataset; (c) DisProt dataset. There are 20 different embedding matrices for each dimension, and each gray dot represents one PIP loss (194 dots in total). The loss data for each dimension are drawn as a box showing the lower bound, first quartile, median, third quartile, and upper bound. Bars of different colors represent different dimensions.
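The PIP (pairwise inner product) loss used to compare embedding matrices can be illustrated with a short sketch. It assumes the common definition, the Frobenius norm of E1·E1ᵀ − E2·E2ᵀ, which the caption itself does not spell out. The helper names and the toy matrices are hypothetical.

```python
import math


def gram(E):
    """Pairwise inner products E @ E.T for a list-of-rows matrix."""
    return [[sum(a * b for a, b in zip(u, v)) for v in E] for u in E]


def pip_loss(E1, E2):
    """Frobenius norm of E1 E1^T - E2 E2^T (rows are item vectors)."""
    G1, G2 = gram(E1), gram(E2)
    return math.sqrt(sum((x - y) ** 2
                         for r1, r2 in zip(G1, G2)
                         for x, y in zip(r1, r2)))


# Toy 2-item, 2-dimensional embeddings (illustrative values only).
E = [[1.0, 0.0], [0.0, 2.0]]
E_rotated = [[0.0, 1.0], [-2.0, 0.0]]  # E rotated by 90 degrees
E_scaled = [[2.0, 0.0], [0.0, 2.0]]    # first vector stretched

loss_rotated = pip_loss(E, E_rotated)  # rotation keeps all inner products
loss_scaled = pip_loss(E, E_scaled)    # stretching changes them
```

Because the loss depends only on inner products, two embeddings that differ by a rotation give zero loss, which is why it suits comparisons between independently trained embedding matrices.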

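Figure 4

Three-dimensional embedding space of the amino acids: (a) each of the 20 amino acids is shown as a red sphere, labeled with its one-letter abbreviation; (b) and (c) two-dimensional projections of the same space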

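Figure 5

Comparison of similarities from AA2Vec and the unrounded BLOSUM62: (a) the x-axis is the BLOSUM similarity and the y-axis is the similarity in the AA2Vec representation space, with each blue dot representing one amino acid pair; (b) histograms of the BLOSUM62 and AA2Vec similarity distributions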

Figure 6

Hierarchical clustering map of the 20 amino acids. The clustering algorithm is based on Euclidean distance. Rows and columns denote dimension indexes and amino acid types, respectively. The color of each cell indicates the value of the vector in that dimension. Green boxes mark the 4 major clusters in the hierarchical clustering.
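Hierarchical clustering on Euclidean distance, as named in this caption, can be sketched in a few lines. The caption does not state the linkage rule, so average linkage is an assumption here. The function names and the 3D coordinates are made up for illustration and are not the paper's embedding.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering on Euclidean distance.

    vectors: dict mapping label -> coordinate tuple.
    Returns a sorted list of sorted label lists, one per cluster.
    """
    clusters = [[label] for label in vectors]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [euclidean(vectors[a], vectors[b])
                         for a in clusters[i] for b in clusters[j]]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Made-up 3D coordinates for illustration; NOT the paper's embedding.
toy = {
    "I": (1.0, 0.0, 0.0), "L": (0.9, 0.1, 0.0),
    "D": (-1.0, 0.0, 0.2), "E": (-0.9, 0.1, 0.2),
}
groups = agglomerate(toy, 2)
```

With these toy coordinates the two nearby pairs are merged first, so `groups` separates {D, E} from {I, L}.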

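Figure 7

LDA projection and correlation coefficients: (a) comparison of the LDA projection values with one set of experimental results [29], with a correlation coefficient of 0.91; each letter is an amino acid abbreviation, red and blue letters are the two groups of labeled amino acids, and green and yellow letters are unlabeled amino acids, colored according to their different physicochemical properties; (b) histogram of the correlation coefficients between the LDA projection values and AAIndex values, covering 35 experimental scales related to hydrophobicity and hydrophilicity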
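A correlation coefficient of the kind reported in this caption can be computed as below. The caption does not say which coefficient is used, so the Pearson coefficient is an assumption. The function name and the numbers are placeholders, not values from the paper or from AAindex.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder numbers for illustration only.
projection = [0.1, 0.4, 0.5, 0.9]  # stand-in for LDA projection values
scale = [1.0, 2.5, 2.0, 4.0]       # stand-in for an experimental scale
r = pearson(projection, scale)
```

Repeating this calculation against each of several property scales would give the set of coefficients that a histogram like the one in panel (b) summarizes.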

Figure 8

Hierarchical clustering map of short peptides. Four typical peptides and the corresponding proteins are shown on the left. The clustering table is shown on the right with three types of pattern. The first column is the HP pattern, with "H" standing for {AVLIMFW}, "P" standing for {STNQYKRHDE} and "_" standing for the others ({GCP}). The second column is the positive and negative pattern, with "P" standing for positive amino acids {RHK}, "N" standing for negative amino acids {DE} and "_" standing for the others. The last column is the aromatic and aliphatic pattern, with "F" representing aromatic amino acids {HYWF}, "Z" representing aliphatic amino acids {ILV} and "_" representing the others.
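The three pattern encodings defined in this caption can be applied to a peptide as follows. The group definitions are copied from the caption. The function name `to_pattern` and the example peptide are hypothetical; the peptide is not one of those shown in the figure.

```python
# Group definitions as given in the caption.
HP_GROUPS = {"H": "AVLIMFW", "P": "STNQYKRHDE"}
CHARGE_GROUPS = {"P": "RHK", "N": "DE"}
AROMATIC_GROUPS = {"F": "HYWF", "Z": "ILV"}


def to_pattern(peptide, groups):
    """Replace each residue by its group symbol, or "_" if ungrouped."""
    lookup = {aa: symbol
              for symbol, members in groups.items()
              for aa in members}
    return "".join(lookup.get(aa, "_") for aa in peptide)


peptide = "GAKLDEW"  # hypothetical peptide for illustration
hp = to_pattern(peptide, HP_GROUPS)              # "_HPHPPH"
charge = to_pattern(peptide, CHARGE_GROUPS)      # "__P_NN_"
aromatic = to_pattern(peptide, AROMATIC_GROUPS)  # "___Z__F"
```

Note that the same letter "P" means polar in the HP pattern and positive in the charge pattern, so the encodings are kept in separate dictionaries.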

1 UniProt Consortium. UniProt:The universal protein knowledgebase in 2021. Nucleic Acids Research,2021,49(D1):D480-D489.
2 Senior A W,Evans R,Jumper J,et al. Improved protein structure prediction using potentials from deep learning. Nature,2020,577(7792):706-710.
3 Yang J Y,Anishchenko I,Park H,et al. Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences of the United States of America,2020,117(3):1496-1503.
4 Xu J B,McPartlon M,Li J. Improved protein structure prediction by deep learning irrespective of co-evolution information. bioRxiv,2020,doi:10.1101/2020.10.12.336859.
5 Xu J B,Wang S. Analysis of distance-based protein structure prediction by deep learning in CASP13. Proteins:Structure,Function,and Bioinformatics,2019,87(12):1069-1081.
6 Hashemifar S,Neyshabur B,Khan A A,et al. Predicting protein-protein interactions through sequence-based deep learning. Bioinformatics,2018,34(17):i802-i810.
7 Wang S,Sun S Q,Li Z,et al. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Computational Biology,2017,13(1):e1005324.
8 Hanson J,Paliwal K K,Litfin T,et al. Getting to know your neighbor:Protein structure prediction comes of age with contextual machine learning. Journal of Computational Biology,2020,27(5):796-814.
9 Wang J,Wang W. Simplification of complexity in protein molecular systems by grouping amino acids:A view from physics. Advances in Physics:X,2016,1(3):444-466.
10 Cieplak M,Holter N,Maritan A,et al. Amino acid classes and the protein folding problem. The Journal of Chemical Physics,2001,114(3):1420-1423.
11 Pape S,Hoffgaard F,Hamacher K. Distance-dependent classification of amino acids by information theory. Proteins:Structure,Function,and Bioinformatics,2010,78(10):2322-2328.
12 Wang J,Wang W. A computational approach to simplifying the protein folding alphabet. Nature Structural Biology,1999,6(11):1033-1038.
13 Li T P,Fan K,Wang J,et al. Reduction of protein sequence complexity by residue grouping. Protein Engineering,2003,16(3):323-330.
14 Cannata N,Toppo S,Romualdi C,et al. Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices. Bioinformatics,2002,18(8):1102-1108.
15 Wrabl J Q,Grishin N V. Grouping of amino acid types and extraction of amino acid properties from multiple sequence alignments using variance maximization. Proteins:Structure,Function,and Bioinformatics,2005,61(3):523-534.
16 Asgari E,Mofrad M R K. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One,2015,10(11):e0141287.
17 Kimothi D,Soni A,Biyani P,et al. Distributed representations for biological sequence analysis. 2016,arXiv:.
18 Yang K K,Wu Z,Bedbrook C N,et al. Learned protein embeddings for machine learning. Bioinformatics,2018,34(15):2642-2648.
19 Hamid M N,Friedberg I. Identifying antimicrobial peptides using word embedding with deep recurrent neural networks. bioRxiv,2018,doi:10.1101/255505.
20 Rives A,Meier J,Sercu T,et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences of the United States of America,2021,118(15):e2016239118.
21 Murzin A G,Brenner S E,Hubbard T,et al. SCOP:A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology,1995,247(4):536-540.
22 Sickmeier M,Hamilton J A,LeGall T,et al. DisProt:The database of disordered proteins. Nucleic Acids Research,2007,35(S1):D786-D793.
23 Zhou Q,Tang P Z,Liu S X,et al. Learning atoms for materials discovery. Proceedings of the National Academy of Sciences of the United States of America,2018,115(28):E6411-E6417.
24 Hamilton W L,Leskovec J,Jurafsky D. Cultural shift or linguistic drift? Comparing two computational measures of semantic change∥Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. Austin,TX,USA:Association for Computational Linguistics,2016:2116-2121.
25 Artetxe M,Labaka G,Agirre E. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance∥Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. Austin,TX,USA:Association for Computational Linguistics,2016:2289-2294.
26 Smith S L,Turban D H P,Hamblin S,et al. Offline bilingual word vectors,orthogonal transformations and the inverted softmax. 2017,arXiv:.
27 Henikoff S,Henikoff J G. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America,1992,89(22):10915-10919.
28 Kawashima S,Pokarowski P,Pokarowska M,et al. AAindex:Amino acid index database,progress report 2008. Nucleic Acids Research,2008,36(S1):D202-D205.
29 Black S D,Mould D R. Development of hydrophobicity parameters to analyze proteins which bear post- or cotranslational modifications. Analytical Biochemistry,1991,193(1):72-82.