Simplification of amino acid alphabet based on protein sequences

Journal of Nanjing University (Natural Sciences), 2022, Vol. 58, Issue 1: 103–114. doi: 10.13232/j.cnki.jnju.2022.01.011

Xinpeng Zhang, Jun Wang, Wei Wang

School of Physics, Nanjing University, Nanjing, 210093, China

Received: 2021-07-12 Online: 2022-01-30 Published: 2022-02-22
Corresponding authors: Jun Wang, Wei Wang. E-mail: wangj@nju.edu.cn, wangwei@nju.edu.cn
Funding: National Natural Science Foundation of China (11774157)

Abstract:

The structural and functional properties of proteins are encoded in their amino acid sequences. The rules governing the sequence-structure mapping are regarded as a secondary genetic code. Simplifying the amino acid alphabet reduces the redundancy in protein sequences and helps to reveal these coding rules. The alphabet can be simplified on the basis of the monomeric features, pairwise interactions, and similarity of the amino acids. Here, using only the sequence information of proteins, we construct an embedded representation of amino acids from the occurrence frequencies of nearest-neighbor amino acids. On top of this representation, we propose a model that compresses the embedding by reconstructing those nearest-neighbor occurrence frequencies. We name this method AA2Vec. We observe that one specific representation dimension (three) is significantly more robust than the others. The extracted information captures the physicochemical and biochemical properties of amino acids as well as the interactions between nearest-neighbor amino acids. Notably, the method is stable across sequence datasets (SCOPe) with different sequence identities. It yields a minimal representation of amino acids, which may help to generate simplified representations of protein sequences and to build simplified models of proteins.

Key words: sequence information, amino acid interaction, AA2Vec, amino acid alphabet
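The first step described in the abstract, building a representation of each amino acid from the occurrence frequencies of its nearest neighbors in sequence, can be illustrated with a minimal sketch. This is not the authors' AA2Vec implementation: the function name `neighbor_frequency_vectors`, the symmetric counting of left and right neighbors, the row normalization, and the toy sequences are all assumptions made for illustration. The compression model itself is not specified in the abstract and is not reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def neighbor_frequency_vectors(sequences):
    """Map each amino acid to a 20-dim vector of nearest-neighbor frequencies.

    Assumption: a neighbor is the residue directly before or after in the
    sequence, and each vector is normalized to sum to 1.
    """
    pair_counts = Counter()
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            # Skip pairs containing non-standard symbols (e.g. X, B, Z).
            if left in AMINO_ACIDS and right in AMINO_ACIDS:
                pair_counts[(left, right)] += 1
                pair_counts[(right, left)] += 1

    vectors = {}
    for aa in AMINO_ACIDS:
        row = [pair_counts[(aa, other)] for other in AMINO_ACIDS]
        total = sum(row)
        # Residues that never occur get an all-zero vector.
        vectors[aa] = [c / total for c in row] if total else [0.0] * len(row)
    return vectors


# Toy input (hypothetical); a real run would use a dataset such as SCOPe.
toy_sequences = ["MKTAYIAKQR", "GAKLAVA"]
vectors = neighbor_frequency_vectors(toy_sequences)
```

Each of the 20 resulting vectors is a raw, 20-dimensional description of one residue. A compression step in the spirit of the abstract would then map these vectors to a few dimensions while trying to reconstruct the same frequencies.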

Xinpeng Zhang, Jun Wang, Wei Wang. Simplification of amino acid alphabet based on protein sequences[J]. Journal of Nanjing University (Natural Sciences), 2022, 58(1): 103–114.

Table 1. Eigenvalues of the W matrix and their standard deviations

Eigenvalue 1	Eigenvalue 2	Eigenvalue 3
$\lambda_1 = 0.90 \pm 0.01$	$\lambda_2 = 0.0569 \pm 0.0002$	$\lambda_3 = -0.93 \pm 0.01$

