Journal of Nanjing University (Natural Science) ›› 2019, Vol. 55 ›› Issue (1): 29–40. doi: 10.13232/j.cnki.jnju.2019.01.003


Research on the method of network security entity recognition based on deep neural network

Qin Ya1,2, Shen Guowei1,2*, Zhao Wenbo1, Chen Yanping1,2

  1. College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China; 2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang, 550025, China
  • Accepted: 2018-09-01 Online: 2019-02-01 Published: 2019-01-26
  • Contact: Shen Guowei, E-mail: gwshen@gzu.edu.cn
  • Funding:
    National Natural Science Foundation of China (61802081), Natural Science Foundation of Guizhou Province (20161052), Open Project of the Guizhou Provincial Key Laboratory of Public Big Data (2017BDKFJJ024), Guizhou University Doctoral Fund (201526)

Abstract: With the continuous development of Internet technology, network security threat intelligence analysis based on a security knowledge graph (SKG) can analyze multi-source threat intelligence data at a fine granularity, and it has therefore received extensive attention. Traditional named entity recognition (NER) methods struggle to identify security entities that are new or that mix Chinese and English, and the features they extract are insufficient, so they cannot recognize network security entities accurately. Building on a deep neural network model, this paper proposes a CNN-BiLSTM-CRF network security entity recognition method combined with a feature template (FT-CNN-BiLSTM-CRF). The hand-crafted feature template extracts local context features, and the neural network model automatically extracts character features and global text features. First, each character of the input sequence is converted into a character vector, and a convolutional neural network (CNN) extracts character-level features. Second, the character-level feature vectors are fed into a bidirectional long short-term memory network (BiLSTM) together with the local context vectors extracted by the feature template, and the BiLSTM automatically extracts the global features of the security entities. Finally, a conditional random field (CRF) labels the sequence to produce the recognized security entities. Experimental results show that the proposed method reaches an F-score of 86% on a large-scale network security dataset and outperforms the other methods compared.

Key words: network security entity recognition, feature template, CNN, BiLSTM, CRF
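
The abstract says that a feature template supplies local context features for each character before the BiLSTM stage, but it does not list the template itself. The following is a minimal sketch of what a character-window template could look like, not the authors' implementation. The offsets in `TEMPLATE`, the padding token `PAD`, the coarse `char_type` classes, and all function names are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical template: each tuple lists the relative offsets of the characters
# that are combined into one feature. The paper's actual template is not given
# in the abstract, so these windows are an assumption.
TEMPLATE = [(-2,), (-1,), (0,), (1,), (2,), (-1, 0), (0, 1)]
PAD = "<PAD>"  # assumed placeholder for positions outside the sentence


def char_type(ch: str) -> str:
    """Return a coarse character class, useful for mixed Chinese-English text."""
    if ch == PAD:
        return "PAD"
    if ch.isascii() and ch.isalpha():
        return "LATIN"
    if ch.isdigit():
        return "DIGIT"
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        return "CJK"
    return "OTHER"


def extract_features(chars: List[str], i: int) -> Dict[str, str]:
    """Apply every template entry to position i and return named features."""

    def at(k: int) -> str:
        j = i + k
        return chars[j] if 0 <= j < len(chars) else PAD

    feats: Dict[str, str] = {}
    for offsets in TEMPLATE:
        name = "C[" + ",".join(f"{k:+d}" for k in offsets) + "]"
        feats[name] = "|".join(at(k) for k in offsets)
    feats["T[+0]"] = char_type(chars[i])
    return feats


def sentence_features(text: str) -> List[Dict[str, str]]:
    """Extract local context features for every character of a sentence."""
    chars = list(text)
    return [extract_features(chars, i) for i in range(len(chars))]


if __name__ == "__main__":
    # Mixed Chinese-English example sentence (illustrative, not from the dataset).
    for feats in sentence_features("检测到WannaCry勒索病毒")[2:5]:
        print(feats)
```

In a pipeline like the one described, each feature dictionary would be mapped to a vector and concatenated with the CNN character-level vector at the same position before entering the BiLSTM. How the authors encode these features is not stated in the abstract.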

CLC number:

  • TP391