Journal of Nanjing University (Natural Science) ›› 2020, Vol. 56 ›› Issue (2): 270-277. doi: 10.13232/j.cnki.jnju.2020.02.013
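Abstract:
Now that network information technology is highly mature, sensitive words of all kinds, including pornographic, violent, and politically sensitive terms, pervade websites and social applications, and detecting and identifying them is essential to maintaining a healthy online environment. The vast majority of these sensitive words are disguised with characters of similar pronunciation or similar shape in order to evade detection systems. Existing matching algorithms can detect words whose pronunciation is exactly the same, but cannot accurately identify variant characters that are merely similar in pronunciation or shape. To address this problem, a Chinese character similarity comparison algorithm for fuzzy matching is proposed. First, by applying a special encoding to Chinese characters, an improved sound-shape code (音形码) character similarity algorithm is proposed that jointly considers pronunciation and glyph features. Then a precision parameter is added to the traditional trie (dictionary tree) to set the matching precision, and sensitive word detection is carried out on this basis. Experimental results show that, on a dataset of commonly confused similar characters, matching accuracy improves by 8% to 39% and the error rate falls by 6% to 38%.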
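The idea of a trie with a precision parameter can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the sound-shape encoding, so `TOY_CODES`, `char_similarity`, and `FuzzyTrie` are hypothetical names, and the similarity used here is a deliberately crude stand-in (same toneless pinyin plus shared components) chosen only to show how the threshold controls matching.

```python
from typing import Dict, List, Tuple

# Hypothetical toy codes: (toneless pinyin, rough component list).
# This is NOT the paper's sound-shape code; it only makes the sketch runnable.
TOY_CODES: Dict[str, Tuple[str, str]] = {
    "赌": ("du", "贝者"),
    "睹": ("du", "目者"),
    "博": ("bo", "十尃"),
    "搏": ("bo", "扌尃"),
}

def char_similarity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]: half sound match, half component overlap."""
    if a == b:
        return 1.0
    if a not in TOY_CODES or b not in TOY_CODES:
        return 0.0
    sound_a, shape_a = TOY_CODES[a]
    sound_b, shape_b = TOY_CODES[b]
    sound = 1.0 if sound_a == sound_b else 0.0
    shared = len(set(shape_a) & set(shape_b))
    shape = shared / max(len(set(shape_a)), len(set(shape_b)))
    return 0.5 * sound + 0.5 * shape

END = ""  # sentinel key marking the end of a stored word

class FuzzyTrie:
    """Trie whose edges match any character at least `precision` similar."""

    def __init__(self, precision: float) -> None:
        self.precision = precision
        self.root: dict = {}

    def add(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word

    def find(self, text: str) -> List[Tuple[int, str, str]]:
        """Return (start index, matched text, stored word) for every hit."""
        hits: List[Tuple[int, str, str]] = []
        for start in range(len(text)):
            self._walk(self.root, text, start, start, hits)
        return hits

    def _walk(self, node: dict, text: str, start: int, pos: int,
              hits: List[Tuple[int, str, str]]) -> None:
        if END in node:
            hits.append((start, text[start:pos], node[END]))
        if pos == len(text):
            return
        for key, child in node.items():
            if key == END:
                continue
            # Follow the edge only if the text character is similar enough.
            if char_similarity(key, text[pos]) >= self.precision:
                self._walk(child, text, start, pos + 1, hits)

loose = FuzzyTrie(precision=0.7)
loose.add("赌博")
print(loose.find("今天睹搏吗"))   # disguised spelling is caught

strict = FuzzyTrie(precision=1.0)
strict.add("赌博")
print(strict.find("今天睹搏吗"))  # exact matching misses it
```

With `precision=1.0` the structure behaves like an ordinary exact-match trie. Lowering the threshold lets visually or phonetically similar characters follow the same edge, at the cost of more candidate branches and a higher risk of false positives.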