Journal of Nanjing University (Natural Science) ›› 2020, Vol. 56 ›› Issue (2): 270–277. doi: 10.13232/j.cnki.jnju.2020.02.013


Chinese sensitive words detection algorithm based on improved sound-character code

Hao Zhou1,2, Qinghong Shen1

  1. School of Electronic Science and Engineering, Nanjing University, Nanjing, 210023, China
  2. Jiangsu Jinxiao Electronic Information Co., Ltd., Nanjing, 210023, China
  • Received: 2019-11-06 Online: 2020-03-30 Published: 2020-04-02
  • Contact: Qinghong Shen E-mail: qhshen@nju.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61673301); Natural Science Foundation of Jiangsu Province (BK20151299)

Abstract:

Network information technology is now mature, yet sensitive words (pornographic, violent, politically sensitive, and other harmful terms) flood websites and social applications, and detecting and identifying them is necessary for a healthy online environment. Most of these words are disguised with characters of similar pronunciation or similar glyph shape in order to evade detection systems. Existing matching algorithms can detect words whose pronunciation is exactly the same, but cannot accurately identify variant characters with similar pronunciation or similar shape. To address this problem, a Chinese character similarity comparison algorithm for fuzzy matching is proposed. First, Chinese characters are given a special encoding, and an improved sound-shape code similarity algorithm that jointly considers pronunciation and glyph features is proposed. Then a precision parameter is added to the traditional dictionary tree (trie) to set the matching precision, which completes sensitive word detection. Experimental results show that, on a dataset of commonly used similar Chinese characters, matching accuracy improves by 8%~39% and the error rate falls by 6%~38%.

Key words: sensitive words, fuzzy matching, Chinese character encoding, Chinese character similarity, exact matching

CLC Number:

  • TP391

Table 1. Encoding of pinyin initials

b  = 00000    p  = 00001    m  = 00011    f = 00010
d  = 00111    t  = 00101    n  = 00100    l = 01100
g  = 01111    k  = 01110    h  = 01010
j  = 01001    q  = 01000    x  = 11000
zh = 11011    ch = 11010    sh = 11110    r = 11111
z  = 11011    c  = 11010    s  = 11110

Table 2. Encoding of pinyin finals

a = 00000    ai = 00001    ao  = 00011    an = 00010    ang = 00010
i = 00111    ie = 00101    iu  = 00100    in = 01100    ing = 01100
o = 01111    ou = 01110    ong = 01010
e = 01001    ei = 01000    er  = 11000    en = 11001    eng = 11001
u = 11010    ui = 11110    un  = 11111
ü = 11100    üe = 10100    ün  = 10101
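Tables 1 and 2 assign each pinyin initial and final a 5-bit code, and neighboring entries within a row differ in at most one bit. The following is a minimal sketch of how such codes could be compared, assuming that Hamming distance over the concatenated initial and final codes is an acceptable closeness measure. The excerpt does not give the paper's actual similarity formula, and the names `hamming` and `syllable_similarity` are illustrative only.

```python
# Codes copied from Tables 1 and 2. The similarity measure is an assumption.
INITIALS = {
    "b": "00000", "p": "00001", "m": "00011", "f": "00010",
    "d": "00111", "t": "00101", "n": "00100", "l": "01100",
    "g": "01111", "k": "01110", "h": "01010",
    "j": "01001", "q": "01000", "x": "11000",
    "zh": "11011", "ch": "11010", "sh": "11110", "r": "11111",
    "z": "11011", "c": "11010", "s": "11110",
}
FINALS = {
    "a": "00000", "ai": "00001", "ao": "00011", "an": "00010", "ang": "00010",
    "i": "00111", "ie": "00101", "iu": "00100", "in": "01100", "ing": "01100",
    "o": "01111", "ou": "01110", "ong": "01010",
    "e": "01001", "ei": "01000", "er": "11000", "en": "11001", "eng": "11001",
    "u": "11010", "ui": "11110", "un": "11111",
    "ü": "11100", "üe": "10100", "ün": "10101",
}


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def syllable_similarity(init_a: str, final_a: str,
                        init_b: str, final_b: str) -> float:
    """Assumed measure: share of matching bits in the 10-bit sound code."""
    code_a = INITIALS[init_a] + FINALS[final_a]
    code_b = INITIALS[init_b] + FINALS[final_b]
    return 1 - hamming(code_a, code_b) / len(code_a)
```

Under this assumption, `syllable_similarity("sh", "an", "s", "ang")` evaluates to 1.0, because Tables 1 and 2 give sh/s and an/ang identical codes.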

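Table 3. Examples and codes of Chinese character structures [4]

Structure                            Example characters    Code
Single-component                     个、大、天              0000
Left-right                           好、级、利              0001
Left-middle-right                    搬、撇、鞭              0011
Top-bottom                           思、定、替              0010
Top-middle-bottom                    衮、亵                 0110
Upper-left enclosure                 厅、库、店              0111
Lower-left enclosure                 赵、远、尬              0101
Upper-right enclosure                旬、武、习              0100
Three-sided enclosure, open below    闩、肉、周              1100
Three-sided enclosure, open above    凶、函                 1101
Three-sided enclosure, open right    区、叵                 1111
Full enclosure                       国、回、因              1110
Interleaved structure                兆、非                 1010
品-shaped (triangular) structure     磊、焱                 1011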
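Read down the left column and then the right, the 14 structure codes in Table 3 coincide with the first 14 values of the 4-bit reflected Gray code, so consecutive structures differ in exactly one bit. The sketch below checks that observation. The English structure names and the helper names `gray_code` and `hamming` are assumptions introduced for illustration, and whether the paper uses bit distance between structure codes is not stated in this excerpt.

```python
# Structure codes copied from Table 3 (left column first, then right column).
# The English names are translations chosen for this sketch.
STRUCTURES = [
    ("single-component", "0000"),
    ("left-right", "0001"),
    ("left-middle-right", "0011"),
    ("top-bottom", "0010"),
    ("top-middle-bottom", "0110"),
    ("upper-left enclosure", "0111"),
    ("lower-left enclosure", "0101"),
    ("upper-right enclosure", "0100"),
    ("three-sided enclosure, open below", "1100"),
    ("three-sided enclosure, open above", "1101"),
    ("three-sided enclosure, open right", "1111"),
    ("full enclosure", "1110"),
    ("interleaved", "1010"),
    ("品-shaped", "1011"),
]
CODES = [code for _, code in STRUCTURES]


def gray_code(i: int, width: int = 4) -> str:
    """Return the i-th reflected Gray code as a zero-padded bit string."""
    return format(i ^ (i >> 1), f"0{width}b")


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


# True when the table order equals the Gray code sequence 0..13.
matches_gray = CODES == [gray_code(i) for i in range(len(CODES))]

# Every pair of consecutive structures differs in exactly one bit.
one_bit_steps = all(hamming(a, b) == 1 for a, b in zip(CODES, CODES[1:]))
```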

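Table 4. Corner-selection method of the four-corner code [5]

Columns: stroke name, code, stroke shape, example. Codes listed: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.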

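Table 5. Generalization comparison on sample character pairs

Character pair    Edit distance    Original sound-shape code    Proposed encoding method
狼/琅              0.6              0.875                        0.935
琅/娘              0.6              0.85                         0.830
大/太              0.6              0.75                         0.935
太/态              0.7              0.875                        0.9122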
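Table 5 lists an edit-distance score as one baseline, but this excerpt does not say which strings it is computed over or how it is normalized. As a hedged illustration only, the sketch below computes a normalized Levenshtein similarity between two strings. The names `levenshtein` and `edit_similarity` and the normalization by the longer length are assumptions, so it is not expected to reproduce the values in Table 5.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein(a, b) / longest
```

Because the score depends only on string edits, it cannot by itself distinguish a substitution between similar-looking characters from one between unrelated characters, which is the gap a sound-shape encoding is meant to address.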

Figure 1. Dictionary tree (trie) structure

Figure 2. Exact matching flow

Figure 3. Fuzzy matching flow
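The abstract describes adding a precision parameter to a dictionary tree so that matching can be exact or fuzzy (Figures 2 and 3). The sketch below is one possible reading, not the authors' implementation: a trie whose child lookup accepts any character whose similarity to the stored character reaches `precision`. The class `FuzzyTrie`, the function `toy_similarity`, and the sample word are hypothetical; the pairwise scores are taken from the last column of Table 5.

```python
from typing import Callable, Dict, List, Tuple

# Pairwise scores from the last column of Table 5; all other pairs score 0.
TABLE5_SCORES = {
    frozenset(("狼", "琅")): 0.935,
    frozenset(("琅", "娘")): 0.830,
    frozenset(("大", "太")): 0.935,
    frozenset(("太", "态")): 0.9122,
}


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for a sound-shape code similarity function."""
    if a == b:
        return 1.0
    return TABLE5_SCORES.get(frozenset((a, b)), 0.0)


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


class FuzzyTrie:
    """Trie whose lookups accept characters with similarity >= precision."""

    def __init__(self, similarity: Callable[[str, str], float],
                 precision: float = 1.0) -> None:
        self.root = TrieNode()
        self.similarity = similarity
        self.precision = precision

    def insert(self, word: str) -> None:
        if not word:
            return  # ignore empty entries so the root never marks a word
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _ends_from(self, text: str, start: int) -> List[int]:
        """Exclusive end indices of all stored words matched from start."""
        ends: List[int] = []
        stack = [(self.root, start)]
        while stack:
            node, i = stack.pop()
            if node.is_word:
                ends.append(i)
            if i >= len(text):
                continue
            for ch, child in node.children.items():
                if self.similarity(ch, text[i]) >= self.precision:
                    stack.append((child, i + 1))
        return ends

    def search(self, text: str) -> List[Tuple[int, str]]:
        """Return (start index, matched substring) for every hit in text."""
        hits: List[Tuple[int, str]] = []
        for start in range(len(text)):
            for end in self._ends_from(text, start):
                hits.append((start, text[start:end]))
        return hits


fuzzy = FuzzyTrie(toy_similarity, precision=0.9)
fuzzy.insert("大狼")  # hypothetical dictionary entry
fuzzy_hits = fuzzy.search("一只太琅来了")

exact = FuzzyTrie(toy_similarity, precision=1.0)
exact.insert("大狼")
exact_hits = exact.search("一只太琅来了")
```

With `precision=1.0` the same trie only accepts identical characters, which corresponds to exact matching; lowering the threshold trades fewer misses for more false positives.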

Figure 4. Similarity distribution of similar character pairs

Figure 5. Similarity distribution of dissimilar character pairs

References:

1 Xiang S. The trie tree analysis and realization of algorithm. Master Dissertation. Taiyuan: Taiyuan University of Science and Technology, 2017.
2 Li W, Hong Q, Teng Z J. Implementation of algorithm based on n-gram string segmentation. Computer and Modernization, 2010(9): 85-87, 91.
3 Chen M, Du Q Z, Shao Y B, et al. Chinese characters similarity comparison algorithm based on phonetic code and shape code. Information Technology, 2018(11): 73-75.
4 Xu S. Shuowen Jiezi. Beijing: China Theatre Press, 2008: 1-100. (in Chinese)
5 Xu Z Y. Wang Yunwu and quarter number character checking method. Lexicographical Studies, 1990(6): 128-134.
6 Shao Q, Ye K. Chinese character string matching algorithm based on improved edit distance and similarity. Electronic Science and Technology, 2016, 29(9): 7-11.
7 Shi Y X, Wen H, Gong P, et al. Projection of semantics and retrieval in natural scenery images based on fuzzy nerve network. Computer Science, 2013, 40(12): 122-126.
8 Jing Z Y. The research and implement of fast and efficient multi-pattern matching algorithm. Master Dissertation. Xi'an: Xidian University, 2017.
9 Feng W. Chinese word segmentation algorithm based on the N-gram model for resolving segmentation ambiguity. Scientific Research (科研), 2017(1): 87-88. (in Chinese)
10 Huang X Y, Xie J, Long S Y. Question similarity algorithm based on common chunks and N-gram model. Journal of Chongqing Institute of Technology (Natural Science), 2017, 30(10): 175-179, 197.
11 Wang D R, Yang G C, Fu J Y. A comparative analysis of English name recognition criterion and the validity of the patent inventor. Digital Library Forum, 2016(8): 2-9.
12 Zhou W D. An analysis of the current Chinese characters similar in form. Journal of Southwest University (Humanities and Social Sciences Edition), 2000, 26(3): 125-129.
13 Yuan K M. A complete collection of similar characters. Baidu Wenku, 2018: 1-150. (in Chinese)
14 Sun H, Zhang H. Survey on Chinese character recognition method. Computer Engineering, 2010, 36(20): 194-197.
15 Yang Z W, Wang X, Yao L. Chinese multi-pattern fuzzy matching algorithm based on double-pinyin mapping // Proceedings of the 1st National Conference on Information Retrieval and Content Security. Shanghai: Chinese Information Processing Society of China, 2004. (in Chinese)
16 Du A Y, Li L S, Zhu Y, et al. Research on eliminating duplicate records based on Chinese character code. Computer Knowledge and Technology, 2009, 5(29): 8314-8316.
17 Wu F, Cai Y X. A Chinese message sensitive words filtering system based on DFA and Word2vec. Procedia Computer Science, 2018, 139: 293-298.