Journal of Nanjing University (Natural Sciences) ›› 2019, Vol. 55 ›› Issue (1): 102–109. doi: 10.13232/j.cnki.jnju.2019.01.010


Confidence measure method of classification results based on outlier detection

Yan Yunyang1,2*, Qu Xuexin1,2, Zhu Quanyin1, Li Xiang1, Zhao Yang1

  1. Faculty of Computer and Software Engineering, Huaiyin Institute of Technology, Huai'an, 223003, China; 2. School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, 621010, China
  • Accepted: 2018-12-10 Online: 2019-02-01 Published: 2019-01-26
  • Contact: Yan Yunyang, E-mail: yunyang@hyit.edu.cn
  • Supported by:
    Jiangsu Province "Six Talent Peaks" Project (2013DZXX-023), Jiangsu Province "Qing Lan Project", Jiangsu Province Key Research and Development Program (BE2015127)

Abstract: To measure the reliability of predictions made by a webpage classification model on web logs, and to add predictions judged credible to the URL classification set so that links visited in the logs can be classified more efficiently, a confidence measure method for classification results based on outlier detection is proposed. Multiple weak classifiers built with Bagging predict the data to be classified, a vector of per-class probabilities is constructed for each prediction, and outlier detection decides whether the prediction is credible. Comparative experiments on UCI public datasets use the mainstream k-means-based and local-density-based measures. The results show that, with the outlier-detection-based confidence applied, both the k-means-based and the local-density-based measures significantly improve accuracy. The same effect is obtained when classifying web pages crawled in an engineering project.

Key words: outliers, webpage classification, k-means, LOF algorithm

CLC Number: 

  • TP391.1
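
The pipeline outlined in the abstract (Bagging of weak classifiers, a per-class probability vector for each prediction, then an outlier test on that vector) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the nearest-centroid weak learner, the redraw of bootstrap samples that miss a class, the single-centroid distance test with a mean plus two standard deviations cutoff (a simplified stand-in for a k-means-based measure), the hypothetical `min_radius` floor, and the toy data.

```python
import math
import random
import statistics
from collections import Counter


def euclid(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def fit_centroids(X, y):
    """Weak learner (assumed): one mean vector per class."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: [statistics.fmean(col) for col in zip(*pts)]
            for c, pts in groups.items()}


def predict_one(model, x):
    """Label of the class centroid nearest to x."""
    return min(model, key=lambda c: euclid(model[c], x))


def bagging_fit(X, y, n_models=25, seed=0):
    """Train weak learners on bootstrap resamples of (X, y)."""
    rng = random.Random(seed)
    n, classes = len(X), set(y)
    models = []
    while len(models) < n_models:
        idx = [rng.randrange(n) for _ in range(n)]
        if {y[i] for i in idx} != classes:
            continue  # assumption: redraw when a class is missing
        models.append(fit_centroids([X[i] for i in idx],
                                    [y[i] for i in idx]))
    return models


def probability_vector(models, x, classes):
    """Share of ensemble votes received by each class."""
    votes = Counter(predict_one(m, x) for m in models)
    return [votes[c] / len(models) for c in classes]


def fit_profiles(models, X, classes, min_radius=0.1):
    """Per predicted class: centroid of reference vectors and a cutoff."""
    by_label = {}
    for x in X:
        p = probability_vector(models, x, classes)
        by_label.setdefault(classes[p.index(max(p))], []).append(p)
    profiles = {}
    for label, vecs in by_label.items():
        center = [statistics.fmean(col) for col in zip(*vecs)]
        dists = [euclid(v, center) for v in vecs]
        cutoff = statistics.fmean(dists) + 2 * statistics.pstdev(dists)
        # min_radius (hypothetical) avoids a zero cutoff when all
        # reference vectors are identical
        profiles[label] = (center, max(cutoff, min_radius))
    return profiles


def is_credible(p, classes, profiles):
    """Return (label, credible): p is an outlier if it lies beyond the cutoff."""
    label = classes[p.index(max(p))]
    if label not in profiles:
        return label, False
    center, cutoff = profiles[label]
    return label, euclid(p, center) <= cutoff


# Toy data (illustrative only): two well-separated 2-D classes.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5),
     (9.0, 9.0), (10.0, 9.0), (9.0, 10.0), (10.0, 10.0), (9.5, 9.5)]
y = ["a"] * 5 + ["b"] * 5
classes = ["a", "b"]
models = bagging_fit(X, y)
profiles = fit_profiles(models, X, classes)
```

In this sketch a prediction counts as credible only when its probability vector lies close to the vectors the ensemble produced for reference samples of the same predicted class. A local-density test such as LOF could replace the centroid-distance test in `is_credible` without changing the rest of the pipeline.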