Journal of Nanjing University (Natural Sciences), 2019, Vol. 55, Issue (1): 102-109. doi: 10.13232/j.cnki.jnju.2019.01.010
Yan Yunyang (严云洋)1,2*, Qu Xuexin (瞿学新)1,2, Zhu Quanyin (朱全银)1, Li Xiang (李翔)1, Zhao Yang (赵阳)1
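Abstract: To assess the predictions that a web page classification model makes on web logs, and to add the predictions judged trustworthy to the URL classification set so that visited links in web logs can be classified more efficiently, this paper proposes a method for measuring the confidence of classification results based on outlier detection. Bagging is used to build multiple weak classifiers that predict the data to be classified, a probability vector over the classes is constructed for each prediction, and outlier detection is then used to decide whether the model's prediction is trustworthy. Comparison experiments were carried out on UCI public datasets using two mainstream measures, one based on k-means and one based on local density. The results show that, when the outlier-detection-based classification confidence is applied, both the k-means-based measure and the local-density-based measure significantly improve accuracy. The same effect was also obtained when classifying web pages crawled in an engineering project.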
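The pipeline outlined in the abstract (bagged weak classifiers, a per-class probability vector, then an outlier test on that vector) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-centroid weak learner, the vote-fraction probability vector, the centroid-distance cutoff (mean plus two standard deviations, a simple stand-in for the k-means-based measure), the toy data, and every function name are hypothetical choices made for the example.

```python
import math
import random
import statistics
from collections import Counter

def train_centroid_classifier(samples):
    """Weak learner (assumed for illustration): one mean vector per class."""
    sums, counts = {}, Counter()
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_label(model, x):
    """Predict the class whose centroid is nearest to x."""
    return min(model, key=lambda y: math.dist(model[y], x))

def bagging_fit(samples, n_models, rng):
    """Bagging: train each weak learner on a bootstrap resample (with replacement)."""
    return [train_centroid_classifier(rng.choices(samples, k=len(samples)))
            for _ in range(n_models)]

def probability_vector(models, x, classes):
    """Fraction of ensemble votes received by each class."""
    votes = Counter(predict_label(m, x) for m in models)
    return [votes[c] / len(models) for c in classes]

def fit_outlier_detector(ref_vectors, classes):
    """Per predicted class: centroid of reference vectors and a distance cutoff."""
    detector = {}
    for idx, c in enumerate(classes):
        group = [v for v in ref_vectors if v.index(max(v)) == idx]
        if not group:
            continue
        centroid = [sum(col) / len(group) for col in zip(*group)]
        dists = [math.dist(v, centroid) for v in group]
        # Assumed cutoff: mean distance plus two population standard deviations.
        cutoff = statistics.mean(dists) + 2 * statistics.pstdev(dists)
        detector[c] = (centroid, cutoff)
    return detector

def assess(models, detector, x, classes):
    """Return (label, probability vector, trusted?) for one sample."""
    vec = probability_vector(models, x, classes)
    label = classes[vec.index(max(vec))]
    if label not in detector:
        return label, vec, False
    centroid, cutoff = detector[label]
    # A vector far from its class centroid is treated as an outlier (untrusted).
    return label, vec, math.dist(vec, centroid) <= cutoff

# Toy data (hypothetical): two well-separated 2-D classes on small grids.
rng = random.Random(0)
classes = ["a", "b"]
train = [((0.25 * (i % 5), 0.25 * (i // 5)), "a") for i in range(20)]
train += [((10 + 0.25 * (i % 5), 10 + 0.25 * (i // 5)), "b") for i in range(20)]

models = bagging_fit(train, 25, rng)
detector = fit_outlier_detector(
    [probability_vector(models, x, classes) for x, _ in train], classes)

clear = assess(models, detector, (0.3, 0.2), classes)        # deep inside class "a"
ambiguous = assess(models, detector, (5.5, 5.375), classes)  # midway between classes
print(clear)
print(ambiguous)
```

On this cleanly separated toy data every reference vector is unanimous, so the cutoff collapses to zero and any disagreement among the weak learners marks a prediction as untrusted. On real data the reference vectors would vary, and the cutoff (or a local-density score in its place) would need to be tuned.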