Journal of Nanjing University (Natural Science) ›› 2020, Vol. 56 ›› Issue (1): 67–73. doi: 10.13232/j.cnki.jnju.2020.01.008


An active learning method for data stream adaptive classification

Yinfang Zhang1, Hong Yu1, Guoyin Wang1, Yongfang Xie2

  1. Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
  2. School of Information Science and Engineering, Central South University, Changsha, 410083, China
  • Received: 2019-08-01 Online: 2020-01-30 Published: 2020-01-10
  • Contact: Hong Yu E-mail: yuhong@cqupt.edu.cn
  • Funding: National Natural Science Foundation of China (61876027)

Abstract:

Concept drift causes the accuracy of a data stream classification model to degrade over time, so the model must be able to adapt. However, most existing data stream classifiers that adapt to concept drift ignore the scarcity of labels in a data stream. They usually assume that the true label of each incoming instance becomes available right after the model outputs its prediction. This assumption is unreasonable in some cases, because labeling data tends to be costly and time-consuming. In this paper, a new active learning method, ALBOD (Active Learning based on Boundary and Outlier Detection), is proposed to address the scarcity of data stream labels. The method accounts for the sampling bias that may arise during active learning by combining an uncertainty-based active learning criterion with the BOD (Boundary and Outlier Detection) method. The first criterion is the most common one in active learning and selects the instances that are most uncertain in terms of class membership. The second curbs sampling bias by exploiting the fact that boundary samples and outliers reflect the feature space. Compared with the fully labeled (100% labels) algorithms OzaBagAdwin (OBA) and HoeffdingAdaptiveTree (HAT), ALBOD lets the classifier maintain comparable accuracy under concept drift while learning from about 20% of the labels on average. The experiments show that although ALBOD is a simple combination of the two criteria, it has good active learning ability.

Key words: data stream, concept drift, active learning, adaptive classification
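
The abstract describes a query rule that asks for a label when an instance is uncertain or when it looks like a boundary point or outlier. The following is a minimal sketch of that idea, not the authors' ALBOD implementation: the nearest-centroid classifier, the function names, and both thresholds are hypothetical. The BOD procedure itself is not described on this page, so a plain distance threshold stands in for it, and drift handling (forgetting old data) is omitted.

```python
import math


class OnlineCentroidClassifier:
    """Tiny incremental nearest-centroid classifier (illustrative stand-in)."""

    def __init__(self):
        self.sums = {}    # class label -> per-feature running sum
        self.counts = {}  # class label -> number of labeled instances seen

    def learn(self, x, y):
        if y not in self.sums:
            self.sums[y] = [0.0] * len(x)
            self.counts[y] = 0
        self.sums[y] = [s + v for s, v in zip(self.sums[y], x)]
        self.counts[y] += 1

    def distances(self, x):
        """Euclidean distance from x to the centroid of each known class."""
        return {
            y: math.dist(x, [v / self.counts[y] for v in s])
            for y, s in self.sums.items()
        }

    def predict(self, x):
        d = self.distances(x)
        return min(d, key=d.get) if d else None


def should_query(model, x, margin_threshold=0.2, outlier_threshold=2.0):
    """Decide whether to request the true label of x.

    Criterion 1 (uncertainty): the two nearest centroids are almost equally far.
    Criterion 2 (assumed stand-in for BOD): x is far from every centroid.
    """
    d = sorted(model.distances(x).values())
    if len(d) < 2:
        return True  # fewer than two classes known so far: always ask
    uncertain = (d[1] - d[0]) < margin_threshold
    outlier = d[0] > outlier_threshold
    return uncertain or outlier


def run_stream(stream):
    """Test-then-train loop; returns (accuracy, labeled_fraction)."""
    model = OnlineCentroidClassifier()
    correct = queried = total = 0
    for x, y in stream:
        total += 1
        if model.predict(x) == y:  # test first, using the current model
            correct += 1
        if should_query(model, x):
            model.learn(x, y)      # the label is consumed only when queried
            queried += 1
    return correct / total, queried / total


# Hand-made toy stream: two easy clusters, one ambiguous point, one outlier.
toy_stream = [
    ([0.0, 0.0], 0), ([4.0, 0.0], 1),
    ([0.5, 0.0], 0), ([3.5, 0.0], 1),
    ([0.2, 0.1], 0), ([3.8, -0.1], 1),
    ([2.05, 0.0], 1),   # near the midpoint: queried as uncertain
    ([-3.0, 0.0], 0),   # far from both centroids: queried as an outlier
]
accuracy, labeled = run_stream(toy_stream)
print(f"accuracy={accuracy:.2f}, labeled fraction={labeled:.2f}")
```

On this toy stream only the two initial instances, the ambiguous point, and the outlier are sent for labeling. The four easy instances are classified without spending any label budget.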

CLC number:

  • TP391

Fig. 1

Data stream adaptive classification model based on active learning

Fig. 2

Active learning method combining uncertainty and BOD

Table 1

Datasets used in the experiments

Datasets    Objects   Attributes   Classes
SEAa        10000     3            2
SEAg        10000     3            2
RBF0.003    100000    10           2
Weather     18159     8            2

Table 2

Classification performance compared with the active learning algorithm ALU

Datasets    ALU[22]                  ALBOD
            ACC (%)   Labeled (%)    ACC (%)   Labeled (%)
SEAa        81.2      10             83.16     9.92
SEAg        81.5      10             82.95     9.79
RBF0.003    48.8      10             50.94     4.27
Weather     42.1      10             73.30     1.56

Table 3

Classification performance compared with the fully labeled algorithms OBA, HAT and ACM

Datasets    OBA[22]   HAT[22]   ACM                     ALBOD
            ACC (%)   ACC (%)   ACC (%) (Labeled (%))   ACC (%) (Labeled (%))
SEAa        84.21     84.98     83.31 (100)             83.16 (9.92)
SEAg        83.51     83.44     83.29 (100)             83.36 (10.75)
RBF0.003    62.61     59.14     57.37 (100)             55.53 (22.28)
Weather     71.51     71.33     77.98 (100)             77.14 (20.82)
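
To make the label-cost trade-off in the table above easier to read, the short sketch below computes, for each dataset, the accuracy gap between ALBOD and the best fully labeled method, plus the unweighted mean of ALBOD's label percentages. The numbers are copied from the table; the dictionary layout and the `summarize` helper are illustrative assumptions. This page does not say how the authors averaged the label budgets, so the mean here is only one possible reading.

```python
# Accuracy (%) per method and the share of labels (%) ALBOD requested.
table3 = {
    "SEAa":     {"OBA": 84.21, "HAT": 84.98, "ACM": 83.31, "ALBOD": 83.16, "labeled": 9.92},
    "SEAg":     {"OBA": 83.51, "HAT": 83.44, "ACM": 83.29, "ALBOD": 83.36, "labeled": 10.75},
    "RBF0.003": {"OBA": 62.61, "HAT": 59.14, "ACM": 57.37, "ALBOD": 55.53, "labeled": 22.28},
    "Weather":  {"OBA": 71.51, "HAT": 71.33, "ACM": 77.98, "ALBOD": 77.14, "labeled": 20.82},
}


def summarize(table):
    """Return ({dataset: ALBOD minus best fully labeled accuracy}, mean labeled %)."""
    gaps = {}
    for name, row in table.items():
        best_full = max(row["OBA"], row["HAT"], row["ACM"])
        gaps[name] = round(row["ALBOD"] - best_full, 2)
    mean_labeled = round(sum(r["labeled"] for r in table.values()) / len(table), 2)
    return gaps, mean_labeled


gaps, mean_labeled = summarize(table3)
for name, gap in gaps.items():
    print(f"{name}: ALBOD minus best fully labeled = {gap:+.2f} points")
print(f"unweighted mean labeled share = {mean_labeled:.2f}%")
```

A negative gap means ALBOD is below the best fully labeled method on that dataset.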

References:

1 Schlimmer J C, Granger R H Jr. Incremental learning from noisy data. Machine Learning, 1986, 1(3): 317-354.
2 Zheng C B, Wen L J, Wang J M. Process concept drift detection based on extensible activity relationship. Computer Integrated Manufacturing Systems, 2018, 24(7): 1589-1597. (in Chinese)
3 Ditzler G, Roveri M, Alippi C, et al. Learning in nonstationary environments: a survey. IEEE Computational Intelligence Magazine, 2015, 10(4): 12-25.
4 ZareMoodi P, Beigy H, Siahroudi S K. Novel class detection in data streams using local patterns and neighborhood graph. Neurocomputing, 2015, 158: 234-245.
5 Sun Y G, Wang Z H, Yuan J D, et al. Adaptive ensemble classification algorithm for data streams based on information entropy. Journal of University of Science and Technology of China, 2017, 47(7): 575-582. (in Chinese)
6 Ahmadi Z, Beigy H. Semi-supervised ensemble learning of data streams in the presence of concept drift∥The 7th International Conference on Hybrid Artificial Intelligence Systems. Springer Berlin Heidelberg, 2012: 526-537.
7 Haque A, Khan L, Baron M. Sand: semi-supervised adaptive novel class detection and classification over data stream∥The 30th AAAI Conference on Artificial Intelligence. Phoenix, AZ, USA: AAAI, 2016: 1652-1658.
8 Settles B. Active learning literature survey. Computer Sciences Technical Report 1648. Madison: University of Wisconsin-Madison, 2009: 3-4.
9 Fu Y F, Zhu X Q, Li B. A survey on instance selection for active learning. Knowledge and Information Systems, 2013, 35(2): 249-283.
10 Mohamad S, Bouchachia A, Sayed-Mouchaweh M. A bi-criteria active learning algorithm for dynamic data streams. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(1): 74-86.
11 Zliobaite I, Bifet A, Pfahringer B, et al. Active learning with drifting streaming data. IEEE Transactions on Neural Networks and Learning Systems, 2013, 25(1): 27-39.
12 Mohamad S, Sayed-Mouchaweh M, Bouchachia A. Active learning for classifying data streams with unknown number of classes. Neural Networks, 2018, 98: 1-15.
13 Li Y, Maguire L. Selecting critical patterns based on local geometrical and statistical information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(6): 1189-1201.
14 Li X J, Lv J C, Yi Z. An efficient representation-based method for boundary point and outlier detection. IEEE Transactions on Neural Networks and Learning Systems, 2016, 29(1): 51-62.
15 Ahmed M, Mahmood A N, Hu J K. A survey of network anomaly detection techniques. Journal of Network and Computer Applications, 2016, 60: 19-31.
16 Salehi M, Rashidi L. A survey on anomaly detection in evolving data: with application to forest fire risk prediction. ACM SIGKDD Explorations Newsletter, 2018, 20(1): 13-23.
17 Gao Z W, Cecati C, Ding S X. A survey of fault diagnosis and fault-tolerant techniques-Part I: fault diagnosis with model-based and signal-based approaches. IEEE Transactions on Industrial Electronics, 2015, 62(6): 3757-3767.
18 Chandola V, Banerjee A, Kumar V. Anomaly detection: a survey. ACM Computing Surveys, 2009, 41(3): 15.
19 Agrawal S, Agrawal J. Survey on anomaly detection using data mining techniques. Procedia Computer Science, 2015, 60: 708-713.
20 Lu J, Liu A J, Dong F, et al. Learning under concept drift: a review. IEEE Transactions on Knowledge and Data Engineering, 2018, doi: 10.1109/TKDE.2018.2876857.
21 Roweis S T, Saul L K. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000, 290(5500): 2323-2326.
22 Bifet A, Holmes G, Kirkby R, et al. MOA: massive online analysis. Journal of Machine Learning Research, 2010, 11: 1601-1604.
23 Elwell R, Polikar R. Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks, 2011, 22(10): 1517-1531.