Journal of Nanjing University (Natural Sciences), 2012, Vol. 48, Issue (2): 182–189.


 A novel support vector machine active learning strategy*

 Bai Long-Fei1, Wang Wen-Jian2**, Guo Hu-Sheng1

  • Online: 2015-05-28 Published: 2015-05-28
  • About author: (1. School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, China;
    2. Key Laboratory of Computational Intelligence and Chinese Information Processing
    of Ministry of Education, Shanxi University, Taiyuan, 030006, China)
  • Supported by:
     National Natural Science Foundation of China (80976035), Program for New Century Excellent Talents in University, Ministry of Education (NCET-07-0625), Doctoral Program Foundation of the Ministry of Education
    (20091401110003), Natural Science Foundation of Shanxi Province (2009011017-2), Shanxi Province Graduate Innovation Project (20103021)

Abstract:  This paper proposes a new active learning strategy for support vector machines (SVM), called Dix-SVMactive. Generally, the shorter the distance between a sample and the hyperplane, the more uncertain and informative the sample is, and thus the more valuable it is. Active learning is an iterative process, so convergence speed should also be considered. In this paper, a new confidence measure for samples is defined, and the most valuable samples are selected for manual labeling. The confidence of an unlabeled sample, which can be regarded as the value of the sample, is defined as the ratio of the mean distance between the sample and the labeled samples to the distance between the sample and the hyperplane. The mean distance to the labeled samples measures how redundant the sample is with respect to the labeled samples, and the distance to the hyperplane expresses the uncertainty of the sample. In general, the larger the former and the smaller the latter, the higher the confidence of the sample. Additionally, the labeled set obtained after each loop may be unbalanced, which means the hyperplane may lie somewhat far from one class of samples and closer to the other. In this situation, under the proposed selection approach, the selected samples close to the hyperplane will outnumber those far from it, which may lead to poor generalization performance. To avoid an unbalanced dataset, after each loop the proposed algorithm tests the balance degree of the dataset, defined as the ratio of the number of minority-class samples to the number of majority-class samples. When the ratio is not greater than a given threshold, the dataset is regarded as unbalanced. In that case, some majority-class samples are deleted by a strategy such as clustering so that the two classes contain equal numbers of samples. In this way, the balance degree of the selected dataset is adjusted during each iteration to obtain good generalization ability. In summary, the confidence of each sample is computed first, and the top few samples in descending order of confidence are added to the training dataset. Finally, the balance of the training dataset is adjusted in each loop. Experimental results on University of California Irvine (UCI) benchmark datasets demonstrate that, compared with commonly used approaches, e.g., SVMactive based on random sampling and the Tong SVMactive approach, the proposed approach not only improves classification accuracy but also reduces the workload of manual labeling.
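The selection rule and balance check described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a linear hyperplane given by hypothetical parameters `w` and `b`, Euclidean distance between samples, and a small `eps` term to avoid division by zero. All function names are made up for this sketch, and the clustering-based deletion of majority-class samples is not shown.

```python
import math
from collections import Counter


def hyperplane_distance(x, w, b):
    """Distance from x to the hyperplane w.x + b = 0 (linear case, an assumption)."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w


def mean_distance_to_labeled(x, labeled):
    """Mean Euclidean distance from x to the labeled samples (redundancy measure)."""
    return sum(math.dist(x, s) for s in labeled) / len(labeled)


def confidence(x, labeled, w, b, eps=1e-12):
    """Ratio of mean distance to labeled samples over distance to the hyperplane.

    eps is an assumption of this sketch; it guards against a zero denominator.
    """
    return mean_distance_to_labeled(x, labeled) / (hyperplane_distance(x, w, b) + eps)


def select_top_k(unlabeled, labeled, w, b, k):
    """Return the k unlabeled samples with the highest confidence."""
    ranked = sorted(unlabeled, key=lambda x: confidence(x, labeled, w, b), reverse=True)
    return ranked[:k]


def balance_degree(labels):
    """Ratio of the minority-class count to the majority-class count."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


def is_unbalanced(labels, threshold):
    """The set counts as unbalanced when the ratio is not greater than the threshold."""
    return balance_degree(labels) <= threshold


if __name__ == "__main__":
    w, b = (1.0, 0.0), 0.0                      # hypothetical hyperplane x1 = 0
    labeled = [(2.0, 0.0), (-2.0, 0.0)]
    unlabeled = [(0.1, 3.0), (1.5, 0.0), (-0.2, 0.1)]
    print(select_top_k(unlabeled, labeled, w, b, k=2))
    print(is_unbalanced([1, 1, 1, 1, -1], threshold=0.5))
```

In this toy run, the two points nearest the hyperplane are ranked first because their small hyperplane distance dominates the ratio, and the 4-to-1 label set is flagged as unbalanced under the illustrative threshold of 0.5.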





