南京大学学报(自然科学版) ›› 2013, Vol. 49 ›› Issue (2): 266–234.

• • 上一篇    下一篇

 带通配符的多序列模式挖掘*

 马晓文1,胡学钢1**,谢飞1·2,郭丹1   

  • 出版日期:2015-10-27 发布日期:2015-10-27
  • 作者简介: (1.合肥工业大学计算机与信息学院,合肥,230009:2.合肥师范学院计算机科学与技术系,合肥,230601)
  • 基金资助:
     国家"863”计划(2012AA011005),国家自然科学基金(60975034),安徽省自然科学基金(11040606 M134)

 Mining sequential patterns across multiple sequences with gap constraints

 Ma Xiao-Wen1,Hu Xue-Gang1,Xie Fei1’2,Guo Dan1   

  • Online:2015-10-27 Published:2015-10-27
  • About author: (1. School of Computer Science and Information Engineering, Hefei University of Technology,Hefei , 230009 , China;
    2. Department of Computer Science and Technology, Hefei Normal University, Hefei,230601 ,China)

摘要:  带有通配符的多序列模式挖掘在文木检索、网络安全、生物科学等领域中具有很重要的作用.
通过挖掘多序列模式,能够透彻的了解序列之间的联系,在各个领域中具有重要的现实意义.在己有的
工作中,随着多序列集长度的增大,挖掘的规模呈现指数级增长.研究这样一个问题:给定多条序列s1
…,sn,支持度阂值和间隔约束,从多序列中挖掘所有出现次数不小于给定支持度阂值的频繁序列模式,
并且要求模式中任意两个相邻元素在序列中的出现位置满足用户定义的间隔约束.设计了一个有效的
算法M-OneOffMine,模式在序列中的出现满足one-off条件.在生物DNA序列上的实验结果表明,
M-OneOffMine算法比相关的序列模式挖掘算法具有更好的时间性能.

Abstract:  Minim multi-sequential patterns with gap constraints is an important research task in many domains,
such as text retrieval,network security, and biological scicnce.ln the previous work,with the length of the multi-se-
qucnce increasing, the mining scale presents exponential increasing, and those algorithms merely mined patterns with
the limited length. Uiven the sequences s1,…,sn,a certain threshold ,and gap constraints,we aim to discover frequent
patterns whose supports in multiple sequence arc no less than the given threshold value.There arc flexible wildcards
in pattern P,and the number of the wildcards between any two successive elements of P fulfills the user-specified
gap constraints.ln this paper, we design an efficient mining algorithm, named M-OneOffMine that satisfies the once
off condition under which each character in the given sequence can be used at most once in all occurrences of a pat-
tern.The experiments on DNA sequences show that M-OncOffMinc has better time performances than the related al-
gorithms.The time and space complexities of M-OneOffMine arc respectively Othe number of frequent patterns,k is the number of clement sequcnces,n is the length of the pattern,l is the length
of the multiple sequcnce,and w is the flexibility of the gap constraint.

[1]Oaten T,Cohen P. Searching for structure in mul tiple streams of data. Proceedings of 13th lnterna
tional Conference on Machine Learning. Bari,Ita 1y,1996,346一354.
[2]Herzel H,Weiss O,Trifonov E. 10一11 bp perio- dicities in complete genomes reflect protein strur
ture and DNA folding. Bioinformatics, 1999,15 (3).187一193.
[3]Jonassen I. Efficient discovery of conserved pat- terns using a pattern graph. Computer Applica-
tions in the Biosciences:CABIOS, 1997,13:509 ~522.
[4]Rigoutsos I,Floratos A. Combinatorial pattern dis covery in biological sequences;The TFIRESIAS al
gorithm. Bioinlormatics, 1998,14(1):55一67.
[5]Xie F, Wu X D, Hu X G, et al. Sequential pattern mining with wildcards. Proceedings of the 22nd In-
ternational Conference on Tools with Artificial Intelligence. Arras,France,2010,241一247.
[6]Belkum A Van,Scherer S,Leeuwen W Van. Vari- able number of tandem repeats in clinical strains
of Hacmophilus influenzae. Infection and Immuni- ty,1997,5017一5027.
[7]Huang Y M, Wu X D, Hu X G, et al. Mining fre quent patterns with gaps and one-off condition. Pro-
ceedings of the international Conference on Compu- tational Science and Engineering. Vancouver, BC, USA,2009,180一186.
[8]Gao Y. Press of data mining in China. Journal of  Nanjing University(Natural Sciences),2011,47(4):
351 -353.(高阳.中国数据挖掘研究进展.南京大学学报(自然科学),2011,47(4):351-353).
[9]Agrawal R,Srikant R. Mining sequential pat  terns. Proceedings of the 11th international Con
ference on Data Engineering. Taipei,Taiwan 1995,3一14.
[10]Zou X, Zhang W, Liu Y, et al. Study on distrihuted sequential pattern discovery algorithm. Journal of
Software,2005,16(7):1262一1269.(邹祥,张巍,刘洋等.分布式序列模式发现算法的研究. 软件学报,2005,16(7):1262 -1269).
[11]Agrawal R, Srikant R. Fast algorithms for mining association rules. Proceedings of the international
Conference on Very Large Databases. Santiago,Chil- e; Morgan Kaufmann Publishing lnc, San Matco, CA,USA, 1994 ,487一499.
[12]Han J,Pei J,Mortazavi-Asl B, et al. FreeSpan;Frequent pattern-projected sequential pattern
mining. Proceedings of the international Confer once on Knowledge Discovery and Data Mining Boston, MA,2000,355一359.
[13]Pei J,Han J,Mortazavi-Asl B,et al. PrefixSpan; Mining sequential patterns efficiently by prefix-
projected pattern growth. Proceedings of the 17th IEEE international Conference on Data Engineer-
ing. Heidelberg, Germany; institute of Electrical and Electronics Engineers Computer Society, 2001,215一224.
[14]Zhang M H,Kao B,Cheung D,et al. Minim peri odic patterns with gap requirement from se- quences. Proceedings of the ACM SIUMOD In ternational Conference on Management of Data. Baltimore,MD,Unitcd States,2005,623一633.
[15]He Y, Wu X D, Zhu X Q. Mining frequent patterns with wildcards from biological sequences. Proceed-
ings of the IEEE international Conference on lnfor- mation Reuse and lntcgration,Las Vegas,NV,USA, 2007,329一334.
[16]Zhu X Q, Wu X D. Mining complex patterns across sequences with gap requirements. Proceeding of the
20th International Joint Conference on Artificial In telligence. Hyderabad,2007,726一735.
[17]Ding B, Lo D, Han J,et al. Efficient minim of closed repetitive gapped subsequences from a se  quence database. Proceedings of the 25th Interna  tional Conference on Data Engineering. Shanghai China,2009,1024一1035.
[18]Chen G,Wu X D,Zhu X Q. Efficient string matc-hing with wildcards and length constraints.
Knowledge and Information Systems,2006,10(4):339~419.
[19]http//www. ncbi. nlm. nih. gov.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!