南京大学学报(自然科学版) ›› 2024, Vol. 60 ›› Issue (1): 38–52.doi: 10.13232/j.cnki.jnju.2024.01.005

• • 上一篇    下一篇

面向大图的Top⁃Rank⁃K频繁模式挖掘算法

邹杰军, 王欣(), 石俊豪, 兰卓, 方宇, 张翀, 谢文波, 沈玲珍   

  1. 西南石油大学计算机科学学院,成都,610500
  • 收稿日期:2023-08-10 出版日期:2024-01-30 发布日期:2024-01-29
  • 通讯作者: 王欣 E-mail:xinwang@swpu.edu.cn
  • 基金资助:
    国家自然科学基金(62172102);四川省科技创新人才基金(2022JDRC0009)

Top⁃Rank⁃K frequent pattern mining algorithm for large graphs

Jiejun Zou, Xin Wang(), Junhao Shi, Zhuo Lan, Yu Fang, Chong Zhang, Wenbo Xie, Lingzhen Shen   

  1. School of Computer Science,Southwest Petroleum University,Chengdu,610500,China
  • Received:2023-08-10 Online:2024-01-30 Published:2024-01-29
  • Contact: Xin Wang E-mail:xinwang@swpu.edu.cn

摘要:

频繁模式挖掘(Frequent Pattern Mining,FPM)在社交分析中扮演重要角色,能从海量社交数据中挖掘用户行为的模式和规律,为社交网络的研究提供新的认识和决策支持.然而,对于一个FPM任务,设置一个合适的支持度阈值不容易;此外,FPM作为一个NP?hard问题,不存在多项式时间的算法.针对上述问题,提出一种无须用户设置初始支持度阈值的算法ItrMiner.该算法使用一种新的兴趣度指标对模式进行评估,综合考虑模式的大小和支持度,挖掘Top?Rank?K频繁模式.同时,为了解决去除初始支持度阈值后在算法剪枝阶段遇到的困难,提出基于树模式优先识别的策略和模式扩展约束策略,减少非必要候选模式的生成.在真实图和人工合成图数据集上进行了广泛的实验,证明ItrMiner在执行效率和可扩展性方面表现出色,尤其在稠密的数据集上,其时间开销仅为基线算法Top?K Graph Miner的13.2%.另外,提出的模式扩展约束策略的有效性较高,和无扩展约束优化的ItrMinernopt算法相比,效率提升最高可达9.5倍.

关键词: 频繁模式挖掘, 社交分析, 支持度阈值, 兴趣度

Abstract:

Frequent Pattern Mining (FPM) plays a crucial role in social analytics,which mines patterns and regularities about users' behaviour from vast amounts of social data,thereby provides new insights and decision support for research on social networks. However,it is not easy to set an appropriate support threshold for an FPM task. Moreover,as an NP?hard problem,FPM does not exist a polynomial?time algorithm. To address these problems,an algorithm that does not require an initial support threshold,denoted by ItrMiner,is proposed. In ItrMiner,a new metric which considers both size and support is proposed for measuring the interestingness of a pattern,and this metric is utilized to mine the Top?Rank?K frequent patterns. Meanwhile,to overcome the difficulty caused by the lack of initial support threshold,a pattern?recognition?based strategy and a pattern expansion strategy are proposed,which reduces excessive redundant candidate patterns. Extensive experiments on real and synthetic graphs show that ItrMiner performs well in terms of efficiency and scalability,especially on dense datasets,with a time overhead of only 13.2% of the baseline algorithm Top?K Graph Miner. Furthermore,the pattern expansion strategy is quite effective,which performs up to 9.5 times faster than the counterpart of ItrMinernopt without optimization.

Key words: frequent pattern mining, social analytics, support threshold, interestingness

中图分类号: 

  • TP311

图1

社交网络示例图和在不同支持度阈值设定下的频繁模式"

图2

样本图、模式、模式定义域及其模式匹配"

表1

本文使用的符号"

符号含义
G=V,E,L一个原始数据图
Q=Vp,Ep,fv一个模式
GsGGsG的子图
QQ'Q'Q的子模式
M(Q,G)QG中的匹配集
Q模式Q的大小
SupQ,GQG中的支持度
DQ,GQG中的域
Itr(Q)模式Q的兴趣度

图3

编码树生成和队列更新"

表2

种子模式的支持度,兴趣度和rank排名"

模式支持度兴趣度rank排名
Q62013.401
Q11610.722
Q41510.053
Q3106.704
Q7106.704
Q2106.704
Q553.355
Q810.676

图4

未带约束的模式扩展编码树"

表 3

实验使用的数据集"

数据集点集大小边集大小平均度数
Amazon410236335682416.36
MiCo100000108029821.61
YouTube154817105557213.63
Twitter81306176814943.49
CiteSeer331245912.77
PDB20226833568.24

图5

在六个数据集上k对各算法执行时间的的影响"

图6

在六个数据集上k对各算法内存消耗的的影响"

表4

Grami和ItrMiner算法在六个数据集上的执行时间 (s)"

算法AmazonMiCoYouTubeTwitterCiteSeerPDB
k=100k=300k=100k=300k=100k=300k=100k=300k=100k=300k=100k=300
Grami55.41217.0815.6959.3727.93123.113.2379.780.920.943.7858.61
ItrMiner36.0274.1917.3859.9512.6732.1215.64173.40.470.423.2841.79

表5

Grami和ItrMiner算法在六个数据集上的内存消耗 (GB)"

算法AmazonMiCoYouTubeTwitterCiteSeerPDB
k=100k=300k=100k=300k=100k=300k=100k=300k=100k=300k=100k=300
Grami3.816.801.963.461.673.831.812.780.210.210.741.58
ItrMiner3.304.971.733.021.483.162.033.260.180.181.131.98

表6

Grami和ItrMiner算法在六个数据集上的执行时间 (s)"

算法AmazonMiCoYouTubeTwitterCiteSeerPDB
阈值降低50阈值降低50阈值降低50阈值降低150阈值不变阈值降低20
k=100k=300k=100k=300k=100k=300k=100k=300k=100k=300k=100k=300
Grami61.90244.823.72101.131.93160.620.23266.70.920.945.78110.6
ItrMiner36.0274.1917.3859.9512.6732.1215.64173.40.470.423.2841.79

图7

算法的可扩展性"

图8

六个数据集上α对ItrMiner算法的影响"

1 Daud N N, Ab Hamid S H, Saadoon M,et al. Applications of link prediction in social networks:A review. Journal of Network and Computer Applications2020,166:102716.
2 Sabe V T, Ntombela T, Jhamba L A,et al. Current trends in computer aided drug design and a highlight of drugs discovered via computational techniques:A review. European Journal of Medicinal Chemistry2021,224:113705.
3 Xue Y, Klabjan D, Luo Y. Predicting ICU readmission using grouped physiological and medication trends. Artificial Intelligence in Medicine2019,95:27-37.
4 Yang T, Hou B N, Cai Z P,et al. 6Graph:A graph?theoretic approach to address pattern mining for Internet?wide IPv6 scanning. Computer Networks2022,203:108666.
5 Huan J, Wang W, Prins J,et al. SPIN:Mining maximal frequent subgraphs from graph databases∥Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle,WA,USA:ACM,2004:581-586.
6 Deng Z H. Fast mining Top?Rank?K frequent patterns by using node?lists. Expert Systems with Applications201441(4):1763-1768.
7 Huynh?Thi?Le Q, Le T, Vo B,et al. An efficient and effective algorithm for mining Top?Rank?K frequent patterns. Expert Systems with Applications201542(1):156-164.
8 Chen W, Liu J, Chen Z Y,et al. PBSM:An efficient top?k subgraph matching algorithm. International Journal of Pattern Recognition and Artificial Intelligence201832(6):1850020.
9 Natarajan D, Ranu S. A scalable and generic framework to mine top?k representative subgraph patterns∥2016 IEEE 16th International Conference on Data Mining. Barcelona,Spain:IEEE,2016:370-379.
10 Wang X, Tang L, Liu Y,et al. Diversified pattern mining on large graphs∥The 32nd International Conference on Database and Expert Systems Applications. Springer Berlin Heidelberg,2021:171-184.
11 Deng Z H, Fang G D. Mining Top?Rank?K frequent patterns∥2007 International Conference on Machine Learning and Cybernetics. Hong Kong,China:IEEE,2007:851-856.
12 Abdelaal A A, Abed S, Al?Shayeji M,et al. Customized frequent patterns mining algorithms for enhanced Top?Rank?K frequent pattern mining. Expert Systems with Applications2021,169:114530.
13 Goyal N, Jain S K. An efficient algorithm for mining Top?Rank?K frequent patterns from uncertain databases∥2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology. Bangalore,India:IEEE,2016:324-328.
14 Nguyen G, Le T, Vo B,et al. A new approach for mining Top?Rank?K erasable itemsets∥The 6th Asian Conference on Intelligent Information and Database Systems. Springer Berlin Heidelberg,2014:73-82.
15 Kuramochi M, Karypis G. Finding frequent patterns in a large sparse graph. Data Mining and Knowledge Discovery200511(3):243-271.
16 Elseidy M, Abdelhamid E, Skiadopoulos S,et al. GraMi:Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment20147(7):517-528.
17 Teixeira C H C, Fonseca A J, Serafini M,et al. Arabesque:A system for distributed graph mining∥Proceedings of the 25th Symposium on Operating Systems Principles. Monterey,CA,USA:ACM,2015:425-440.
18 Dias V, Teixeira C H C, Guedes D,et al. Fractal:A general?purpose graph pattern mining system∥Proceedings of 2019 International Conference on Management of Data. Amsterdam,Netherlands:ACM,2019:1357-1374.
19 Abdelhamid E, Abdelaziz I, Kalnis P,et al. Scalemine:Scalable parallel frequent subgraph mining in a single large graph∥Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis. Salt Lake City,UT,USA:IEEE,2016:716-727.
20 Chen H Z, Liu M, Zhao Y J,et al. G?miner:An efficient task?oriented graph mining system∥Proceedings of the 13th EuroSys Conference. Porto,Portugal:ACM,2018:32.
21 李玲,印莹,赵宇海,等. 基于解耦概要图的大规模图数据高效分布式挖掘算法. 计算机学报202043(7):1183-1198.
Li L, Yin Y, Zhao Y H,et al. An efficient distributed algorithm for large?scale graph data mining based on decoupled summary subgraph. Chinese Journal of Computers202043(7):1183-1198.
22 Wang X, Lan Z, He Y A,et al. A cost?effective approach for mining near?optimal top?k patterns. Expert Systems with Applications2022,202:117262.
23 Li Y H, Lin Q, Li R X,et al. TGP:Mining top?k frequent closed graph pattern without minimum support∥The 6th International Conference on Advanced Data Mining and Applications. Springer Berlin Heidelberg,2010:537-548.
24 Saha T K, Al Hasan M. FS3:A sampling based method for top‐k frequent subgraph mining. Statistical Analysis and Data Mining:The ASA Data Science Journal20158(4):245-261.
25 Fournier?Viger P, Cheng C, Lin J C W,et al. Tkg:Efficient mining of top?k frequent subgraphs∥The 7th International Conference on Big Data Analytics. Springer Berlin Heidelberg,2019:209-226.
26 Zeng J, Leong H U, Yan X,et al. Fast core?based top?k frequent pattern discovery in knowledge graphs∥2021 IEEE 37th International Conference on Data Engineering. Chania,Greece:IEEE,2021:936-947.
27 Bringmann B, Nijssen S. What is frequent in a single graph?∥The 12th Pacific?Asia Conference on Knowledge Discovery and Data Mining. Springer Berlin Heidelberg,2008:858-863.
28 Cordella L P, Foggia P, Sansone C,et al. A (sub) graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence200426(10):1367-1372.
29 Leskovec J, Adamic L A, Huberman B A. The dynamics of viral marketing. ACM Transactions on the Web20071(1):5.
30 Cheng X, Dale C, Liu J C. Statistics and social network of YouTube videos∥2008 16th Interntional Workshop on Quality of Service. Enschede,Netherlands:IEEE,2008:229-238.
31 Mcauley J, Leskovec J. Learning to discover social circles in ego networks∥Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe,NV,USA:Curran Associates Inc.,2012:539-547.
32 Talukder N, Zaki M J. A distributed approach for graph mining in massive networks. Data Mining and Knowledge Discovery201630(5):1024-1052.
[1]  韦素云1**,业宁1,吉根林2,张丹丹1,殷晓飞1
.  基于项目类别和兴趣度的协同过滤推荐算法*[J]. 南京大学学报(自然科学版), 2013, 49(2): 142-149.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!