南京大学学报(自然科学版) ›› 2019, Vol. 55 ›› Issue (6): 984–999.doi: 10.13232/j.cnki.jnju.2019.06.011

• • 上一篇    下一篇

一种基于相对熵的随机游走相似性度量模型

郑文萍1,2,3(),刘韶倩1,2,3,穆俊芳1,2,3   

  1. 1. 山西大学计算机与信息技术学院,太原,030006
    2. 计算智能与中文信息处理教育部重点实验室,山西大学,太原,030006
    3. 山西大学大数据科学与产业研究院,太原,030006
  • 收稿日期:2019-07-17 出版日期:2019-11-30 发布日期:2019-11-29
  • 通讯作者: 郑文萍 E-mail:wpzheng@sxu.edu.cn
  • 基金资助:
    山西省自然科学基金(201801D121123);山西省回国留学人员科研基金(2017?014)

A random walk similarity measure model based on relative entropy

Wenping Zheng1,2,3(),Shaoqian Liu1,2,3,Junfang Mu1,2,3   

  1. 1. School of Computer and Information Technology,Shanxi University,Taiyuan,030006,China
    2. Key Laboratory of Computation Intelligence and Chinese Information Processing,Ministry of Education,Shanxi University,Taiyuan,030006,China
    3. Research Institute of Big Data Science and Industry,Shanxi University,Taiyuan,030006,China
  • Received:2019-07-17 Online:2019-11-30 Published:2019-11-29
  • Contact: Wenping Zheng E-mail:wpzheng@sxu.edu.cn

摘要:

针对基于随机游走的节点相似性度量模型中存在的大度节点依赖问题,从信息论的角度提出了一种改进的随机游走节点相似性度量

方法

基于相对熵的随机游走相似性度量方法RE?model (A random walk similarity measure model based on Relative Entropy).首先根据随机游走模型得到网络中节点的转移概率向量,再计算两个节点转移概率向量的相对熵得到该节点对的相似性.由于转移概率向量给出了从一个特定节点出发经过多步随机游走后到达网络其他所有节点的概率,导致网络中的每个节点在计算相对熵的过程中都被等同看待,并且网络规模的增大会使计算得到的节点间相似性耗时更多且存在较大偏差.根据节点经过多步随机游走后到达网络中影响力较大的节点的转移概率来构造该节点的转移概率分布,计算两个节点的转移概率分布的相对熵以得到网络中节点对之间的差异分数,进而得到网络节点间的相似性矩阵.RE?model度量方法降低了传统随机游走相似性度量对于大度节点的依赖性.通过在真实网络数据集上的实验表明,RE?model算法在对称性、网络传播及社区发现等方面表现良好.

关键词: 复杂网络, 相对熵, 节点相似性度量, 随机游走

Abstract:

When measuring nodes similarity based on the traditional random walk similarity metrics,nodes tend to be similar to nodes with large degrees. In order to solve the problem,a random walk similarity measure model based on relative entropy is proposed from the perspective of information theory,abbreviated as RE?model. Firstly,we generate a transition probability set for each node based on random walk measures and calculate the relative entropy between the probability sets of each pair of nodes,then we obtain similarity of each node pair. Since the transition probability sets contain the probability of reaching all nodes after a certain number of steps from the target node,it might be more time?consuming with the network scale increasing and may result in evident deviation when calculating the relative entropy between the transition probability sets. Therefore,we construct the transition probability distribution for each node based on the transition probability of reaching nodes with large degree after a few steps in random walk,and calculate relative entropy between probability distribution of each pair of node to measure nodes similarity. RE?model method reduces their sensitive dependence to nodes with large degree. We compare traditional random walk measures over 22 real network datasets,and show the proposed RE?model measure outperforms the other measures in terms of symmetry,network spreading and community detection on most networks.

Key words: complex network, relative entropy, node similarity, random walk

中图分类号: 

  • TP181

表1

真实网络上基于不同随机游走的相似性度量的对称性和时间对比"

Les socfb?Haverford76 socfb?Amherst41
MS Time (s) MS Time (s) MS Time (s)
RW 0.0260 0.4560 0.0014 26.3472 0.0018 78.6891
BRW 0.0779 0.3298 0.0028 222.5838 0.0027 587.5432
RWR 0.4415 0.2911 0.1715 4.3054 0.1861 15.8577
MERW 0.0260 0.9219 0.0014 12.0781 0.0009 50.7344
LRW 0.0260 0.2678 0.0055 4.5206 0.0107 20.9674

表2

真实网络上基于RE?model的不同随机游走相似性度量的对称性和时间对比"

Les socfb?Haverford76 socfb?Amherst41
r=n (116) r=20 r=n(1446) r=100 r=n (2235) r=100
MS Time (s) MS Time (s) MS Time(s) MS Time (s) MS Time (s) MS Time (s)
RE_RW 0.3750 3.0043 0.4821 0.6661 0.5349 2825.2925 0.5809 297.1971 0.5306 9677.3236 0.5682 584.6638
RE_BRW 0.5536 2.7627 0.5714 0.6036 0.42446 2830.4139 0.4620 480.1075 0.4098 10702.9914 0.4268 1181.6620
RE_RWR 0.4643 1.6344 0.5536 0.5104 0.2638 3077.7294 0.2669 207.5168 0.2683 3077.7294 0.2747 483.1988
RE_MERW 0.3036 1.4926 0.5000 0.9682 0.0968 2594.9364 0.3887 234.6257 0.0986 2594.9364 0.3911 592.7081
RE_LRW 0.4156 0.7364 0.4156 0.4212 0.3112 2526.1862 0.3790 193.4195 0.2371 9401.5379 0.4018 456.9157

图1

Karate网络"

图2

Karate网络的转移概率矩阵"

表3

每个节点的概率分布"

vi p 3 ( v i ) vi p 3 ( v i )
1 [0.03280.13630.02740.15360.15090.20200.13860.1584] 18 [0.06000.24590.04040.09510.30920.16700.00820.0742]
2 [0.04080.13750.03770.18810.12860.26270.08620.1186] 19 [0.30010.04600.34750.08720.06700.02680.03930.0860]
3 [0.06830.12780.12140.07840.17180.18460.07490.1728] 20 [0.13820.18420.11270.10980.23620.12360.02550.0698]
4 [0.05120.16290.03500.17890.23240.17270.06090.1060] 21 [0.30010.04600.34750.08720.06700.02680.03930.0860]
5 [0.03070.38540.02250.08760.16960.16850.05640.0792] 22 [0.06000.24590.04040.09510.30920.16700.00820.0742]
6 [0.02680.46320.01970.07650.14810.14720.04920.0692] 23 [0.30010.04600.34750.08720.06700.02680.03930.0860]
7 [0.02680.46320.01970.07650.14810.14720.04920.0692] 24 [0.24690.04500.27740.05420.04650.03140.19090.1077]
8 [0.06860.18340.03050.15850.21800.23730.03810.0656] 25 [0.10250.03630.13330.05760.02840.04260.49280.1065]
9 [0.18460.12510.18020.16410.10280.10390.06420.0751] 26 [0.13710.02490.07600.16490.00830.01250.48660.0897]
10 [0.22070.06890.14490.23300.09590.09350.09130.0517] 27 [0.32420.03410.23570.11640.05270.02020.10310.1136]
11 [0.03070.38540.02250.08760.16960.16850.05640.0792] 28 [0.20510.05840.13420.21250.05880.05730.21550.0581]
12 [0.05570.34150.04090.09760.23960.20360.00000.0210] 29 [0.17670.04840.10610.20600.07340.07490.24190.0726]
13 [0.03830.23510.02770.11760.18570.33460.01260.0484] 30 [0.26360.02930.31310.08950.04270.01710.13090.1138]
14 [0.11280.15880.07470.15540.19210.19890.04270.0646] 31 [0.17590.08030.20210.08080.17670.07610.04870.1593]
15 [0.30010.04600.34750.08720.06700.02680.03930.0860] 32 [0.19300.12520.22640.08130.08550.06830.14690.0734]
16 [0.30010.04600.34750.08720.06700.02680.03930.0860] 33 [0.13420.02840.10240.15120.04280.04500.25960.2364]
17 [0.00000.43930.00000.07530.08370.12550.12550.1506] 34 [0.10490.03670.14490.09180.05010.07100.23900.2615]

图3

Karate网络的差异矩阵"

图4

Karate网络的相似矩阵"

表4

Karate网络中基于不同相似性度量的最相似节点对比表"

vi RW RE_RW RWR RE_RWR BRW RE_BRW MERW RE_MERW LRW RE_LRW
1 3 13 12 5 2 8 3 2 3 2
2 1 4 18 18 1 4 1 4 1 1
3 1 20 10 10 1 14 1 17 1 14
4 1 8 13 13 1 18 1 8 1 8
5 1 11 11 6 1 11 1 11 1 11
6 1 7 17 5 1 7 1 7 1 7
7 1 6 17 1 1 6 1 6 1 6
8 1 4 4 14 1 14 1 4 1 4
9 34 31 31 31 33 31 34 32 34 10
10 34 31 3 28 34 17 34 28 34 9
11 1 5 5 12 1 5 1 5 1 5
12 1 13 1 11 1 11 1 6 1 18
13 1 18 4 8 1 2 1 18 1 8
14 1 20 4 8 1 8 1 20 1 8
15 34 16 34 16 34 16 34 16 34 16
16 34 15 34 15 34 15 34 15 34 15
17 6 7 6 1 2 1 1 1 6 6
18 1 22 2 22 1 22 1 22 1 22
19 34 21 34 21 34 21 34 21 34 21
20 34 14 2 14 1 17 1 14 1 1
21 34 23 34 19 34 23 34 23 34 23
22 1 18 26 18 1 18 1 18 1 18
23 34 19 26 19 34 19 34 19 34 19
24 34 27 25 30 34 30 34 30 34 30
25 34 26 30 26 33 26 34 34 26 26
26 24 25 25 25 33 25 24 25 24 25
27 34 30 30 34 34 30 34 30 34 30
28 34 34 25 10 34 29 34 29 34 29
29 34 32 32 28 34 28 34 28 34 28
30 34 27 27 15 34 24 34 24 34 27
31 34 10 9 24 34 32 34 32 34 9
32 34 29 25 26 34 31 34 9 34 9
33 34 34 15 15 34 28 34 34 34 34
34 1 33 27 27 33 15 33 25 33 33

表5

真实网络数据集"

Dataset |V| |E| ΔG

参考

社区数

Karate 34 78 17 2
Dolphins 62 159 12 2
Les 77 254 36 -
Polbooks 105 441 25 3
Football 115 613 12 12
NetScience 379 914 34 -
Email 1133 5451 71 -
Yeast 1458 1948 56 -
Gavin 1727 7534 48 -
socfb?Reed98 962 18812 313 -
socfb?Haverford76 1446 59589 375 -
socfb?Amherst41 2235 90954 467 -
socfb?Trinity100 2613 111996 404 -
socfb?Oberlin44 2920 89912 478 -
socfb?Smith60 2970 151747 349 -
socfb?Wellesley22 2970 94899 746 -
socfb?Vassar85 3068 119161 482 -
socfb?Colgate88 3482 155043 773 -
socfb?Santa74 3578 151747 1129 -
socfb?Howard90 4047 204850 1215 -
socfb?MIT 6402 251230 708 -
socfb?William77 6472 266378 1124 -

图5

socfb_Reed98网络中基于五种现有相似性度量所对应的散点图"

图6

socfb_Reed98网络中基于RE?model的五种相似性度量所对应的散点图"

表6

Facebook网络中不同相似性度量的对称性对比表"

Dataset RW RE_RW BRW

RE_

BRW

RWR

RE_

RWR

MERW

RE_

MERW

LRW

RE_

LRW

socfb?Reed98 0.0021 0.5114 0.0021 0.4449 0.2997 0.3472 0.0021 0.3763 0.0166 0.3347
socfb?Haverford76 0.0014 0.5809 0.0028 0.4620 0.1715 0.2669 0.0014 0.3887 0.0055 0.3790
socfb?Amherst41 0.0018 0.5682 0.0027 0.4268 0.1861 0.2747 0.0009 0.3911 0.0107 0.4018
socfb?Trinity100 0.0015 0.5779 0.0031 0.4554 0.1768 0.2985 0.0015 0.4118 0.0100 0.3827
socfb?Oberlin44 0.0007 0.5390 0.0021 0.4027 0.2384 0.2959 0.0007 0.3678 0.0144 0.3589
socfb?Smith60 0.0020 0.5347 0.0108 0.4027 0.2182 0.3044 0.0007 0.3872 0.0229 0.3576
socfb?Wellesley22 0.0013 0.5455 0.0020 0.3845 0.2000 0.2990 0.0007 0.3475 0.0189 0.3380
socfb?Vassar85 0.0013 0.5847 0.0026 0.3859 0.1669 0.2894 0.0007 0.3494 0.0065 0.2960
socfb?Colgate88 0.0006 0.5526 0.0017 0.4394 0.1722 0.3056 0.0006 0.4120 0.0045 0.3785
socfb?Santa74 0.0006 0.5735 0.0052 0.4125 0.1648 0.3181 0.0006 0.3728 0.0046 0.3605
socfb?Howard90 0.0005 0.5441 0.0010 0.3988 0.2066 0.2392 0.0005 0.3113 0.0074 0.3020
socfb?MIT 0.0009 0.5386 0.0012 0.3714 0.2755 0.3302 0.0003 0.3246 0.0244 0.3083
socfb?William77 0.0015 0.5309 0.0046 0.3322 0.1851 0.3044 0.0003 0.2960 0.0111 0.2979

图7

Facebook网络中节点感染能力的方差对比图"

图8

Facebook网络中节点恢复能力的方差对比图"

表7

有标签真实网络中不同相似性度量的社区发现实验结果"

Dataset Index RW RE_RW RWR

RE_

RWR

MERW

RE_

MERW

LRW

RE_

LRW

BRW

RE_

BRW

Karate k 5 4 4 2 2 11 4 3 4 2
NMI 0.5154 0.6657 0.5144 0.7308 0.6766 0.4527 0.7236 0.8384 0.6201 1.0000
ARI 0.3015 0.6372 0.3897 0.7718 0.7716 0.1808 0.5414 0.9027 0.4841 1.0000
Dolphins k 13 13 13 26 30 13 7 5 6 4
NMI 0.4876 0.4585 0.4195 0.4301 0.4311 0.4585 0.4594 0.5426 0.5983 0.6409
ARI 0.1881 0.1025 0.1345 0.0859 0.0620 0.1025 0.2584 0.2799 0.2719 0.4902
Polbooks k 7 3 6 3 3 4 4 3 3 4
NMI 0.4916 0.5561 0.5227 0.5606 0.5695 0.5455 0.5381 0.5405 0.5234 0.5348
ARI 0.3444 0.6628 0.4552 0.6717 0.6793 0.6412 0.6288 0.6389 0.5040 0.6698
Football k 12 11 8 13 13 13 10 10 14 13
NMI 0.8910 0.8326 0.7561 0.8035 0.8860 0.8464 0.8879 0.8754 0.9259 0.9321
ARI 0.8273 0.7188 0.5745 0.6507 0.8280 0.7449 0.8154 0.7831 0.8728 0.8984

表8

无标签真实网络中不同相似性度量的社区发现实验结果 on real networks without labels"

Dataset Index RW RE_RW RWR

RE_

RWR

MERW

RE_

MERW

LRW

RE_

LRW

BRW

RE_

BRW

Les Q 0.3065 0.3076 0.3171 0.3482 0.3547 0.3076 0.3975 0.4637 0.4485 0.4652
NetScience Q 0.7874 0.8076 0.8040 0.7754 0.6820 0.7823 0.8076 0.8606 0.8126 0.8412
Email Q 0.4972 0.4026 0.3774 0.3753 0.5017 0.4044 0.4026 0.4619 0.4187 0.4836
Yeast Q 0.8015 0.6872 0.8509 0.8294 0.8422 0.8030 0.6872 0.8030 0.7377 0.8030
Gavin Q 0.4524 0.5565 0.5172 0.4486 0.5445 0.5565 0.4725 0.5565 0.4009 0.5565
1 Zhu X Y , Yang X M , Ying C Z ,et al . A new classification algorithm recommendation method based on link prediction. Knowledge?Based Systems,2018,159:171-185.
2 Zhang J P , Ding X Y , Yang J . Revealing the role of node similarity and community merging in community detection. Knowledge?Based Systems,2019,165:407-419
3 Ding J Y , Jiao L C , Wu J S ,et al . Prediction of missing links based on community relevance and ruler inference. Knowledge?Based Systems,2016,98:200-215.
4 Wang Z Q , Liang J Y , Li R . A fusion probability matrix factorization framework for link prediction. Knowledge?Based Systems,2018,159:72-85.
5 Bastami E , Mahabadi A , Taghizadeh E . A gravitation?based link prediction approach in social networks. Swarm and Evolutionary Computation,2019,44:176-186.
6 汪小帆,李翔,陈关荣 . 网络科学导论. 北京:高等教育出版社,2012,397.
Wang X F , Li X , Chen G R . Network science:an introduction. Beijing:Higher Education Press,2012,397.
7 Newman M E J . 网络科学引论. 郭世泽,陈哲译. 北京:电子工业出版社,2014,496.
8 Jaccard P . étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles,1901,37(142):547-579.
9 Salton G , Mcgill M J . Introduction to modern information retrieval. New York:Mcgraw?Hill Book Co.,1983,120-127.
10 Katz L . A new status index derived from sociometric analysis. Psychometrika,1953,18(1):39-43.
11 Lü L Y , Zhou T . Link prediction in complex networks:a survey. Physica A:Statistical Mechanics and its Applications,2011,390(6):1150-1170.
12 Martínez V , Berzal F , Cubero J C . A survey of link prediction in complex networks. ACM Computing Surveys,2016,49(4):69.
13 Pearson K . The problem of the random walk. Nature,1905,72(1865):294.
14 Zhou H J , Lipowsky R . Network Brownian motion:a new method to measure vertex?vertex proximity and to identify communities and subcommunities∥The 4th international conference on computational science. Springer Berlin Heidelberg,2004:1062-1069.
15 Javed M A , Younis M S , Latif S ,et al . Community detection in networks:A multidisciplinary review. Journal of Network and Computer Applications,2018,108:87-111.
16 Tong H H , Faloutsos C , Pan J Y . Fast random walk with restart and its applications∥The 6th international conference on data mining. Hong Kong,China:IEEE,2006:613-622.
17 Burda Z , Duda J , Luck J M ,et al . Localization of the maximal entropy random walk. Physical Review Letters,2012,102(16):160602.
18 Liu W P , Lü L Y . Link prediction based on local random walk. Europhysics Letters,2010,89(5):58007.
19 Tan F , Xia Y X , Zhu B Y . Link prediction in complex networks:a mutual information perspective. PLOS One,2014,9(9):e107056.
20 Zhang Q , Li M Z , Deng Y . Measure the structure similarity of nodes in complex networks based on relative entropy. Physica A:Statistical Mechanics and its Applications,2018,491:749-763.
21 George M , Jafarpour S , Bullo F . Markov chains with maximum entropy for robotic surveillance. IEEE Transactions on Automatic Control,2019,64(4):1566-1580.
22 Kullback S , Leibler R A . On information and sufficiency. The Annals of Mathematical Statistics,1951,22(1):79-86.
23 Albert R , Jeong H , Barabási A L . Diameter of the world?wide web. Nature,1999,401(6):130-131.
24 Newman M E . Spread of epidemic disease on networks. Physical Review E,2002,66(1):016128.
25 Danon L , Díaz?Guilera A , Duch J ,et al . Comparing community structure identification. Journal of Statistical Mechanics:Theory and Experiment,2005,2005(9):09008.
26 Rand W M . Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association,1971,66(336):846-850.
27 Newman M E J , Girvan M . Finding and evaluating community structure in networks. Physical Review E,2004,69(2):026113.
28 Pons P , Latapy M . Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications,2006,10(2):191-218.
[1] 刘胜久,李天瑞,珠杰,刘佳. 带权图的多重分形研究[J]. 南京大学学报(自然科学版), 2020, 56(1): 85-97.
[2] 马娜, 范敏, 李金海. 复杂网络下的概念认知学习[J]. 南京大学学报(自然科学版), 2019, 55(4): 609-623.
[3] 钱付兰, 黄鑫, 赵姝, 张燕平. 基于路径相互关注的网络嵌入算法[J]. 南京大学学报(自然科学版), 2019, 55(4): 573-580.
[4] 钱 峰1,2,张 蕾1,2,赵 姝1*,陈 洁1,张燕平1. 基于加权树的层次社团划分算法[J]. 南京大学学报(自然科学版), 2018, 54(4): 696-.
[5] 朱 尧, 朱启海, 毛晓蛟, 杨育彬. 基于有监督显著性检测的目标跟踪[J]. 南京大学学报(自然科学版), 2017, 53(4): 747-.
[6]  张泽华1*,段力畑1,段 富1,张 楠2.  基于局部结构特征的重叠社区挖掘研究进展[J]. 南京大学学报(自然科学版), 2017, 53(3): 537-.
[7] 宾 晟*,孙更新. 基于多子网复合复杂网络模型的多关系社交网络重要节点发现算法[J]. 南京大学学报(自然科学版), 2017, 53(2): 378-.
[8] 梁晋1,2,梁吉业1,2*,赵兴旺1,2. 一种面向大规模社会网络的社区发现算法[J]. 南京大学学报(自然科学版), 2016, 52(1): 159-166.
[9] 曹江中1*,陈 佩2,戴青云3,凌永权1. 基于Markov随机游走的谱聚类相似图构造方法[J]. 南京大学学报(自然科学版), 2015, 51(4): 772-780.
[10] 施静静,张鹏,阮雅端,陈启美*. 多媒体信息网络相似度计算方法研究[J]. 南京大学学报(自然科学版), 2015, 51(2): 290-296.
[11] 张燕平1,2,汪  洋1,2,赵  姝1,2, 段  震1,2**. 基于覆盖的社团发现算法[J]. 南京大学学报(自然科学版), 2013, 49(5): 539-545.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 缪长健, 施斌, 郑兴, 王湛, 魏广庆. 海上超长PHC管桩BOFDA内力测试[J]. 南京大学学报(自然科学版), 2018, 54(6): 1057 -1063 .
[2] 段新春,施 斌,孙梦雅,魏广庆,顾 凯,冯晨曦. FBG蒸发式湿度计研制及其响应特性研究[J]. 南京大学学报(自然科学版), 2018, 54(6): 1075 -1084 .
[3] 许 林,张 巍*,梁小龙,肖 瑞,曹剑秋. 岩土介质孔隙结构参数灰色关联度分析[J]. 南京大学学报(自然科学版), 2018, 54(6): 1105 -1113 .
[4] 胡 淼,王开军,李海超,陈黎飞. 模糊树节点的随机森林与异常点检测[J]. 南京大学学报(自然科学版), 2018, 54(6): 1141 -1151 .
[5] 孔 颉, 孙权森, 纪则轩, 刘亚洲. 基于仿射不变离散哈希的遥感图像快速目标检测新方法[J]. 南京大学学报(自然科学版), 2019, 55(1): 49 -60 .
[6] 胡 太, 杨 明. 结合目标检测的小目标语义分割算法[J]. 南京大学学报(自然科学版), 2019, 55(1): 73 -84 .
[7] 徐秀芳, 徐 森, 花小朋, 徐 静, 皋 军, 安 晶. 一种基于t-分布随机近邻嵌入的文本聚类方法[J]. 南京大学学报(自然科学版), 2019, 55(2): 264 -271 .
[8] 谢探春, 王国兵, 尹 颖, 郭红岩. 柳树对镉-芘复合污染土壤的修复潜力与耐受性研究[J]. 南京大学学报(自然科学版), 2019, 55(2): 282 -290 .
[9] 曹 群,陈蓓蓓,宫辉力,周超凡,罗 勇,高明亮,王 旭,史 珉,赵笑笑,左俊杰. 基于SBAS和IPTA技术的京津冀地区地面沉降监测[J]. 南京大学学报(自然科学版), 2019, 55(3): 381 -391 .
[10] 吕海敏,沈水龙,严学新,史玉金,许烨霜. 上海地面沉降对轨道交通安全运营风险评估[J]. 南京大学学报(自然科学版), 2019, 55(3): 392 -400 .