Cost-Sensitive Attribute Reduction under Attribute Group Order

Fig.1 The flowchart of our algorithm

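In the second step, choosing suitable attributes from a local group for the reduct is critical. Deleting groups one by one in attribute group order follows the user's semantic preference over attributes; inside a group, by contrast, attributes are chosen by statistical and objective evaluation, that is, by attribute significance. Core attributes in the group are selected first; then the attribute with the largest weighted significance is added to the reduct. Attribute significance is statistical information drawn from the information table, and cost is an objective criterion from the real world, so both can be used to evaluate the attributes in each group objectively. Zhang and Shen^[25], taking user requirements into account, designed an attribute reduction algorithm that weights attribute significance against cost (Attribute Reduction algorithm with Weight, ARWW), in which the cost significance of an attribute is determined by the ratio of its cost to the total cost. Under grouping, the cost significance of the attributes in a group also changes dynamically as the candidate attribute subset grows or shrinks. The relevant definitions of attribute significance are given below.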

Definition 9 (Attribute dependency^[8]) Given a decision table $DT$ and any $B \subseteq C$, the dependency of $D$ on $B$ is

$$\gamma_{B}(D) = \frac{|POS_{B}(D)|}{|U|} \tag{7}$$

where $|\cdot|$ denotes the cardinality of a set and $0 \leq \gamma_{B}(D) \leq 1$.

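Definition 10 (Inner attribute significance^[8]) Given a decision table $DT$, $B \subseteq C$ and any $a \in B$, the inner significance of attribute $a$ is

$$Sig_{inner}(a, B, D) = \gamma_{B}(D) - \gamma_{B - \{a\}}(D) \tag{8}$$

An attribute $a \in B$ is necessary in $B$ if and only if $Sig_{inner}(a, B, D) > 0$.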

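Definition 11 (Outer attribute significance^[8]) Given a decision table $DT$, $B \subseteq C$ and any $a \in C - B$, the outer significance of attribute $a$ is

$$Sig_{outer}(a, B, D) = \gamma_{B \cup \{a\}}(D) - \gamma_{B}(D) \tag{9}$$

Outer significance measures how much attribute $a$ improves the discernibility of the set $B$.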
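Definitions 9 to 11 can be checked numerically with a minimal sketch such as the one below. It uses the decision table given as Table 1 in this paper; the function names (`positive_region`, `gamma`, `sig_inner`, `sig_outer`) and the 1-based attribute indexing are illustrative choices, not part of the original method description.

```python
from fractions import Fraction

# Table 1 of this paper: columns a1..a5 followed by the decision D.
ROWS = [
    (2, 1, 1, 1, 1, 1),
    (1, 1, 1, 2, 1, 1),
    (1, 2, 0, 1, 2, 1),
    (1, 2, 1, 1, 1, 2),
    (0, 1, 2, 1, 1, 2),
    (2, 1, 1, 3, 2, 2),
    (2, 2, 1, 1, 2, 2),
]
C = frozenset({1, 2, 3, 4, 5})  # attribute k stands for a_k


def positive_region(B):
    """Indices of objects whose B-equivalence class carries one decision."""
    attrs = sorted(B)
    classes = {}
    for i, row in enumerate(ROWS):
        classes.setdefault(tuple(row[a - 1] for a in attrs), []).append(i)
    pos = set()
    for members in classes.values():
        if len({ROWS[i][-1] for i in members}) == 1:
            pos.update(members)
    return pos


def gamma(B):
    """Dependency of D on B, Eq. (7)."""
    return Fraction(len(positive_region(B)), len(ROWS))


def sig_inner(a, B):
    """Inner significance, Eq. (8)."""
    return gamma(B) - gamma(set(B) - {a})


def sig_outer(a, B):
    """Outer significance, Eq. (9)."""
    return gamma(set(B) | {a}) - gamma(B)


print(gamma(C))           # 1
print(gamma({2}))         # 0
print(sig_outer(1, {2}))  # 3/7
print(sig_inner(1, C))    # 0
```

Exact fractions are used so that the comparison $Sig_{inner} > 0$ is not affected by floating-point rounding.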

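Definition 12 (Cost significance) Given a decision table $DT$ and an attribute group order $S = G_{n} > G_{n-1} > \dots > G_{2} > G_{1}$, let $t_{i}$ be the cost of attribute $a_{i}$ and $T_{n}$ the total cost of the attributes in group $G_{n}$. For any $a_{i} \in G_{m}$, the cost significance of $a_{i}$ is

$$Sig_{cost}(a_{i}) = 1 - \frac{t_{i}}{\sum_{n > m} T_{n} + \sum_{n \leq m} t_{k}} \tag{10}$$

where $t_{k}$ is the cost of an attribute of a group $G_{n}$, $n \leq m$, that is added in the second step.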

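Cost significance captures the effect that an objective factor, the cost of acquiring an attribute, has on that attribute. For users, the higher the cost of an attribute, the lower its significance, because reducts with lower cost are usually preferred.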

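Combining the definitions above, the weighted significance is defined as

$$S(a_{i}, B) = \lambda\, Sig_{cost}(a_{i}) + (1 - \lambda)\, Sig_{outer}(a_{i}, B, D)$$

When $\lambda = 0$ the method considers attribute significance only, and when $\lambda = 1$ it considers cost significance only. The cost-sensitive attribute reduction algorithm under attribute group order (ARCSGO) is given next.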
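As a small illustration of how $\lambda$ trades cost against discernibility, the sketch below evaluates the weighted significance for one attribute. The inputs are taken from the worked example in this paper (attribute $a_{1}$ with cost 10 and outer significance $3/7$ with respect to $\{a_{2}\}$); the denominator 90 is an assumption, namely one reading of Eq. (10) in which it is the cost of the retained higher group plus the cost of the current group. The function names are hypothetical.

```python
def sig_cost(t_i, denom):
    """Cost significance, Eq. (10); `denom` is the cost total in the denominator."""
    return 1 - t_i / denom


def weighted_significance(lam, t_i, denom, outer):
    """S(a_i, B) = lam * Sig_cost + (1 - lam) * Sig_outer."""
    return lam * sig_cost(t_i, denom) + (1 - lam) * outer


outer_a1 = 3 / 7  # Sig_outer(a1, {a2}, D) on Table 1
for lam in (0.0, 0.5, 1.0):
    print(lam, round(weighted_significance(lam, 10, 90, outer_a1), 4))
# 0.0 0.4286
# 0.5 0.6587
# 1.0 0.8889
```

With $\lambda = 0.5$ the value is about 0.659, which agrees with the 0.65 reported for $a_{1}$ in the worked example if that figure was truncated to two decimals.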

Algorithm 2 Attribute Reduction Based on Cost Sensitive under Attribute Group Order (ARCSGO)

Input: decision table $DT$, attribute group order $S$, $\lambda$

Output: reduced attribute set $R$

(1) Let $R = C$;

(2) Following the group order $S$, delete the groups $G_{i}$, $1 \leq i \leq n$, from $R$ one at a time; whenever $POS_{R}(D) \neq POS_{C}(D)$, run the loop:

① for every attribute $a \in G_{i}$, compute $Sig_{inner}(a, C, D)$; add each attribute $a'$ with $Sig_{inner}(a', C, D) > 0$ to $R$ and remove it from $G_{i}$;

② if $POS_{R}(D) \neq POS_{C}(D)$, go to ③; otherwise exit the loop;

③ for every attribute $a \in G_{i}$, compute $S(a, R)$; add the attribute with the largest $S(a, R)$ to $R$ and remove it from $G_{i}$; repeat until $POS_{R}(D) = POS_{C}(D)$;

(3) For every attribute $a \in R$, check whether $Sig_{inner}(a, R, D) > 0$; keep $a$ if so, otherwise delete it;

(4) Return the reduct $R$.

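Complexity analysis: let $|U|$ and $|C|$ be the numbers of objects and condition attributes in the decision table, and let $|G|$ be the number of groups. Step (2) deletes the groups $G_{i}$ in turn and therefore runs $|G|$ times. Computing the inner significance in ① takes $O(|C| |U|^{2})$, computing the positive region in ② takes $O(|C| |U|^{2})$, and computing the outer significance in ③ takes $O(|C| |U|^{2})$, so step (2) takes $O(|G| |C| |U|^{2})$. Step (3) takes $O(|C| |U|^{2})$. The overall time complexity of the algorithm is therefore $O(|G| |C| |U|^{2})$.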

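An example illustrates how the algorithm runs.

Table 1 shows a decision table with five condition attributes $\{a_{1}, a_{2}, a_{3}, a_{4}, a_{5}\}$ and the decision attribute $\{D\}$; the costs of the condition attributes are $\{10, 30, 40, 20, 10\}$. Apply Algorithm 2 and suppose the user gives the attribute group order $S = G_{3} > G_{2} > G_{1}$ with $G_{3} = \{a_{2}\}$, $G_{2} = \{a_{1}, a_{3}, a_{5}\}$, $G_{1} = \{a_{4}\}$ and $\lambda = 0.5$. The reduction proceeds as follows:

Table 1 A decision table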

$a_{1}$	$a_{2}$	$a_{3}$	$a_{4}$	$a_{5}$	$D$
2	1	1	1	1	1
1	1	1	2	1	1
1	2	0	1	2	1
1	2	1	1	1	2
0	1	2	1	1	2
2	1	1	3	2	2
2	2	1	1	2	2

(1) Initialize $R = C$ and delete the attributes of group $G_{1}$ from $R$; now $POS_{R}(D) = POS_{C}(D)$.

(2) Delete the attributes of group $G_{2}$ from $R$; now $POS_{R}(D) \neq POS_{C}(D)$. Since $Sig_{inner}(a_{1}, C, D) = 0$, $Sig_{inner}(a_{3}, C, D) = 0$ and $Sig_{inner}(a_{5}, C, D) = 0$, the group contains no core attribute. Computing $S(a_{1}, G_{2}) = 0.65$, $S(a_{3}, G_{2}) = 0.56$ and $S(a_{5}, G_{2}) = 0.58$, attribute $a_{1}$ is added to $R$, after which $POS_{R}(D) \neq POS_{C}(D)$ still holds. Recomputing $S(a_{3}, G_{2}) = 0.42$ and $S(a_{5}, G_{2}) = 0.73$, attribute $a_{5}$ is added to $R$, and now $POS_{R}(D) = POS_{C}(D)$.

(3) Delete the attributes of group $G_{3}$ from $R$; now $POS_{R}(D) \neq POS_{C}(D)$. Since $Sig_{inner}(a_{2}, C, D) = 0$, $a_{2}$ is not a core attribute. Computing $S(a_{2}, G_{3}) = 0.34$, attribute $a_{2}$ is added back to $R$, and finally $POS_{R}(D) = POS_{C}(D)$. After checking whether each attribute in $R$ is necessary, the reduct $A_{1} = \{a_{1}, a_{2}, a_{5}\}$ is obtained, with cost $50$.

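Similarly, when $\lambda = 0$ the reduct is $A_{2} = \{a_{2}, a_{3}, a_{5}\}$, with cost $80$. The weighted reduct is cheaper, which shows that weighting significance against cost is also effective within local groups. For the reducts obtained with $\lambda = 0.5$ and $\lambda = 0$, $SAP(A_{1}) = SAP(A_{2}) = 2.33$, which indicates that the two reducts are equally preferred under this attribute group order.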
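The worked example can be replayed with a compact sketch of Algorithm 2. Several details are assumptions rather than statements from the paper: the denominator of Eq. (10) is taken to be the total cost of the attributes currently in $R$ plus the remaining candidates of the current group, ties are broken by the lowest attribute index, and the redundancy check in step (3) visits attributes in index order. The names `pos_size` and `arcsgo` are hypothetical.

```python
# Table 1: columns a1..a5 followed by the decision D.
ROWS = [
    (2, 1, 1, 1, 1, 1), (1, 1, 1, 2, 1, 1), (1, 2, 0, 1, 2, 1),
    (1, 2, 1, 1, 1, 2), (0, 1, 2, 1, 1, 2), (2, 1, 1, 3, 2, 2),
    (2, 2, 1, 1, 2, 2),
]
COST = {1: 10, 2: 30, 3: 40, 4: 20, 5: 10}   # attribute index -> cost
GROUPS = {3: [2], 2: [1, 3, 5], 1: [4]}      # G3 > G2 > G1
C = frozenset(COST)


def pos_size(B):
    """|POS_B(D)|: number of objects whose B-class carries a single decision."""
    attrs = sorted(B)
    seen = {}
    for row in ROWS:
        seen.setdefault(tuple(row[a - 1] for a in attrs), set()).add(row[-1])
    return sum(len(seen[tuple(row[a - 1] for a in attrs)]) == 1 for row in ROWS)


def arcsgo(lam):
    full = pos_size(C)
    R = set(C)
    for m in sorted(GROUPS):                  # lowest-preference group first
        R -= set(GROUPS[m])
        if pos_size(R) == full:
            continue                          # the whole group is dispensable
        candidates = set(GROUPS[m])
        for a in sorted(candidates):          # step 1 of the loop: core attributes
            if pos_size(C - {a}) < full:
                R.add(a)
                candidates.discard(a)
        # ASSUMED reading of the denominator in Eq. (10).
        denom = sum(COST[a] for a in R | candidates)

        def score(a):                         # weighted significance S(a, R)
            outer = (pos_size(R | {a}) - pos_size(R)) / len(ROWS)
            return lam * (1 - COST[a] / denom) + (1 - lam) * outer

        while pos_size(R) != full and candidates:
            best = max(sorted(candidates), key=score)
            R.add(best)
            candidates.discard(best)
    for a in sorted(R):                       # step (3): drop redundant attributes
        if pos_size(R - {a}) == pos_size(R):
            R.discard(a)
    return sorted(R)


for lam in (0.5, 0.0):
    reduct = arcsgo(lam)
    print(lam, reduct, sum(COST[a] for a in reduct))
# 0.5 [1, 2, 5] 50
# 0.0 [2, 3, 5] 80
```

Under these assumptions the sketch returns $\{a_{1}, a_{2}, a_{5}\}$ with cost 50 for $\lambda = 0.5$ and $\{a_{2}, a_{3}, a_{5}\}$ with cost 80 for $\lambda = 0$, in agreement with $A_{1}$ and $A_{2}$ above.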

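Attribute group order is an intermediate state between a totally ordered and a completely unordered attribute set. When the attribute set is completely unordered, that is, when the number of groups is 1, the proposed algorithm degenerates to the ARWW algorithm^[25], which searches for attributes globally by weighting attribute significance against cost and yields reducts for different weights. When the attribute set is totally ordered, that is, when the number of groups equals the number of attributes, the algorithm simply deletes attributes in the user's preference order and obtains a reduct with a high degree of preference; since every group contains a single attribute, the local attribute selection no longer has any effect. The experiments also examine how the influence of the parameter $\lambda$ changes as the number of groups increases.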

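3 Experimental Analysis

To verify the effectiveness of the proposed algorithm, five UCI datasets are used in the experiments. The letters $a, b, c, d, e$ denote the condition attributes of the Lymphography, Lung-Cancer, Dermatology, Breast-Cancer-Wisconsin (BCW) and Connect-4 datasets, respectively. Because different users have different preferences over attributes, the attributes of each dataset are divided at random into groups of unequal size, with the number of groups set to $G = 3$ and $\lambda$ set to $0$ and $0.5$. Only the 10 groupings of the Lymphography dataset are listed here (Table 2).

Table 2 The grouping methods of Lymphography dataset

ID	Attribute group order $S = G_{3} > G_{2} > G_{1}$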
1	$a_{16}, a_{12}, a_{6}, a_{8}, a_{10} > a_{3}, a_{11}, a_{14}, a_{18},$ $a_{1}, a_{2} > a_{5}, a_{17}, a_{4}, a_{13}, a_{7}, a_{15}, a_{9}$
2	$a_{2}, a_{11}, a_{12}, a_{3}, a_{16}, a_{9}, a_{6} > a_{13}, a_{17},$ $a_{10}, a_{7}, a_{14}, a_{5} > a_{8}, a_{1}, a_{4}, a_{15}, a_{18}$
3	$a_{15}, a_{14}, a_{1}, a_{10}, a_{12}, a_{11} > a_{5}, a_{18}, a_{9},$ $a_{17}, a_{2} > a_{8}, a_{3}, a_{4}, a_{13}, a_{7}, a_{16}, a_{6}$
4	$a_{9}, a_{3}, a_{12}, a_{18}, a_{1}, a_{8} > a_{5}, a_{15}, a_{2},$ $a_{14}, a_{4}, a_{13} > a_{6}, a_{10}, a_{7}, a_{17}, a_{11}, a_{16}$
5	$a_{9}, a_{13}, a_{18}, a_{5}, a_{11}, a_{7} > a_{15}, a_{12}, a_{14},$ $a_{2}, a_{6}, a_{17} > a_{8}, a_{3}, a_{16}, a_{10}, a_{4}, a_{1}$
6	$a_{17}, a_{16}, a_{10}, a_{12}, a_{13} > a_{7}, a_{2}, a_{5}, a_{3},$ $a_{4}, a_{18} > a_{11}, a_{14}, a_{9}, a_{1}, a_{8}, a_{15}, a_{6}$
7	$a_{6}, a_{2}, a_{17}, a_{3}, a_{8} > a_{11}, a_{16}, a_{15}, a_{7},$ $a_{10}, a_{18}, a_{4} > a_{5}, a_{14}, a_{1}, a_{13}, a_{9}, a_{12}$
8	$a_{18}, a_{3}, a_{11}, a_{10}, a_{15} > a_{12}, a_{9}, a_{2}, a_{17},$ $a_{7}, a_{5}, a_{6} > a_{1}, a_{13}, a_{8}, a_{16}, a_{14}, a_{4}$
9	$a_{3}, a_{8}, a_{4}, a_{11}, a_{10}, a_{18}, a_{9} > a_{15}, a_{17},$ $a_{12}, a_{7} > a_{16}, a_{14}, a_{2}, a_{6}, a_{5}, a_{13}, a_{1}$
10	$a_{15}, a_{16}, a_{3}, a_{1}, a_{4}, a_{5} > a_{13}, a_{17}, a_{9},$ $a_{12}, a_{8} > a_{6}, a_{12}, a_{2}, a_{7}, a_{18}, a_{10}, a_{14}$

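In general, the significance of an attribute is positively correlated with the cost of acquiring it, so costs are assigned with the attribute significance algorithm. For example, on the Lymphography dataset the attribute significance algorithm yields the reduct $\{a_{18}, a_{2}, a_{13}, a_{14}, a_{15}, a_{16}\}$. Attributes outside the reduct are first given cost $10$; the costs of the reduct attributes then increase in steps of $10$ in order of significance from low to high. For the Lymphography condition attributes $\{a_{1}, a_{2}, a_{3}, a_{4}, a_{5}, a_{6}, a_{7}, a_{8}, a_{9}, a_{10}, a_{11}, a_{12}, a_{13}, a_{14}, a_{15}, a_{16}, a_{17}, a_{18}\}$, the resulting costs are $\{10, 60, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 50, 40, 30, 20, 10, 70\}$.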
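The cost assignment rule can be written as a short sketch. It assumes that the reduct in the text is listed from the most to the least significant attribute, which is the order consistent with the cost vector given above; `assign_costs` and its parameters are hypothetical names.

```python
def assign_costs(n_attrs, reduct_by_significance, base=10, step=10):
    """Cost `base` outside the reduct; inside it, costs grow by `step`
    from the least to the most significant attribute.

    `reduct_by_significance` lists attribute indices from most to least
    significant (an assumption about the ordering used in the text).
    """
    costs = {a: base for a in range(1, n_attrs + 1)}
    for rank, a in enumerate(reversed(reduct_by_significance), start=1):
        costs[a] = base + rank * step
    return costs


costs = assign_costs(18, [18, 2, 13, 14, 15, 16])
print([costs[a] for a in range(1, 19)])
# [10, 60, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 50, 40, 30, 20, 10, 70]
```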

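Tables 3 to 7 list the reducts produced by ARCSGO on the five datasets when attribute significance and cost are weighted ($\lambda = 0.5$) and when only attribute significance is considered ($\lambda = 0$). First, the algorithm yields different reducts for different groupings. Second, with $G = 3$ groups the reducts on the experimental datasets all have high $SAP$ values, because the algorithm keeps the more preferred groups when deleting groups. In terms of cost, local weighting remains effective under attribute group order: the reducts obtained with $\lambda = 0.5$ cost the same as or less than those obtained with $\lambda = 0$. Consider the worst case, in which the attributes of largest significance form a reduct and all lie in the most preferred group. Because the algorithm deletes attributes in group order first, the final reduct is then the one produced by the attribute significance algorithm, and its cost is the largest. The algorithm thus treats the attribute group order as the user's primary preference and then adds attributes from within each group; it obtains reducts with high $SAP$, whose cost depends on how the user groups the attributes.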

Table 3 Reduction results of Lymphography dataset under ten different grouping methods

ID	ARCSGO ($\lambda = 0.5$)			ARCSGO ($\lambda = 0$)
ID	Reduct	COST	SAP	Reduct	COST	SAP
1	$a_{1}, a_{3}, a_{11}, a_{14}, a_{2}, a_{10}, a_{12}, a_{8}, a_{6}$	170	2.44	$a_{14}, a_{18}, a_{3}, a_{1}, a_{10}, a_{12}, a_{16}, a_{18}$	180	2.5
2	$a_{14}, a_{13}, a_{6}, a_{12}, a_{11}, a_{3}, a_{9}, a_{16}, a_{2}$	220	2.77	$a_{14}, a_{13}, a_{2}, a_{12}, a_{11}, a_{6}, a_{3}, a_{16}, a_{9}$	220	2.77
3	$a_{8}, a_{5}, a_{2}, a_{14}, a_{15}, a_{10}, a_{12}, a_{1}$	180	2.5	$a_{8}, a_{5}, a_{2}, a_{14}, a_{1}, a_{12}, a_{10}, a_{15}$	180	2.5
4	$a_{6}, a_{14}, a_{15}, a_{13}, a_{12}, a_{1}, a_{18}$	220	2.28	$a_{6}, a_{13}, a_{2}, a_{15}, a_{14}, a_{18}, a_{12}$	270	2.14
5	$a_{12}, a_{14}, a_{6}, a_{15}, a_{2}, a_{11}, a_{5}, a_{13}$	220	2.37	$a_{14}, a_{15}, a_{12}, a_{2}, a_{18}, a_{13}, a_{11}$	270	2.42
6	$a_{8}, a_{6}, a_{2}, a_{18}, a_{12}, a_{10}, a_{16}, a_{13}$	240	2.25	$a_{14}, a_{2}, a_{18}, a_{13}, a_{10}, a_{17}, a_{16}$	260	2.42
7	$a_{5}, a_{10}, a_{15}, a_{16}, a_{18}, a_{6}, a_{8}, a_{17}, a_{2}$	230	2.33	$a_{5}, a_{18}, a_{15}, a_{10}, a_{16}, a_{2}, a_{8}, a_{6}, a_{17}$	230	2.33
8	$a_{8}, a_{12}, a_{17}, a_{6}, a_{7}, a_{5}, a_{2}, a_{10}, a_{15}, a_{18}$	230	2.2	$a_{13}, a_{2}, a_{6}, a_{17}, a_{18}, a_{10}, a_{15}, a_{11}$	250	2.37
9	$a_{6}, a_{1}, a_{12}, a_{17}, a_{7}, a_{15}, a_{10}, a_{8}, a_{3}, a_{11}, a_{18}$	190	2.27	$a_{14}, a_{15}, a_{12}, a_{18}, a_{3}, a_{8}, a_{10}, a_{11}$	190	2.5
10	$a_{10}, a_{6}, a_{12}, a_{17}, a_{8}, a_{13}, a_{1}, a_{5}, a_{16}, a_{15}$	170	2.22	$a_{14}, a_{6}, a_{13}, a_{17}, a_{1}, a_{5}, a_{15}, a_{16}$	180	2.25
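The COST and SAP columns can be spot-checked against Table 2 and the cost vector. The sketch below does this for the $\lambda = 0.5$ reduct of grouping ID 1. SAP is defined earlier in the paper, outside this section; here it is assumed to be the mean group index of the attributes in the reduct, an assumption that reproduces the reported values. The function names are hypothetical.

```python
# Grouping ID 1 from Table 2 (G3 > G2 > G1), attributes given by index.
GROUPS = {
    3: [16, 12, 6, 8, 10],
    2: [3, 11, 14, 18, 1, 2],
    1: [5, 17, 4, 13, 7, 15, 9],
}
# Costs of a1..a18 as assigned for the Lymphography dataset.
COSTS = [10, 60, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 50, 40, 30, 20, 10, 70]


def total_cost(reduct):
    return sum(COSTS[a - 1] for a in reduct)


def sap(reduct, groups):
    """ASSUMED definition: average group index of the reduct's attributes."""
    group_of = {a: g for g, attrs in groups.items() for a in attrs}
    return sum(group_of[a] for a in reduct) / len(reduct)


reduct = [1, 3, 11, 14, 2, 10, 12, 8, 6]   # Table 3, ID 1, lambda = 0.5
print(total_cost(reduct))                  # 170
print(f"{sap(reduct, GROUPS):.3f}")        # 2.444
```

The cost matches the table entry of 170, and 22/9 ≈ 2.444 matches the reported SAP of 2.44. The same formula gives 7/3 ≈ 2.33 for both reducts of the worked example on Table 1.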

Table 4 Reduction results of Lung-Cancer dataset under ten different grouping methods

ID	ARCSGO ($\lambda = 0.5$)			ARCSGO ($\lambda = 0$)
ID	Reduct	COST	SAP	Reduct	COST	SAP
1	$b_{33}, b_{35}, b_{32}, b_{23}, b_{7}, b_{36}, b_{14}$	80	3	$b_{14}, b_{12}, b_{23}, b_{36}, b_{33}, b_{7}$	100	3
2	$b_{41}, b_{53}, b_{19}, b_{3}, b_{12}$	100	3	$b_{40}, b_{6}, b_{12}, b_{3}, b_{34}$	190	3
3	$b_{37}, b_{35}, b_{29}, b_{7}, b_{20}, b_{4}, b_{53}$	70	3	$b_{37}, b_{35}, b_{29}, b_{7}, b_{20}, b_{4}, b_{53}$	70	3
4	$b_{33}, b_{10}, b_{8}, b_{28}, b_{29}, b_{32}$	60	3	$b_{28}, b_{8}, b_{10}, b_{25}, b_{32}, b_{5}, b_{38}$	70	3
5	$b_{23}, b_{26}, b_{14}, b_{54}, b_{13}$	60	3	$b_{43}, b_{6}, b_{54}, b_{4}, b_{26}$	90	3
6	$b_{2}, b_{7}, b_{20}, b_{34}, b_{35}$	50	3	$b_{43}, b_{2}, b_{53}, b_{3}, b_{13}, b_{7}$	80	3
7	$b_{37}, b_{35}, b_{13}, b_{3}, b_{33}, b_{2}, b_{49}$	90	3	$b_{40}, b_{35}, b_{2}, b_{49}, b_{3}, b_{24}$	130	3
8	$b_{37}, b_{51}, b_{16}, b_{34}, b_{7}, b_{33}, b_{38}$	70	3	$b_{37}, b_{51}, b_{6}, b_{34}, b_{7}$	90	3
9	$b_{37}, b_{16}, b_{29}, b_{3}, b_{53}, b_{24}, b_{18}$	90	3	$b_{40}, b_{6}, b_{20}, b_{2}, b_{24}, b_{29}$	150	3
10	$b_{41}, b_{56}, b_{13}, b_{34}, b_{3}$	70	3	$b_{41}, b_{3}, b_{14}, b_{6}, b_{19}$	120	3

Table 5 Reduction results of Dermatology dataset under ten different grouping methods

ID	ARCSGO ($\lambda = 0.5$)			ARCSGO ($\lambda = 0$)
ID	Reduct	COST	SAP	Reduct	COST	SAP
1	$c_{21}, c_{5}, c_{28}, c_{32}, c_{17}, c_{18}, c_{20}, c_{31}, c_{8}, c_{1}, c_{4}$	150	2.9	$c_{22}, c_{8}, c_{5}, c_{31}, c_{28}, c_{2}, c_{4}, c_{17}, c_{32}, c_{1}, c_{18}$	210	2.9
2	$c_{3}, c_{5}, c_{9}, c_{14}, c_{20}, c_{29}, c_{17}, c_{16}, c_{28}$	110	2.11	$c_{4}, c_{16}, c_{9}, c_{8}, c_{19}, c_{32}, c_{28}, c_{18}, c_{13}, c_{10}$	150	2.7
3	$c_{5}, c_{29}, c_{15}, c_{10}, c_{16}, c_{32}, c_{7}, c_{18}, c_{2}, c_{1}, c_{4}$	170	2.9	$c_{5}, c_{22}, c_{29}, c_{15}, c_{4}, c_{7}, c_{32}, c_{16}, c_{1}, c_{18}$	220	2.9
4	$c_{4}, c_{19}, c_{17}, c_{2}, c_{6}, c_{24}, c_{10}, c_{34}$	150	2.87	$c_{4}, c_{34}, c_{19}, c_{2}, c_{6}, c_{22}, c_{17}, c_{24}$	210	2.87
5	$c_{32}, c_{21}, c_{9}, c_{14}, c_{26}, c_{5}, c_{3}, c_{18}, c_{8}, c_{4}$	130	2.8	$c_{34}, c_{22}, c_{4}, c_{3}, c_{14}, c_{5}, c_{6}$	200	2.85
6	$c_{26}, c_{28}, c_{14}, c_{19}, c_{18}, c_{1}, c_{33}, c_{34}$	130	3	$c_{25}, c_{34}, c_{19}, c_{28}, c_{1}, c_{18}, c_{14}, c_{26}$	130	3
7	$c_{34}, c_{5}, c_{19}, c_{3}, c_{32}, c_{18}, c_{10}, c_{7}$	120	2.62	$c_{34}, c_{5}, c_{4}, c_{32}, c_{3}, c_{10}$	130	2.5
8	$c_{2}, c_{17}, c_{21}, c_{18}, c_{16}, c_{33}, c_{5}, c_{3}, c_{28}, c_{9}$	120	2.5	$c_{34}, c_{22}, c_{4}, c_{3}, c_{28}, c_{33}, c_{5}$	200	2.71
9	$c_{19}, c_{17}, c_{28}, c_{34}, c_{18}, c_{33}, c_{4}$	140	2.57	$c_{16}, c_{19}, c_{34}, c_{4}, c_{18}, c_{10}, c_{29}$	160	2.71
10	$c_{5}, c_{21}, c_{17}, c_{25}, c_{19}, c_{9}, c_{3}, c_{18}, c_{7}, c_{16}$	120	2.7	$c_{34}, c_{28}, c_{3}, c_{16}, c_{19}, c_{18}, c_{11}, c_{7}$	140	2.75

Table 6 Reduction results of BCW dataset under ten different grouping methods

ID	ARCSGO ($\lambda = 0.5$)			ARCSGO ($\lambda = 0$)
ID	Reduct	COST	SAP	Reduct	COST	SAP
1	$d_{6}, d_{8}, d_{1}, d_{5}$	110	2.25	$d_{6}, d_{5}, d_{8}, d_{7}, d_{9}$	110	2.6
2	$d_{6}, d_{3}, d_{8}, d_{4}$	110	2.5	$d_{6}, d_{3}, d_{1}, d_{8}$	120	2.5
3	$d_{6}, d_{8}, d_{4}, d_{3}$	110	2	$d_{6}, d_{3}, d_{4}, d_{7}, d_{9}$	120	2.4
4	$d_{6}, d_{7}, d_{2}, d_{9}, d_{5}$	110	2.2	$d_{6}, d_{7}, d_{2}, d_{5}, d_{4}$	110	2.2
5	$d_{6}, d_{2}, d_{7}, d_{4}, d_{5}$	110	2.2	$d_{6}, d_{3}, d_{7}, d_{5}$	130	2
6	$d_{6}, d_{7}, d_{9}, d_{8}, d_{5}$	110	2.2	$d_{6}, d_{3}, d_{4}, d_{8}$	110	2.25
7	$d_{6}, d_{7}, d_{9}, d_{8}, d_{5}$	110	2.2	$d_{6}, d_{1}, d_{5}, d_{8}$	110	2.25
8	$d_{6}, d_{7}, d_{1}, d_{4}$	90	2.5	$d_{6}, d_{7}, d_{1}, d_{4}$	90	2.5
9	$d_{6}, d_{2}, d_{7}, d_{1}$	90	2.25	$d_{6}, d_{5}, d_{1}, d_{8}$	110	2.25
10	$d_{6}, d_{2}, d_{7}, d_{1}$	90	2	$d_{6}, d_{3}, d_{1}, d_{8}$	120	2.25

Table 7 Reduction results of Connect-4 dataset under ten different grouping methods

ID	ARCSGO ($\lambda = 0.5$)			ARCSGO ($\lambda = 0$)
ID	Reduct	COST	SAP	Reduct	COST	SAP
1	$e_{26}, e_{32}, e_{14}, e_{7}, e_{8}, e_{21}, e_{25}, e_{1},$ $e_{2}, e_{31}, e_{22}, e_{28}, e_{37}$	670	1.76	$e_{14}, e_{21}, e_{15}, e_{26}, e_{1}, e_{31}, e_{25}, e_{2},$ $e_{38}, e_{22}, e_{37}, e_{13}, e_{24}, e_{16}$	760	2.07
2	$e_{14}, e_{1}, e_{17}, e_{7}, e_{21}, e_{15}, e_{26}, e_{22}, e_{16},$ $e_{3}, e_{31}, e_{8}, e_{38}, e_{28}, e_{25}, e_{32}, e_{2}$	720	1.94	$e_{14}, e_{21}, e_{1}, e_{37}, e_{15}, e_{22}, e_{31}, e_{8},$ $e_{26}, e_{3}, e_{25}, e_{32}, e_{13}, e_{2}$	850	1.92
3	$e_{16}, e_{2}, e_{31}, e_{22}, e_{21}, e_{38}, e_{32}, e_{7},$ $e_{8}, e_{37}, e_{1}, e_{25}, e_{15}$	700	2	$e_{31}, e_{2}, e_{26}, e_{22}, e_{21}, e_{37}, e_{8}, e_{7},$ $e_{32}, e_{38}, e_{1}, e_{15}, e_{25}$	700	2
4	$e_{22}, e_{1}, e_{17}, e_{8}, e_{15}, e_{25}, e_{38}, e_{16}, e_{28},$ $e_{14}, e_{7}, e_{26}, e_{31}, e_{3}, e_{32}, e_{2}, e_{21}$	720	2.05	$e_{1}, e_{37}, e_{13}, e_{15}, e_{25}, e_{14}, e_{7}, e_{16},$ $e_{21}, e_{31}, e_{23}, e_{2}, e_{32}$	940	2.07
5	$e_{38}, e_{17}, e_{7}, e_{21}, e_{1}, e_{25}, e_{16}, e_{32}, e_{2},$ $e_{8}, e_{22}, e_{26}, e_{3}, e_{28}, e_{14}, e_{31}, e_{15}$	720	2.17	$e_{37}, e_{21}, e_{1}, e_{8}, e_{25}, e_{2}, e_{13}, e_{32},$ $e_{22}, e_{15}, e_{31}, e_{26}, e_{3}, e_{14}$	850	2.28
6	$e_{39}, e_{32}, e_{1}, e_{7}, e_{25}, e_{31}, e_{3}, e_{16}, e_{27},$ $e_{17}, e_{2}, e_{8}, e_{22}, e_{14}, e_{15}, e_{37}, e_{13}$	870	2.05	$e_{1}, e_{21}, e_{26}, e_{38}, e_{31}, e_{25}, e_{2}, e_{23},$ $e_{8}, e_{14}, e_{15}, e_{37}, e_{22}, e_{13}$	960	2.07
7	$e_{22}, e_{38}, e_{32}, e_{25}, e_{31}, e_{1}, e_{21}, e_{7},$ $e_{8}, e_{16}, e_{2}, e_{37}, e_{15}$	700	2.07	$e_{22}, e_{38}, e_{14}, e_{21}, e_{31}, e_{1}, e_{25}, e_{26},$ $e_{7}, e_{37}, e_{23}, e_{15}, e_{2}, e_{16}$	810	2.14
8	$e_{25}, e_{28}, e_{32}, e_{14}, e_{8}, e_{22}, e_{38}, e_{3}, e_{2},$ $e_{17}, e_{15}, e_{31}, e_{1}, e_{26}, e_{16}, e_{7}, e_{21}$	720	2.05	$e_{37}, e_{14}, e_{25}, e_{23}, e_{15}, e_{2}, e_{22}, e_{38},$ $e_{31}, e_{21}, e_{7}, e_{1}, e_{16}, e_{26}$	810	2.21
9	$e_{22}, e_{16}, e_{1}, e_{26}, e_{38}, e_{24}, e_{14}, e_{15},$ $e_{37}, e_{25}, e_{31}, e_{9}, e_{2}, e_{7}, e_{21}$	700	2.2	$e_{1}, e_{22}, e_{16}, e_{37}, e_{15}, e_{14}, e_{26}, e_{38},$ $e_{21}, e_{7}, e_{31}, e_{23}, e_{2}, e_{25}$	810	2.21
10	$e_{38}, e_{1}, e_{32}, e_{2}, e_{17}, e_{14}, e_{8}, e_{22}, e_{25},$ $e_{31}, e_{15}, e_{26}, e_{16}, e_{3}, e_{28}, e_{7}, e_{21}$	720	1.94	$e_{14}, e_{37}, e_{1}, e_{2}, e_{38}, e_{31}, e_{25}, e_{15}, e_{13},$ $e_{21}, e_{7}, e_{26}, e_{16}, e_{9}, e_{28}, e_{33}$	840	2.12

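Figs. 2 to 6 show, for the 10 groupings of each of the five datasets, the $SAP$ values of the ARCSGO algorithm, the ARWW algorithm and the attribute significance algorithm (Attribute Reduction algorithm with Importance, ARWI)^[26]. ARCSGO attains higher $SAP$ values than ARWW and ARWI, which indicates that the proposed algorithm finds attribute sets that rank higher, that is, are more preferred, under the given grouping. The reason is that the algorithm takes the user's attribute group order into account and keeps the more preferred attributes first. The parameter $\lambda$ only governs attribute selection inside a group and has little effect on $SAP$: the $SAP$ values for $\lambda = 0$ and $\lambda = 0.5$ differ only slightly. When the reduct found by ARWW or ARWI lies entirely in the most preferred group, ARCSGO and the other two algorithms produce reducts with the same $SAP$ value. In general, ARCSGO yields reducts that better match the user's preference.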

Fig.2 SAP value of Lymphography dataset

Fig.3 SAP value of Lung-Cancer dataset

Fig.4 SAP value of Dermatology dataset

Fig.5 SAP value of BCW dataset

Fig.6 SAP value of Connect-4 dataset

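Figs. 7 to 11 plot the cost of the reducts produced by ARCSGO for $\lambda = 0$, $\lambda = 0.5$ and $\lambda = 1$ as the number of groups increases. For each dataset, the grouping with $ID = 1$ is refined step by step, and each refinement preserves the preference relation of the previous level of grouping. As the number of groups increases, the cost under every value of $\lambda$ eventually stabilizes. The reason is that, as the number of groups grows, each group contains fewer attributes and the user's preference over attributes becomes more explicit; moreover, the proposed algorithm gives priority to the attribute group order and uses $\lambda$ only as a parameter for local attribute selection, so the influence of $\lambda$ diminishes as the number of groups increases.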

Fig.7 Cost under different groups of Lymphography dataset

Fig.8 Cost under different groups of Lung-Cancer dataset

Fig.9 Cost under different groups of Dermatology dataset

Fig.10 Cost under different groups of BCW dataset

Fig.11 Cost under different groups of Connect-4 dataset

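4 Conclusion

Considering user preference and the cost-sensitive problems that arise in practice, this paper proposes a cost-sensitive attribute reduction algorithm under attribute group order. The algorithm builds on the idea of grouping: whole groups of attributes are deleted, and attributes are then added back from within a group. Compared with selecting attributes by attribute significance alone, adding attributes by local weighting yields reducts of lower cost. The average subset position ($SAP$) is defined to describe how high a subset ranks under the grouping, which represents the user's degree of preference for the subset, and the designed algorithm obtains high $SAP$ values. The worked example and the experiments on UCI datasets demonstrate the feasibility and effectiveness of the algorithm. This paper studies symbolic data using equivalence relations; for real-valued or mixed data, the equivalence relation can be extended to a neighborhood relation or a fuzzy relation. In addition, users can group attributes in many different ways, and the more explicit the user's preference over attributes is, that is, the more groups there are, the smaller the influence of the weighting becomes. How to determine a suitable number of groups so as to obtain reducts with lower cost and higher $SAP$ is the focus of future work.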

References

[1] Pawlak. Rough sets. International Journal of Computer and Information Sciences, 1982, 11(5): 341-356.

[2] Wang G Y, Yao Y Y, Yu H. A survey on rough set theory and applications. Chinese Journal of Computers, 2009, 32(7): 1229-1246. (in Chinese)

[3] Yu H, Wang G Y, Yao Y Y. Current research and future perspectives on decision-theoretic rough sets. Chinese Journal of Computers, 2015, 38(8): 1628-1639. (in Chinese)

[4] Chen H, Yang J A, Zhuang Z Q. The core of attributes and minimal attributes reduction in variable precision rough set. Chinese Journal of Computers, 2012, 35(5): 1011-1017. (in Chinese)

[5] Yao Y Y, Lin T Y. Generalization of rough sets using modal logics. Intelligent Automation & Soft Computing, 1996, 2(2): 103-119.

[6] Yao Y Y. Decision-theoretic rough set models. In: Proceedings of 2nd International Conference on Rough Sets and Knowledge Technology. Springer Berlin Heidelberg, 2007.

[7] Yang C J, Ge H, Wang Z S. Overview of attribute reduction based on rough set. Application Research of Computers, 2012, 29(1): 16-20. (in Chinese)

[8] Qian Y H, Liang J Y, Pedrycz, et al. Positive approximation: an accelerator for attribute reduction in rough set theory. Artificial Intelligence, 2010, 174(9-10): 597-618.

[9] Lazo-Cortés M S, Martínez-Trinidad J F, Carrasco-Ochoa J A, et al. A new algorithm for computing reducts based on the binary discernibility matrix. Intelligent Data Analysis, 2016, 20(2): 317-337.

[10] Gao, Lai Z H, Zhou, et al. Maximum decision entropy-based attribute reduction in decision-theoretic rough set model. Knowledge-Based Systems, 2018, 143: 179-191.

[11] Wang Y B, Chen X J, Dong. Attribute reduction via local conditional entropy. International Journal of Machine Learning and Cybernetics, 2019, 10(12): 3619-3634.

[12] Min, He H P, Qian Y H, et al. Test-cost-sensitive attribute reduction. Information Sciences, 2011, 181(22): 4928-4942.

[13] Fang, Min. Cost-sensitive approximate attribute reduction with three-way decisions. International Journal of Approximate Reasoning, 2019, 104: 148-165.

[14] X A, Zhao X R. Cost-sensitive three-way class-specific attribute reduction. International Journal of Approximate Reasoning, 2019, 105: 153-174.

[15] Jia X Y, Liao W H, Tang Z M, et al. Minimum cost attribute reduction in decision-theoretic rough set models. Information Sciences, 2013, 219: 151-167.

[16] Wang, Wang. Reduction algorithms based on discernibility matrix: the ordered attributes method. Journal of Computer Science and Technology, 2001, 16(6): 489-504.

[17] Zhao, Wang. A reduction algorithm meeting users' requirements. Journal of Computer Science and Technology, 2002, 17(5): 578-593.

[18] Yao Y Y, Zhao, Wang, et al. A model of machine learning based on user preference of attributes. In: International Conference on Rough Sets and Current Trends in Computing. Springer Berlin Heidelberg, 2006: 587-596.

[19] Yao Y Y, Zhao, Wang, et al. A model of user-oriented reduct construction for machine learning. Transactions on Rough Sets VIII. Springer Berlin Heidelberg, 2008: 332-351.

[20] Han S Q, Wang. Reduct and attribute order. Journal of Computer Science and Technology, 2004, 19(4): 429-449.

[21] Guan L H, Wang G Y, Hu F. A decision rules mining algorithm based on attribute order. Control and Decision, 2012, 27(2): 313-316. (in Chinese)

[22] Han S Q, Yin G M. An user-oriented attribute reduct construction algorithm. Pattern Recognition and Artificial Intelligence, 2014, 27(3): 281-288. (in Chinese)

[23] Hu F, Wang G Y. Quick reduction algorithm based on attribute order. Chinese Journal of Computers, 2007, 30(8): 1429-1435. (in Chinese)

[24] Wang G Y. Rough set theory and knowledge acquisition. Xi'an: Xi'an Jiaotong University Press, 2001: 23-26, 133-136. (in Chinese)

[25] Zhang Q H, Shen. Research on attribute reduction algorithm with weights. Journal of Intelligent & Fuzzy Systems, 2014, 27(2): 1011-1019.

[26] K Y, Lu Y C, Shi C Y. Advances in rough set theory and its applications. Journal of Tsinghua University (Science and Technology), 2001, 41(1): 64-68.