Sekundarni povzetek: |
This dissertation investigates how to adapt standard classification rule
learning approaches to subgroup discovery. The goal of subgroup
discovery is to find rules describing subsets of a selected population
that are sufficiently large and statistically unusual in terms of class
distribution. The dissertation presents a subgroup discovery algorithm,
CN2-SD, developed by modifying parts of the CN2 classification rule
learner: its covering algorithm, search heuristic, probabilistic
classification of instances, and evaluation measures. Experimental
evaluation of CN2-SD on selected data sets shows a substantial reduction
in the number of induced rules, increased rule coverage, rule
significance and overall coverage of the target concept, as well as
slight improvements in the area under the ROC curve, when compared
with the rule learning algorithms CN2 and RIPPER. An application of CN2-SD
to a large traffic accident data set confirms these findings.
The dissertation also presents the subgroup discovery algorithm
APRIORI-SD, developed by adapting association rule learning to subgroup
discovery. This was achieved by building the classification rule learner
APRIORI-C and enhancing it with a novel post-processing mechanism, a new
quality measure for induced rules (weighted relative accuracy), and
probabilistic classification of instances. Experimental results show
similar behavior of APRIORI-SD and the subgroup discovery algorithm
CN2-SD, i.e., a substantial reduction in the number of induced rules,
increased rule coverage, rule significance and overall coverage of the
target concept, as well as slight improvements in the area under the
ROC curve, when compared with the rule learning algorithms CN2, RIPPER and
APRIORI-C.
A new optimization approach to subgroup discovery based on ROC analysis
is also presented and implemented as an adaptation of the APRIORI-SD
algorithm. The implications of the
“number-of-rules–unusualness–coverage” trade-off for subgroup discovery
are investigated through an experimental evaluation of the adapted
APRIORI-SD algorithm on selected data sets. The results are presented in
the form of 2D graphs depicting the dependencies between the number of
induced rules, unusualness, accuracy and overall coverage of the target
concept. The original APRIORI-SD subgroup discovery algorithm is also
discussed within this new optimization framework.
Finally, the dissertation presents a comparison of the new algorithms
with existing state-of-the-art subgroup discovery algorithms and the
application of CN2-SD and APRIORI-SD to a real-life problem: the
traffic accident database, which describes traffic accidents in
Great Britain. |