Abstract:
Clustering ensembles have
emerged as a powerful method for improving both the robustness and
the stability of unsupervised classification solutions. However,
finding a consensus clustering from multiple partitions is a difficult
problem that can be approached from graph-based, combinatorial, or
statistical perspectives. This study extends previous research on
clustering ensembles in several respects. First, we introduce a unified
representation for multiple clusterings and formulate the corresponding
categorical clustering problem. Second, we propose a probabilistic model
of consensus using a finite mixture of multinomial distributions in a
space of clusterings. A combined partition is found as a solution to the
corresponding maximum-likelihood problem using the EM algorithm. Third,
we define a new consensus function that is related to the classical
intraclass variance criterion using the generalized mutual information
definition. Finally, we demonstrate the efficacy of combining partitions
generated by weak clustering algorithms that use data projections and
random data splits. A simple explanatory model is offered for the
behavior of combinations of such weak clustering components. Combination
accuracy is analyzed as a function of several parameters that control
the power and resolution of component partitions as well as the number
of partitions. We also analyze clustering ensembles with incomplete
information and the effect of missing cluster labels on the quality of
overall consensus. Experimental results demonstrate the effectiveness of
the proposed methods on several real-world data sets.
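The sketch below (not code from the paper) illustrates the consensus scheme the abstract describes, under simplifying assumptions: an ensemble of weak clusterings is produced by running k-means on random 1-D projections of the data, each object is then represented by its vector of ensemble cluster labels, and a finite mixture of multinomial distributions over this categorical label space is fit with EM; the consensus partition is read off the maximum-posterior mixture component for each object. All function names, component counts, smoothing constants, and iteration limits are illustrative assumptions.

# Minimal sketch of EM-based consensus over weak clusterings (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def weak_partitions(X, n_partitions=20, k=3, iters=10):
    """Cluster random 1-D projections of X; return an (n, H) label matrix."""
    n, d = X.shape
    labels = np.empty((n, n_partitions), dtype=int)
    for h in range(n_partitions):
        proj = X @ rng.normal(size=d)                 # random 1-D projection
        centers = rng.choice(proj, size=k, replace=False)
        for _ in range(iters):                        # plain 1-D k-means
            assign = np.abs(proj[:, None] - centers[None, :]).argmin(axis=1)
            for c in range(k):
                if np.any(assign == c):
                    centers[c] = proj[assign == c].mean()
        labels[:, h] = assign
    return labels

def consensus_em(labels, n_consensus=3, n_iter=100, smooth=1e-2):
    """EM for a mixture of multinomials over the categorical label vectors."""
    n, H = labels.shape
    k = labels.max() + 1
    # one-hot encode: onehot[i, h, j] = 1 if object i got label j in partition h
    onehot = np.zeros((n, H, k))
    onehot[np.arange(n)[:, None], np.arange(H)[None, :], labels] = 1.0
    pi = np.full(n_consensus, 1.0 / n_consensus)              # mixing weights
    theta = rng.dirichlet(np.ones(k), size=(n_consensus, H))  # label probabilities
    for _ in range(n_iter):
        # E-step: responsibility of each consensus component for each object
        logp = np.log(pi)[None, :] + np.einsum('ihj,mhj->im', onehot, np.log(theta))
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing weights and per-partition multinomial parameters
        pi = resp.mean(axis=0)
        theta = np.einsum('im,ihj->mhj', resp, onehot) + smooth
        theta /= theta.sum(axis=2, keepdims=True)
    return resp.argmax(axis=1)                                # consensus labels

# Usage on synthetic data with three well-separated blobs (illustrative only).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 3, 6)])
ensemble = weak_partitions(X, n_partitions=30, k=3)
consensus = consensus_em(ensemble, n_consensus=3)
print(np.bincount(consensus))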
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume: 27, Issue: 12, Dec. 2005)