Abstract
Consensus
clustering is the problem of reconciling clustering information about
the same data set coming from different sources or from different runs
of the same algorithm. Cast as an optimization problem, consensus
clustering is known as median partition, and has been shown to be
NP-complete. A number of heuristics have been proposed as approximate
solutions, some with performance guarantees. In practice, the problem is
apparently easy to approximate, but guidance is necessary as to which
heuristic to use depending on the number of elements and clusterings
given. We have implemented a number of heuristics for the consensus
clustering problem, and here we compare their performance, independent
of data size, in terms of efficacy and efficiency, on both simulated and
real data sets. We find that based on the underlying algorithms and
their behavior in practice the heuristics can be categorized into two
distinct groups, with ramification as to which one to use in a given
situation, and that a hybrid solution is the best bet in general. We
have also developed a refined consensus clustering heuristic for the
occasions when the given clusterings may be too disparate, and their
consensus may not be representative of any one of them, and we show that
in practice the refined consensus clusterings can be much superior to
the general consensus clustering.