A Novel Subspace-Based GMM Clustering Ensemble Algorithm for High-dimensional Data
Abstract
The Gaussian mixture model (GMM) is a classical probabilistic model widely used in unsupervised learning. However, GMM performs poorly on high-dimensional data (HDD) because a large number of parameters must be estimated from relatively few observations. To address this problem, this paper proposes a novel subspace-based GMM clustering ensemble (SubGMM-CE) algorithm tailored for HDD. The proposed SubGMM-CE algorithm comprises three key components. First, a series of low-dimensional subspaces is dynamically determined, taking into account the optimal number of GMM components. Second, the GMM-based clustering algorithm is applied to each subspace to obtain a series of heterogeneous GMM models. Third, the resulting GMM base clusterings are merged using a newly designed relabeling strategy based on the average shared affiliation probability, which generates the final clustering result for high-dimensional unlabeled data. An extensive experimental evaluation validates the feasibility, rationality, effectiveness, and noise robustness of the SubGMM-CE algorithm. The results show that SubGMM-CE produces more stable and more accurate clusterings, outperforming nine state-of-the-art clustering algorithms in terms of normalized mutual information, clustering accuracy, and adjusted Rand index. These results demonstrate the viability of the SubGMM-CE algorithm for addressing the challenges of HDD clustering.
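The merging step can be illustrated with a minimal sketch. It is not the SubGMM-CE implementation: the abstract does not define the average shared affiliation probability, so the sketch assumes it is the mean, over all points, of the product of two soft membership values. The function names (`shared_affiliation`, `align`, `merge`), the choice of the first base clustering as the reference, the brute-force search over label permutations, and the toy membership matrices are all hypothetical and chosen only for illustration.

```python
from itertools import permutations


def shared_affiliation(ref, other):
    """Assumed definition: S[i][j] is the mean over all points x of
    ref[x][i] * other[x][j], i.e. how strongly cluster i of `ref` and
    cluster j of `other` claim the same points."""
    n, k = len(ref), len(ref[0])
    return [[sum(ref[x][i] * other[x][j] for x in range(n)) / n
             for j in range(k)] for i in range(k)]


def align(ref, other):
    """Reorder the columns (labels) of `other` to best match `ref`."""
    k = len(ref[0])
    s = shared_affiliation(ref, other)
    # Brute-force search over label permutations; practical for small k only.
    best = max(permutations(range(k)),
               key=lambda p: sum(s[i][p[i]] for i in range(k)))
    return [[row[best[i]] for i in range(k)] for row in other]


def merge(base_memberships):
    """Relabel every base clustering against the first one, average the
    aligned soft memberships, and assign each point to its top cluster."""
    ref = base_memberships[0]
    aligned = [ref] + [align(ref, m) for m in base_memberships[1:]]
    n, k, b = len(ref), len(ref[0]), len(aligned)
    avg = [[sum(m[x][j] for m in aligned) / b for j in range(k)]
           for x in range(n)]
    return [max(range(k), key=lambda j: row[j]) for row in avg]


# Toy soft memberships for 4 points and 2 clusters (made-up values).
# The second base clustering uses swapped labels relative to the first.
base_a = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
base_b = [[0.2, 0.8], [0.3, 0.7], [0.9, 0.1], [0.6, 0.4]]
print(merge([base_a, base_b]))  # prints [0, 0, 1, 1]
```

In this toy case the second base clustering disagrees with the first only in its label names, so relabeling before averaging recovers a consistent partition. In a fuller pipeline, the membership matrices would come from GMMs fitted on different low-dimensional subspaces, and an assignment solver would replace the permutation search for larger numbers of components.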