Single-Channel Speech Separation Based on Non-negative Matrix Factorization and Factorial Conditional Random Field
-
Graphical Abstract
-
Abstract
A new non-negative matrix factorization (NMF) based algorithm is proposed for single-channel speech separation with a priori known speakers; it aims to better model the spectral structure and temporal continuity of speech signals. First, NMF and k-means clustering are employed to obtain, for each speaker, multiple small dictionaries and a state sequence that describes the temporal dynamics between these dictionaries. Then, a factorial conditional random field (FCRF) model is trained on the state sequences and dictionaries to jointly model the temporal continuity of the two speakers in the mixed signal for separation. Experiments show that the proposed algorithm outperforms the baselines on all metrics, for example sparse NMF (+1.12 dB SDR, +2.37 dB SIR, +0.40 dB SAR, +0.2 MOS), the nonnegative factorial hidden Markov model (+2.04 dB SDR, +4.26 dB SIR, +0.62 dB SAR, +1.0 MOS), and standard NMF (+2.8 dB SDR, +5.08 dB SIR, +1.06 dB SAR, +1.2 MOS).
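The dictionary-learning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how NMF and k-means are combined, so the ordering used here (cluster the normalized spectrogram frames first, then fit one small NMF dictionary per cluster) is an assumption, as are the Euclidean multiplicative updates, the function names `nmf`, `kmeans`, and `train_speaker`, and all parameter values. The FCRF stage is not covered.

```python
import numpy as np


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor V (freq x frames) as W @ H using Euclidean multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on the rows of X; returns one label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def train_speaker(V, n_states=3, atoms_per_state=4):
    """Assumed pipeline: k-means gives the state sequence, NMF gives one
    small dictionary per state (hypothetical ordering, see lead-in)."""
    frames = V.T / (np.linalg.norm(V.T, axis=1, keepdims=True) + 1e-9)
    states = kmeans(frames, n_states)
    dictionaries = []
    for s in range(n_states):
        Vs = V[:, states == s]
        if Vs.shape[1] == 0:
            Vs = V  # fallback for an empty cluster: use all frames
        W, _ = nmf(Vs, atoms_per_state)
        dictionaries.append(W)
    return dictionaries, states


# Toy stand-in for one speaker's magnitude spectrogram (16 bins, 60 frames).
V = np.random.default_rng(1).random((16, 60))
dictionaries, states = train_speaker(V)
```

Here `states` holds one dictionary index per frame, which is the kind of sequence a temporal model such as the FCRF could be trained on, and `dictionaries` holds one non-negative basis matrix per state.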
-
-