Separation of multiple speech sources by recovering sparse and non-sparse components from B-format microphone recordings
This paper proposes a blind source separation (BSS) method for recovering multiple speech sources from sound fields recorded by a B-format microphone. This microphone provides a four-channel representation that can be used to derive the direction of arrival (DOA) of spatially distinct time-frequency (TF) components. Such sparse components correspond to TF bins where only one speech source is active, and they are identified based on the inter-correlation among the mixture signals. They are recovered via a degenerate unmixing estimation technique (DUET)-like method. A "local-zone stationarity" assumption is proposed, under which the amplitude of a speech signal remains approximately constant within a small band of TF components. This assumption is validated through statistical analysis of a quantitative measure of stationarity. Under this assumption, the non-sparse components (TF points where more than one speech source is active) are recovered via a Wiener-filter-like approach in which the separated sparse components are used as a guide. The final separated sources are obtained by combining the separated sparse and non-sparse components. Both objective and subjective evaluations, covering mixtures of up to six simultaneous speech sources, show that the proposed method achieves better separation quality than some existing BSS approaches.
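
As a rough illustration of two ingredients mentioned above, the sketch below estimates a per-bin azimuth from B-format STFT coefficients and forms Wiener-like ratio masks from per-source power estimates. It is not the paper's implementation. It assumes the common first-order encoding in which a plane wave s arriving from azimuth θ in the horizontal plane gives W = s/√2, X = s·cos θ, Y = s·sin θ, and it uses the intensity-vector DOA estimate as a stand-in for whatever estimator the paper adopts. The function names `estimate_azimuth` and `wiener_like_masks` and the synthetic test signal are hypothetical.

```python
import numpy as np

def estimate_azimuth(W, X, Y):
    """Per-bin azimuth in radians from B-format STFT coefficients.

    Assumption: the real part of conj(W) * [X, Y] points toward the source,
    which holds for a single plane wave under the encoding stated above.
    """
    ix = np.real(np.conj(W) * X)
    iy = np.real(np.conj(W) * Y)
    return np.arctan2(iy, ix)

def wiener_like_masks(source_powers, eps=1e-12):
    """Ratio masks P_i / sum_j P_j from power estimates.

    source_powers has shape (n_sources, ...); the masks have the same shape
    and sum to (approximately) one across sources in every bin.
    """
    p = np.asarray(source_powers, dtype=float)
    return p / (p.sum(axis=0, keepdims=True) + eps)

# Synthetic check: one plane wave from 60 degrees in two TF bins.
theta = np.deg2rad(60.0)
s = np.array([1.0 + 0.5j, -0.3 + 2.0j])  # arbitrary complex TF coefficients
W = s / np.sqrt(2.0)
X = s * np.cos(theta)
Y = s * np.sin(theta)
az = estimate_azimuth(W, X, Y)  # both bins map back to theta
```

In a bin dominated by a single source, the estimated azimuth clusters around that source's direction, which is what makes DUET-like assignment of sparse components possible. The ratio mask is one simple way to share the energy of a multi-source bin among sources once per-source power estimates are available.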