Publication Details

X. Zheng, C. Ritz & J. Xi, "Encoding and communicating navigable speech soundfields," Multimedia Tools and Applications, vol. 75, pp. 5183-5204, 2016.


This paper describes a system for encoding and communicating navigable speech soundfields for applications such as immersive audio/visual conferencing, audio surveillance of large spaces and free viewpoint television. The system records speech soundfields using compact coincident microphone arrays. The recordings are then processed to identify sources and their spatial locations using the well-known assumption that speech signals are sparse in the time-frequency domain. A low-delay Direction of Arrival (DOA)-based frequency domain sound source separation approach is proposed that requires only 250 ms of speech signal. Joint compression is achieved through a previously proposed perceptual analysis-by-synthesis spatial audio coding scheme that encodes sources into a mixture signal that can be compressed by a standard speech codec at 32 kbps. By also transmitting side information representing the original spatial location of each source, the received mixtures can be decoded and then flexibly reproduced using loudspeakers at a chosen listening point within a synthesised speech scene. The system was implemented based on this framework for an example application encoding a three-talker navigable speech scene at a total bit rate of 48 kbps. Subjective listening tests were conducted to evaluate the quality of the reproduced speech scenes at a new listening point as compared to a true recording at that point. Results demonstrate that the approach successfully encodes multiple spatial speech scenes at low bit rates whilst maintaining perceptual quality in both anechoic and reverberant environments.
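
The sparsity assumption behind DOA-based separation can be illustrated with a minimal sketch. It is not the authors' algorithm. It assumes first-order horizontal signals from a coincident array in a simplified convention (W = S, X = S cos θ, Y = S sin θ, omitting the usual scaling on W), assumes the source azimuths are already known, and uses hypothetical helper names (`estimate_bin_doa`, `separate_by_doa`, `make_scene`). Each time-frequency bin is assigned to the source whose azimuth is closest to that bin's estimated DOA.

```python
import numpy as np

def estimate_bin_doa(W, X, Y):
    """Per-bin azimuth (radians) from the real part of conj(W)*X and conj(W)*Y."""
    ix = np.real(np.conj(W) * X)
    iy = np.real(np.conj(W) * Y)
    return np.arctan2(iy, ix)

def separate_by_doa(W, X, Y, source_doas):
    """Binary-mask W so that each bin goes to the nearest assumed source azimuth."""
    doa = estimate_bin_doa(W, X, Y)
    # Circular (wrapped) distance between each bin's DOA and each source DOA.
    diff = doa[None, :, :] - np.asarray(source_doas)[:, None, None]
    dist = np.abs(np.angle(np.exp(1j * diff)))
    owner = np.argmin(dist, axis=0)
    return [W * (owner == k) for k in range(len(source_doas))]

def make_scene(seed=0, shape=(64, 20)):
    """Synthetic two-talker scene with perfectly disjoint time-frequency supports."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    active = rng.random(shape) < 0.5          # bins owned by source 1
    S1, S2 = S * active, S * ~active
    doas = [np.deg2rad(30.0), np.deg2rad(-60.0)]  # assumed known azimuths
    # Simplified encoding convention (an assumption of this sketch).
    W = S1 + S2
    X = S1 * np.cos(doas[0]) + S2 * np.cos(doas[1])
    Y = S1 * np.sin(doas[0]) + S2 * np.sin(doas[1])
    return S1, S2, W, X, Y, doas

if __name__ == "__main__":
    S1, S2, W, X, Y, doas = make_scene()
    est1, est2 = separate_by_doa(W, X, Y, doas)
    print(np.allclose(est1, S1), np.allclose(est2, S2))
```

Real speech is only approximately sparse, so bins where talkers overlap, or where reverberation dominates, would yield DOA estimates between or away from the true azimuths. A practical system would need to estimate the source azimuths from the data and handle such bins explicitly.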

Grant Number