Degree Name

Doctor of Philosophy


School of Electrical, Computer, and Telecommunications Engineering


In recent years, advances in digital audio signal processing have produced practical solutions for 2D soundfield reproduction using headphones and loudspeakers. Compared to traditional monophonic and stereophonic reproduction systems, surround sound greatly enhances perceptual audio quality by providing rich spatialisation effects. Although surround sound has been widely commercialised, the audience can only perceive the soundfield in a pre-rendered manner. In contrast, with the recent development of 3D TV and free viewpoint TV, visual content can be perceived flexibly: users can view the recorded scene from their desired angle and viewpoint. These new technologies can also be employed for the surveillance of large public spaces or infrastructure, where the ability to zoom in and view from multiple angles is desired. Since the visual content can be selectively reproduced, selective audio playback is required to accompany the changing video scenes.

The soundfield navigation technology presented in this thesis provides a sound object based solution that allows listeners to choose their desired listening position within the recorded soundfield while all receiving the same audio stream. The proposed framework aims to provide a complete solution, from the recording configuration through post-processing of the recordings to efficient compression and packet-loss protection techniques. The recording and post-processing techniques are presented first. The proposed framework employs a pair of co-located microphone arrays to capture sufficient information for soundfield navigation. Compared to a traditional recording set-up, using a pair of co-located microphone arrays provides enhanced source separation quality as well as the ability to estimate the geographical location of the sources.

In this thesis, a new low-delay stochastic-based direction of arrival estimation approach is presented. It is applied at each individual co-located microphone array and achieves direction of arrival estimation with a delay of approximately 150 ms. The proposed method employs a time-frequency energy weighted direction of arrival histogram that achieves improved estimation performance compared to existing histogram-based methods.
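The idea of an energy weighted direction of arrival histogram can be illustrated with a minimal sketch. It is not the estimator developed in the thesis: the function names, the 5 degree bin width, and the toy input values are all assumptions made for illustration, and the per-bin direction estimates are taken as given rather than derived from microphone signals.

```python
def energy_weighted_doa_histogram(doas_deg, energies, bin_width_deg=5.0):
    """Accumulate time-frequency energy into azimuth bins over [0, 360).

    doas_deg: one direction estimate (degrees) per time-frequency bin.
    energies: the signal energy of the same time-frequency bins.
    Both inputs and the bin width are illustrative assumptions.
    """
    n_bins = int(round(360.0 / bin_width_deg))
    hist = [0.0] * n_bins
    for doa, energy in zip(doas_deg, energies):
        idx = int((doa % 360.0) // bin_width_deg) % n_bins
        hist[idx] += energy  # weight each vote by its energy
    return hist


def peak_direction(hist, bin_width_deg=5.0):
    """Return the centre angle (degrees) of the histogram bin with the largest value."""
    peak = max(range(len(hist)), key=hist.__getitem__)
    return (peak + 0.5) * bin_width_deg


# Toy data (assumed): three high-energy bins near 43 degrees and
# four low-energy, noise-like bins near 121 degrees.
doas = [42.0, 44.0, 43.0, 120.0, 121.0, 122.0, 123.0]
energies = [1.0, 0.8, 0.9, 0.05, 0.05, 0.05, 0.05]

weighted = energy_weighted_doa_histogram(doas, energies)
unweighted = energy_weighted_doa_histogram(doas, [1.0] * len(doas))
```

In this toy case the unweighted histogram peaks in the bin holding the four noise-like estimates, whereas the energy weighted histogram peaks in the bin holding the three high-energy estimates, which is the motivation for weighting votes by energy.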

These direction of arrival estimates are then used within the proposed collaborative blind source separation scheme to achieve enhanced blind source separation. Compared with existing blind source separation approaches, which are based on separating only the sparse time-frequency components of the recorded signals, the collaborative approach can also separate the non-sparse time-frequency components and thus provides enhanced separation performance. The geographical location of the sources can also be obtained from the collaboration between the microphone arrays.
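One simple way to see how two arrays can yield a source location is to intersect the bearing lines given by their direction of arrival estimates. The sketch below is an assumption about the geometry only, not the localisation method of the thesis: it works in 2D, takes the array positions as known, and the function name `triangulate` is hypothetical.

```python
import math


def triangulate(p1, az1_deg, p2, az2_deg, eps=1e-9):
    """Intersect two bearing lines in the plane.

    p1, p2: (x, y) positions of the two arrays (assumed known).
    az1_deg, az2_deg: azimuth estimates in degrees, measured
    counter-clockwise from the positive x axis.
    Returns the (x, y) intersection, or None if the bearings are parallel.
    """
    d1 = (math.cos(math.radians(az1_deg)), math.sin(math.radians(az1_deg)))
    d2 = (math.cos(math.radians(az2_deg)), math.sin(math.radians(az2_deg)))
    # Solve p1 + t1 * d1 = p2 + t2 * d2 for t1 using 2D cross products.
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < eps:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (dx * d2[1] - dy * d2[0]) / denom
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])


# Arrays 2 m apart on the x axis; bearings of 45 and 135 degrees
# cross at the point (1, 1).
location = triangulate((0.0, 0.0), 45.0, (2.0, 0.0), 135.0)
```

With noisy estimates the two bearings rarely describe the true source exactly, so a practical system would typically smooth or combine several estimates over time; that step is omitted here.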

A psychoacoustic-based analysis-by-synthesis framework that compresses the separated sources and their spatial locations is also proposed. The proposed compression framework is able to compress up to three simultaneously occurring speech sources into one mono mixture signal that can be compressed using a traditional speech codec at 32 kbps. The spatial information of the sources is preserved as side information indicating the source from which each time-frequency component originates. At the reproduction site, by receiving the same mono mixture plus side information, listeners can navigate to their desired listening point and selectively play back the sources of interest. This compression scheme is extended to jointly compress up to 8 audio objects into a stereo mixture signal using an extended multilevel decomposition scheme.
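The principle of a single mixture plus side information that records the origin of each time-frequency component can be sketched as follows. This is a deliberately simplified stand-in and not the psychoacoustic analysis-by-synthesis framework of the thesis: it keeps the largest-magnitude source in each bin, which is an assumed selection rule, and the function names and toy coefficients are hypothetical.

```python
def downmix_with_side_info(sources):
    """Merge several sources into one mixture, one time-frequency bin at a time.

    sources: list of equal-length coefficient lists, one list per source.
    Returns (mixture, origin), where origin[k] is the index of the source
    kept in bin k. Selection by magnitude is an illustrative assumption.
    """
    n_bins = len(sources[0])
    mixture, origin = [], []
    for k in range(n_bins):
        dominant = max(range(len(sources)), key=lambda s: abs(sources[s][k]))
        mixture.append(sources[dominant][k])
        origin.append(dominant)
    return mixture, origin


def extract_source(mixture, origin, source_index):
    """Recover one source from the mixture by keeping only its own bins."""
    return [c if o == source_index else 0.0 for c, o in zip(mixture, origin)]


# Three toy sources with three time-frequency bins each (assumed values).
toy_sources = [
    [1.0, 0.1, 0.0],
    [0.2, 0.9, 0.0],
    [0.0, 0.0, 0.5],
]
mixture, origin = downmix_with_side_info(toy_sources)
second_source = extract_source(mixture, origin, 1)
```

Because every listener receives the same `mixture` and `origin`, each can choose independently which sources to play back and where to render them, which is the property the compression framework relies on for navigation.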

Finally, a hybrid joint source-channel coding scheme based on multiple description coding (MDC) and forward error correction (FEC) is presented. The proposed scheme is first analysed from a theoretical point of view and then applied to speech and audio mixture signals created by the perceptual-based analysis-by-synthesis framework. The theoretical evaluation results indicate that the proposed approach leads to an optimal model for joint source-channel coding based on a combination of MDC and FEC. These models are then applied to protect the speech and audio mixture signals during transmission over practical channels with packet loss. By dynamically adjusting the optimised transmission models, the quality of the speech and audio objects is maintained, as confirmed by evaluations. A robust low-delay soundfield navigation system is thereby ensured for practical transmission channels.

Independent and correlated soundfield navigation frameworks are then presented by combining the techniques proposed in this thesis. The independent soundfield navigation system aims to achieve navigation between independent soundfields. Practical applications include multi-party teleconferencing and surveillance of a large area, where each site can be regarded as one independent soundfield. A correlated soundfield navigation system is also proposed to achieve free listening point navigation within a large soundfield through a limited number of correlated observations of the soundfield. The audience is able to select any listening point within the soundfield.