Doctor of Philosophy
School of Electrical, Computer and Telecommunications Engineering
Cheng, Eva Chung-Wai, Spatial analysis of multiparty speech scenes, Doctor of Philosophy thesis, School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, 2012. http://ro.uow.edu.au/theses/3713
This thesis conducts a series of investigations into the estimation of speaker location cues from multi-party meeting speech recordings. As participants in meetings generally remain stationary, speaker location information is fundamentally useful for higher-level tasks such as steering a microphone array beamformer towards an active speaker, or segmenting meeting speech into each speaker’s period of participation for ‘browsing’ of recordings and speech recognition.
Whilst existing speaker location cues are typically based on Time-Delay Estimation (TDE) from microphone array signals, this thesis proposes the use of level and time/phase-based spatial cues, motivated by the spatial cues utilised in recently standardised Spatial Audio Coding (SAC) paradigms. With the cues implemented using leading and standardised SAC techniques, experiments compared the proposed SAC spatial cues with TDE and found the combination of TDE with SAC level-based cues to be the most accurate for speech segmentation.
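As a rough illustration of the two kinds of cue contrasted above, the following sketch estimates a time delay by brute-force cross-correlation and a level difference as an energy ratio in decibels between two microphone channels. It is a minimal sketch under simplifying assumptions (full-band, single-frame, anechoic), not the SAC-based implementation used in the thesis; the function names `estimate_delay` and `level_difference_db` are hypothetical.

```python
import math

def estimate_delay(x, y, max_lag):
    """Return the lag in samples that maximises the cross-correlation of x and y.

    A positive result means y is a delayed copy of x.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(x)):
            j = i + lag
            if 0 <= j < len(y):
                score += x[i] * y[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def level_difference_db(x, y):
    """Return the energy ratio of channel x to channel y in dB.

    Assumes neither channel is silent (non-zero energy).
    """
    energy_x = sum(v * v for v in x)
    energy_y = sum(v * v for v in y)
    return 10.0 * math.log10(energy_x / energy_y)
```

For a speaker closer to the first microphone, the second channel would be expected to arrive later and at a lower level, so both functions would return positive values.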
As meeting recordings predominantly contain speech content, front-end Linear-Prediction (LP) analysis using theoretical and standardised speech coders is then investigated with single and multichannel LP models. Whilst existing approaches estimate TDE from the Hilbert envelope of single-channel LP speech residuals, this thesis proposes the use of intra- and inter-channel multichannel prediction, and finds spatial cues estimated from the Hilbert envelope of LP residuals to be the most robust against reverberation.
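The single-channel front end referred to above can be sketched as follows: LP coefficients from the autocorrelation method (Levinson-Durbin recursion), the prediction residual, and the Hilbert envelope of a signal computed from its analytic signal. This is an illustrative sketch only, using a naive DFT suited to short frames; it does not reproduce the standardised speech coders or the multichannel prediction studied in the thesis, and the function names are hypothetical.

```python
import cmath
import math

def lp_coefficients(signal, order):
    """Autocorrelation-method LP via Levinson-Durbin.

    Returns a[1..order] such that s[n] is predicted by sum_k a[k] * s[n - k].
    """
    n = len(signal)
    r = [sum(signal[i] * signal[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]
    for m in range(1, order + 1):
        reflection = (r[m] - sum(a[k] * r[m - k] for k in range(1, m))) / err
        updated = a[:]
        updated[m] = reflection
        for k in range(1, m):
            updated[k] = a[k] - reflection * a[m - k]
        a = updated
        err *= 1.0 - reflection * reflection
    return a[1:]

def lp_residual(signal, coeffs):
    """Prediction error e[n] = s[n] - sum_k a[k] * s[n - k]."""
    residual = []
    for n, sample in enumerate(signal):
        predicted = sum(a * signal[n - k]
                        for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        residual.append(sample - predicted)
    return residual

def hilbert_envelope(signal):
    """Magnitude of the analytic signal, via a naive O(N^2) DFT."""
    n = len(signal)
    spectrum = [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)) for k in range(n)]
    for k in range(1, n):
        if n % 2 == 0 and k == n // 2:
            continue              # keep the Nyquist bin unchanged
        if k < (n + 1) // 2:
            spectrum[k] *= 2      # double positive frequencies
        else:
            spectrum[k] = 0       # discard negative frequencies
    analytic = [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                    for k in range(n)) / n for t in range(n)]
    return [abs(z) for z in analytic]
```

A delay estimate in the spirit of the existing approaches would then cross-correlate `hilbert_envelope(lp_residual(...))` computed for each of two channels, rather than the raw waveforms.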
Further experiments investigating the effect of microphone array characteristics found the microphone directivity pattern to significantly influence spatial cue estimation: the omnidirectional and cardioid polar responses optimally suit time/phase and level-based cues, respectively. Switching microphone patterns or employing mixed-pattern arrays is, however, impractical. This thesis proposes the use of the Ambisonic B-format steerable ‘virtual microphone’ to enable the same physical microphones to be simultaneously used for optimal capture of both time/phase and level-based cues. Further, results indicate that steering the virtual microphone in real time to an active speaker, localised using sound intensity techniques, also improves meeting speech capture.
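The virtual microphone idea can be illustrated with a horizontal-only sketch. It assumes the traditional B-format convention in which W carries the source signal scaled by 1/sqrt(2), and a single plane-wave source in anechoic conditions. The pattern parameter (1.0 for omnidirectional, 0.5 for cardioid, 0.0 for figure-of-eight) and all function names are illustrative assumptions rather than the thesis's implementation.

```python
import math

def encode_bformat(signal, azimuth):
    """Encode a plane-wave source at the given azimuth (radians) into W, X, Y."""
    w = [s / math.sqrt(2.0) for s in signal]
    x = [s * math.cos(azimuth) for s in signal]
    y = [s * math.sin(azimuth) for s in signal]
    return w, x, y

def virtual_microphone(w, x, y, steer_azimuth, pattern):
    """Derive a first-order virtual microphone pointed at steer_azimuth.

    pattern: 1.0 = omnidirectional, 0.5 = cardioid, 0.0 = figure-of-eight.
    """
    c, s = math.cos(steer_azimuth), math.sin(steer_azimuth)
    return [pattern * math.sqrt(2.0) * wi + (1.0 - pattern) * (c * xi + s * yi)
            for wi, xi, yi in zip(w, x, y)]

def intensity_azimuth(w, x, y):
    """Estimate source azimuth from time-averaged products of W with X and Y."""
    ix = sum(wi * xi for wi, xi in zip(w, x))
    iy = sum(wi * yi for wi, yi in zip(w, y))
    return math.atan2(iy, ix)
```

Under these assumptions, an omnidirectional and a cardioid output can be derived from the same W, X, Y signals at once, and the cardioid can be pointed at the azimuth returned by `intensity_azimuth`.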
Thus, this thesis contributes to practical spatial meeting speech analysis through investigations of microphone array characteristics, spatial recording techniques, and algorithms for spatial cue estimation and speech processing as utilised in internationally standardised speech and audio coders.