Doctor of Philosophy
School of Electrical, Computer and Telecommunications Engineering - Faculty of Engineering
Smith, Daniel, An analysis of blind signal separation for real time application, PhD thesis, School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, 2006. http://ro.uow.edu.au/theses/659
The ‘cocktail party problem’ is the term commonly used to describe the perceptual problem experienced by a listener who attempts to focus upon a single speaker in a scene of interfering audio and noise sources. Blind Signal Separation (BSS) is a blind identification approach that can offer an adaptive, intelligent solution to the ‘cocktail party problem’. Audio signals can be blindly retrieved from the mixture, that is, without a priori knowledge of the audio signals or the location of the audio sources and sensors. Hence, BSS exhibits greater flexibility than other identification approaches, such as adaptive beamforming, which require precise knowledge of the sensors and/or signal locations.
Speech enhancement is a potential application of BSS. In particular, BSS is potentially useful for the enhancement of speech in interactive voice technologies. However, interactive voice technologies, such as mobile telephony or teleconferencing, require real-time processing (on a frame-by-frame basis), as longer processing delays are considered intolerable for the participants of the two-way communication. Hence, BSS applications with interactive voice technologies require real-time operation of the algorithm.
BSS primarily employs Independent Component Analysis (ICA) as the criterion to separate speech signals. Separation is achieved with ICA when statistical independence between the signal estimates is established. However, investigations in this Thesis into the relationship between the ICA criterion and speech signals indicate that significant statistical dependencies can exist between short frames of speech. Hence, it was found that the ICA criterion could be unreliable for real-time speech separation.
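The short-frame effect can be illustrated with a minimal sketch. It does not reproduce the experiments of this Thesis: it assumes two independent Laplacian noise sequences as a crude stand-in for speech, and the function name and frame lengths are hypothetical choices made for illustration. It shows only the finite-sample part of the problem, namely that the measured correlation between independent signals shrinks roughly as 1/sqrt(N) with frame length N, so short frames can appear dependent even when the sources are not.

```python
import numpy as np

def mean_abs_frame_correlation(s1, s2, frame_len):
    """Average |sample correlation| of two signals over non-overlapping frames."""
    n_frames = len(s1) // frame_len
    a = s1[:n_frames * frame_len].reshape(n_frames, frame_len)
    b = s2[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Remove the per-frame mean before correlating.
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return float(np.mean(np.abs(num / den)))

rng = np.random.default_rng(0)
# Assumption: independent Laplacian noise stands in for two speech sources.
s1 = rng.laplace(size=160_000)
s2 = rng.laplace(size=160_000)

short = mean_abs_frame_correlation(s1, s2, frame_len=32)
long_ = mean_abs_frame_correlation(s1, s2, frame_len=8000)
print(f"mean |corr|, 32-sample frames:   {short:.3f}")
print(f"mean |corr|, 8000-sample frames: {long_:.3f}")
```

Real speech may additionally exhibit genuine short-term dependencies that this synthetic example does not capture.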
This Thesis proposes a number of BSS algorithms that improve real-time separation performance in acoustic environments. In addition, these algorithms are shown to be better equipped to handle the dynamic nature of acoustic environments that contain moving speakers. The algorithms exhibit higher data efficiency, that is, these approaches accurately separate the acoustic scene with smaller amounts of data. The higher data efficiency is the result of BSS models that better represent the underlying characteristics of audio, and in particular speech in the mixture.
Sparse Component Analysis (SCA) algorithms are proposed to exploit the sparse representation of audio in the time-frequency (t-f) domain. Conventional SCA approaches generally place strong constraints upon signals, requiring them to be highly sparse across their entire t-f representation. This constraint is not always satisfied by broadband audio, particularly speech, and hence separation performance is reduced. The SCA algorithms developed in this Thesis relax this constraint, such that signals can be estimated from sparse sub-regions of the t-f representation rather than the complete t-f representation. An SCA algorithm that employs K-means clustering of the t-f space is proposed in order to improve the accuracy of estimation. In addition, an exponential averaging function is used to reduce the influence of poor estimates when separation is performed on a frame-by-frame basis.
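The combination of clustering and exponential averaging described above can be sketched as follows. This is an illustrative toy under strong assumptions, not the algorithm developed in this Thesis: the mixture is real-valued, instantaneous and two-channel, the "t-f points" are synthetic with exactly one source active at each point, and the names `kmeans_1d`, `exp_average` and the smoothing factor `alpha` are hypothetical. Under these assumptions each point lies along one column of the mixing matrix, so clustering the point angles estimates the mixing directions.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=50):
    """Plain Lloyd's k-means on scalars, initialised at evenly spaced quantiles."""
    centres = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - centres[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = values[labels == j].mean()
    # Sorting fixes the cluster order so estimates line up across frames.
    return np.sort(centres)

def exp_average(prev, new, alpha=0.9):
    """Exponential average: a single poor frame estimate has limited influence."""
    return alpha * prev + (1.0 - alpha) * new

rng = np.random.default_rng(1)
true_angles = np.array([0.5, 1.2])  # assumed mixing directions, in radians
A = np.vstack([np.cos(true_angles), np.sin(true_angles)])  # unit-norm columns

smoothed = None
for frame in range(20):
    n_points = 400
    # Assumption: exactly one source is active at each synthetic t-f point.
    active = rng.integers(0, 2, size=n_points)
    S = np.zeros((2, n_points))
    S[active, np.arange(n_points)] = rng.laplace(size=n_points)
    X = A @ S + 0.01 * rng.standard_normal((2, n_points))

    # Keep only the higher-energy points, where the direction is reliable.
    energy = np.sum(X ** 2, axis=0)
    keep = energy > np.median(energy)
    # Fold angles into [0, pi) so that +s and -s map to the same direction.
    theta = np.mod(np.arctan2(X[1, keep], X[0, keep]), np.pi)

    estimate = kmeans_1d(theta, k=2)
    smoothed = estimate if smoothed is None else exp_average(smoothed, estimate)

print("true directions:     ", true_angles)
print("smoothed estimates:  ", smoothed)
```

Sorting the cluster centres is a shortcut that works only for this scalar, two-source case; a practical system would need a more careful way to match estimates between frames.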
Sequential approaches to SCA are also proposed in this Thesis, in which only a sparse sub-region of one signal in the mixture is required for estimation at any one time. This relaxes the sparsity constraints that are placed upon broadband signals in the mixture.
A BSS algorithm that jointly models the production mechanisms of speech (pitch and spectral envelope) is also presented in this Thesis. This produces a more accurate model of speech than existing algorithms that individually model the pitch or spectral envelope. An investigation of this algorithm then determines the parameter set that optimally models the underlying speech signals in the mixture.
Finally, an algorithm is proposed to exploit both the sparse t-f representation of audio and the joint model of speech production. This unified approach compares the SCA and speech production mechanism criteria, switching to the criterion that provides the most accurate estimate. Results indicate that this unified algorithm offers superior data efficiency to its constituent algorithms, and to three benchmark ICA algorithms.