Degree Name

Doctor of Philosophy


School of Electrical, Computer and Telecommunications Engineering


Millions of user-generated video blogs expressing opinions and feelings about products, events, news and many other issues are produced and uploaded to websites, including social networks, every day. These videos are usually produced and edited by non-professional producers using inexpensive equipment, which results in audio tracks containing noisy, spontaneous and unstructured speech. The popularity of such videos and the credibility of their content have increased the demand for suitable automatic analysis techniques, since traditional methods were not designed to address such content and quality challenges. These techniques will help the development of applications such as detecting offensive content and collecting user feedback about products to support manufacturing and production decisions.

This thesis presents a system for speaker-independent keyword spotting (KWS) in continuous speech that can help in automatic analysis, indexing, search and retrieval of such videos by content. The keyword spotting system is based on dynamic time warping (DTW) for matching one spoken example to a test utterance. The system introduces a solution for the speaker dependency problem of the traditional DTW approach. Compared to a hidden Markov model (HMM) approach as traditionally used in automatic speech recognition (ASR), the proposed approach does not require acoustic modelling, training or language modelling. This is of particular relevance to user blog videos since they often contain keywords of interest that have not been adequately represented in a training database (e.g. topical words that are emerging in society, coarse language, product names and personal names). A DTW distance histogram for automatic estimation of similarity thresholds for every keyword-utterance pair is introduced.
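The idea of matching one spoken keyword example against a longer utterance with DTW can be illustrated with a minimal sketch of generic subsequence DTW. This is not the thesis's implementation: the function name `subsequence_dtw`, the Euclidean frame distance, the step pattern and the length normalisation are all assumptions made for illustration, and the speaker-independence solution and histogram-based threshold estimation described above are not reproduced here.

```python
import math

def subsequence_dtw(keyword, utterance):
    """Find the best match of `keyword` anywhere inside `utterance`.

    Both arguments are lists of feature vectors (lists of floats).
    Returns (normalised_cost, end_index), where end_index is the
    utterance frame at which the best-matching segment ends.
    Hypothetical helper for illustration only.
    """
    n, m = len(keyword), len(utterance)
    # First row: the match may start at any utterance frame at no extra cost.
    prev = [math.dist(keyword[0], utterance[j]) for j in range(m)]
    for i in range(1, n):
        cur = [math.inf] * m
        cur[0] = prev[0] + math.dist(keyword[i], utterance[0])
        for j in range(1, m):
            d = math.dist(keyword[i], utterance[j])
            # Allowed steps: vertical, diagonal, horizontal.
            cur[j] = d + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    # The match may also end at any utterance frame.
    end = min(range(m), key=prev.__getitem__)
    return prev[end] / n, end

# Toy one-dimensional "features": the keyword appears at frames 2..4.
keyword = [[0.0], [1.0], [2.0]]
utterance = [[5.0], [5.0], [0.0], [1.0], [2.0], [5.0]]
cost, end = subsequence_dtw(keyword, utterance)
```

In a keyword spotter built along these lines, a detection would be declared when the normalised cost falls below a similarity threshold; how that threshold is estimated per keyword-utterance pair is the role of the distance histogram mentioned above.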

Experiments conducted on a wide range of sentences of clean speech and keywords show that when only a few examples of the keyword are available, the proposed system outperforms an HMM-based approach. Extensive experiments also confirm that the proposed approach is superior to an HMM approach when applied to speech corrupted by two types of noise at different signal-to-noise ratio (SNR) levels. The superior performance of the proposed approach is also maintained following the application of a variety of traditional speech enhancement algorithms as well as a feature enhancement algorithm used within an HMM approach.

Another contribution of this thesis is that it investigates the feasibility of temporal sentiment detection for these videos by analysing the transcription generated by a speech recognition system. Also proposed is a solution to the problem of fixed threshold estimation used for the output probabilities of a Naïve Bayesian classifier applied to the transcription text, together with irrelevant-text filtering to improve the sentiment classification. The proposed solution uses clustering algorithms instead of static thresholds for predicting sentiment from the output probabilities of the Naïve Bayesian classifier. Experiments show that the proposed use of clustering improved the sentiment analysis results compared to a traditional threshold-based approach. The proposed keyword spotting was also used with the text sentiment analysis system to develop a temporal sentiment analysis system suitable for the unconstrained nature of user-generated videos.
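The contrast between static thresholds and clustering of classifier output probabilities can be sketched as follows. The abstract does not say which clustering algorithms were used, so the one-dimensional k-means below, the choice of three clusters, and the names `kmeans_1d` and `label_sentiment` are assumptions for illustration only; the input is taken to be a list of positive-class probabilities produced by some classifier.

```python
def kmeans_1d(values, k=3, iters=50):
    """Plain 1-D k-means with deterministic, evenly spread initial centres.

    Hypothetical helper; not the clustering method of the thesis.
    """
    lo, hi = min(values), max(values)
    centres = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            groups[nearest].append(v)
        # Keep the old centre if a cluster received no points.
        new = [sum(g) / len(g) if g else centres[i]
               for i, g in enumerate(groups)]
        if new == centres:
            break
        centres = new
    return centres

def label_sentiment(probs):
    """Map each positive-class probability to a sentiment label.

    Instead of fixed cut-offs, the cluster centres are learned from the
    probabilities themselves and ordered from low to high.
    """
    centres = sorted(kmeans_1d(probs, k=3))
    names = ["negative", "neutral", "positive"]
    return [names[min(range(3), key=lambda c: abs(p - centres[c]))]
            for p in probs]

# Illustrative probabilities, not data from the thesis.
probs = [0.05, 0.1, 0.45, 0.5, 0.9, 0.95]
labels = label_sentiment(probs)
```

The design point is that the decision boundaries adapt to the distribution of probabilities in each transcript, whereas a static threshold applies the same cut-offs regardless of how the classifier's outputs are spread.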

The proposed KWS system was successfully used to detect swear words and names of products and persons in a test set of online user-generated videos. The proposed temporal sentiment analysis system was also used successfully to identify the points in time at which the user was speaking positively, negatively or neutrally about a product within a video.



Unless otherwise indicated, the views expressed in this thesis are those of the author and do not necessarily represent the views of the University of Wollongong.