Year

2010

Degree Name

Masters by Research

Department

Faculty of Informatics

Abstract

Computer technology has penetrated virtually all aspects of modern life. One of the main reasons for the success of modern computers is their ability to store large amounts of information in central or distributed repositories and, with the development of modern data mining techniques, the ability to find relevant information in those repositories. This has given rise to important forms of data stores such as databases and the World Wide Web (WWW). The typical size of a data store is known to grow exponentially over time, because computer technology greatly simplifies data collection and because storage devices keep becoming larger and cheaper. As a consequence, the discovery or extraction of information in a data store is becoming an increasingly challenging task. For example, finding information on the World Wide Web, the largest known store of digital information, has become possible through efficient search engine designs. The scalability of web search engines is achieved through the distribution and parallelization of the search task, and through algorithms which scale at most linearly with the size of a dataset. Google, the largest of the web search engine companies, solves the problem by partitioning the WWW into chunks and assigning sets of chunks to independent computers called chunk servers. Information contained in these chunks is indexed, and documents are ranked by an efficient algorithm called PageRank. This system makes it possible to find documents containing a given set of keywords within a fraction of a second, despite the size of Google's database, which contains an estimated 6 billion documents.
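
To make the ranking step concrete, the following is a minimal sketch of PageRank-style power iteration on a toy link graph. The graph, damping factor, and convergence threshold are illustrative values of mine, not Google's production setup.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-9, max_iter=100):
    """Rank pages given a binary link matrix (adj[i, j] = 1 iff page i links to j)."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Row-normalise links into transition probabilities; pages with no
    # outgoing links ("dangling" pages) jump to every page uniformly.
    trans = np.where(out_deg[:, None] > 0,
                     adj / np.maximum(out_deg, 1)[:, None],
                     1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * (trans.T @ rank)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Toy four-page web: 0 -> 1,2; 1 -> 2; 2 -> 0; 3 -> 2.
adj = np.array([[0., 1., 1., 0.],
                [0., 0., 1., 0.],
                [1., 0., 0., 0.],
                [0., 0., 1., 0.]])
print(pagerank(adj))  # page 2 ranks highest: most links point to it
```

Each iteration costs one sparse matrix-vector product, which is what keeps the ranking step at most linear in the number of links and therefore amenable to the chunk-wise distribution described above.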

Much more challenging is the task of finding patterns in large sets of data. This exercise is commonly referred to as data mining. Data mining assumes no prior knowledge about the data in a dataset. This is a common situation arising from data logging. Logging is the task of recording events in order to keep a trail (record) of past activities in a system. For example, sales in a supermarket are logged electronically at cash registers, which record the sale of every item. The records stored normally include a name or description of the item, the sales price, the location and time of the sale, and perhaps other fields such as the name or ID number of the salesperson, the identity of the buyer, and so on. Such logged data can be analysed to study a variety of aspects of the system. For example, the analysis of data logs can help to assess the performance of a salesperson, or to discover irregularities caused by a shortcoming of a process in the system or by fraud. It can also serve as a marketing tool, helping to identify successful marketing strategies or to draw conclusions about user behaviours. In other words, data mining often involves the task of discovering knowledge from unlabelled data. Logged data is often referred to as temporal data, to account for potential temporal relationships between entries (i.e. some given sales occurred before some other given sales). The capturing of temporal dependencies is often an integral part of data mining and knowledge discovery tasks. There are very effective methods for discovering knowledge from temporal data. For example, hidden Markov models (HMM) are a popular approach for discovering patterns in temporal data. However, HMMs do not scale favourably with the size of a given dataset and hence are not normally used for data mining on large sets of temporal data.
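
The scaling concern can be made concrete: evaluating one sequence of length T against an N-state HMM with the forward algorithm costs O(T * N^2), and Baum-Welch training repeats many such passes over every sequence in the dataset. Below is a minimal sketch of the scaled forward recursion; the two-state model and the observation sequence are illustrative toy values.

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Scaled forward algorithm for a discrete-output HMM.
    pi: (N,) initial state probabilities, A: (N, N) transition matrix,
    B: (N, M) emission matrix, obs: list of observed symbol indices."""
    alpha = pi * B[:, obs[0]]            # joint prob. of state and first symbol
    log_like = np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, len(obs)):         # T - 1 steps ...
        alpha = (alpha @ A) * B[:, obs[t]]   # ... each costing O(N^2)
        scale = alpha.sum()
        log_like += np.log(scale)        # accumulate in log space ...
        alpha /= scale                   # ... and rescale to avoid underflow
    return log_like

# Illustrative two-state model over a binary output alphabet.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
print(forward_log_likelihood(pi, A, B, [0, 1, 1, 0]))
```

A single pass is cheap, but training multiplies this cost by the number of sequences and the number of EM iterations, which is what makes a monolithic HMM impractical on very large temporal datasets.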

This thesis addresses knowledge discovery in a data mining environment. In this work, we present a new method for discovering patterns in a large set of unlabelled temporal data. Gaussian mixture (GM) models and hidden Markov models (HMM) form the core of our proposed approach: they are employed to cluster temporal data through a novel recursive GM_HMM model. The recursive GM_HMM model works as follows. It starts with the entire dataset and partitions it into clusters of mutually similar patterns using the GM technique; these clusters are as yet unlabelled. It then considers each cluster in turn, takes into account the temporal context of each pattern, and uses an HMM to separate the patterns into labelled sets. Each labelled set is in turn examined to find sub-clusters of similar patterns within it. This process of applying GM followed by HMM is repeated for each cluster, until every cluster contains only similar patterns with the same label. The recursion produces a hierarchical model whose number of levels is determined by the data. It is a data-driven approach, with few a priori assumptions about the number of levels in the hierarchical model or about the unlabelled data itself. It is conceivable that the method can be parallelized, in that an independent machine can take care of a cluster and apply GM and HMM repeatedly to further partition the data. This overcomes one of the major issues in using HMMs in such applications, as each HMM operates on only a partition of the data rather than the entire dataset. We apply this methodology to find patterns in a large-scale database of logged information from the Health Insurance Commission, Australia's main health insurance provider.
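
The abstract describes the recursion only at this high level, so the sketch below is one reading of the control flow, not the author's implementation. Assumptions of mine: each temporal pattern is summarised by a fixed-length feature vector (mean and standard deviation) for the GM step; the HMM step labels a sequence by its dominant decoded hidden state; scikit-learn's GaussianMixture and hmmlearn's GaussianHMM stand in for the thesis's own GM and HMM machinery; and all sizes and thresholds are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from hmmlearn import hmm

def gm_partition(seqs, n_clusters=2):
    """GM step: group similar (still unlabelled) patterns, here via crude
    per-sequence summary features (a hypothetical choice of mine)."""
    feats = np.array([[s.mean(), s.std()] for s in seqs])
    labels = GaussianMixture(n_components=n_clusters,
                             random_state=0).fit_predict(feats)
    return [[s for s, c in zip(seqs, labels) if c == k]
            for k in range(n_clusters)]

def hmm_label(seqs, n_states=2):
    """HMM step: use temporal context to split a cluster into labelled
    sets -- here by each sequence's dominant decoded hidden state."""
    X = np.concatenate(seqs)[:, None]
    model = hmm.GaussianHMM(n_components=n_states, n_iter=20,
                            random_state=0).fit(X, [len(s) for s in seqs])
    out = {k: [] for k in range(n_states)}
    for s in seqs:
        states = model.predict(s[:, None])
        out[int(np.bincount(states).argmax())].append(s)
    return out

def recursive_gm_hmm(seqs, depth=0, min_size=8, max_depth=4):
    """Recurse GM -> HMM. Each call sees only its own partition, so no
    single HMM ever models the full dataset, and distinct clusters could
    be handled on distinct machines."""
    if len(seqs) < min_size or depth >= max_depth:
        return {"leaf": len(seqs)}   # data-driven stop: cluster too small
    tree = {}
    for i, cluster in enumerate(gm_partition(seqs)):
        if len(cluster) < min_size:
            tree[f"c{i}"] = {"leaf": len(cluster)}
            continue
        for lab, subset in hmm_label(cluster).items():
            tree[f"c{i}/s{lab}"] = recursive_gm_hmm(subset, depth + 1,
                                                    min_size, max_depth)
    return tree

# Synthetic demo: two populations of short one-dimensional sequences.
rng = np.random.default_rng(0)
seqs = [rng.normal(m, 1.0, size=30) for m in (0.0, 5.0) for _ in range(20)]
print(recursive_gm_hmm(seqs))
```

The nested dictionary returned here mirrors the hierarchy the abstract describes: its depth is not fixed in advance but emerges from where the stopping test fires on each branch.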

The experimental results on a large set of health insurance data confirm the robustness of the approach in grouping patients with similar medical behaviours into the same cluster. The approach works in an unsupervised fashion, in that no knowledge of the meaning of the patterns that may be embedded in the data is required. The main significance of the approach lies in its ability to label large collections of data in a semi-automatic fashion. This is demonstrated in this thesis through an application to a large-scale real-world problem, viz. the logged dataset of health insurance records kept by the Health Insurance Commission of Australia.

Unless otherwise indicated, the views expressed in this thesis are those of the author and do not necessarily represent the views of the University of Wollongong.