Degree Name

Doctor of Philosophy


School of Computer Science and Software Engineering


This dissertation proposes a novel methodology for knowledge discovery in large data sets, with a focus on unstructured and semi-structured textual data. To our knowledge, extracting knowledge from unstructured and semi-structured textual data is a major unsolved problem in the area of knowledge discovery in databases (KDD). The problem is particularly acute because of the ambiguity and lexical variation of natural language. This thesis seeks to address these problems. First, it proposes a unified methodology, called the Ontology-based Knowledge Discovery in unstructured and semi-structured Text (On-KDT) methodology, to discover knowledge from unstructured and semi-structured texts. This approach leverages semantic information encoded in ontologies to improve the effectiveness of the knowledge extraction processes. Second, the On-KDT methodology is validated in three distinct settings.

In the first setting, we extract scenarios from natural language software requirements. Extracting scenarios from natural language requirements helps improve the efficiency of the requirements process. The requirements for a course registration system are used as the case study. The On-KDT methodology is applied to extract scenarios describing three distinct components of the system.

In the second setting, we extract clinical knowledge from PubMed abstracts. PubMed is a very large collection of biomedical abstracts. To make decisions that bring to bear the latest biomedical research, clinicians would need to read each relevant abstract, but searching and perusing such a huge repository is nearly impossible. PubMed abstracts relating to cervix cancer are used as the case study. The On-KDT methodology is used to extract knowledge concerning clinical trials from these abstracts, and the extracted knowledge is represented in the Clinical Knowledge Markup Language (CKML). This approach has the potential to enable effective use of the relevant (and continually updated) medical knowledge contained in PubMed abstracts, leading to potentially better clinical decisions.

In the third setting, we extract business rules from process model repositories, where process models are encoded as text artefacts. Business rules encode important business constraints (including legislative and regulatory compliance constraints) as well as organizational policies. Organizations are often not sufficiently careful in documenting, encoding and maintaining repositories of their business rules. Instead, business rules are embedded in the design of a variety of operational artefacts, such as business process models. The ability to extract explicit business rules from such artefacts is important for understanding, analysing, leveraging, deploying and maintaining business rules. This dissertation applies the On-KDT methodology to extract the business rules implicit in business process designs.

The empirical results reported in this dissertation provide grounds for confidence that the On-KDT methodology may be effective, not only in the settings described above, but potentially in knowledge extraction from other unstructured or semi-structured data repositories as well.

FoR codes (2008)

080107 Natural Language Processing, 080109 Pattern Recognition and Data Mining



Unless otherwise indicated, the views expressed in this thesis are those of the author and do not necessarily represent the views of the University of Wollongong.