Degree Name

Doctor of Philosophy (PhD)


School of Information Technology and Computer Science - Faculty of Informatics


The information revolution is upon us, and we are increasingly overwhelmed by the exponential growth of information on the Web. This profusion of resources has generated considerable interest in information retrieval research. Traditional information retrieval techniques face new challenges in distributed information environments such as the Internet. One of the more important research issues is information source selection: given a user's information need, select a small number of information sources that are likely to contain most of the potentially useful documents. This thesis investigates new methodologies for information source selection in distributed information environments. We identify the potential selection cases that arise within the context of distributed textual databases and classify the types of textual databases. We analyse the connection between selection cases and database types, and give the necessary constraints for each selection case. These results can serve as guidance for developing effective database selection algorithms. We propose a framework for a topic-based database selection system built on a topic hierarchy. In this framework, distributed textual databases are first categorised hierarchically into a topic hierarchy for convenience of access and management. Second, two-stage database language models are presented to perform topic-based database selection within the topic hierarchy. At the category-specific search stage, a smoothed class-based language model determines the topic categories appropriate to the user query, and a number of databases associated with the chosen topics are selected as candidates for the next search stage. At the term-specific search stage, a smoothed term-based language model finds the databases that are likely to contain the specified query terms.
Finally, the original selection result is further refined by a set of topic-based association rules. These rules, extracted from a collection of previous selection results, capture useful information about the relationships between databases. Keyword-based search treats words as independent of each other and ignores potential semantic relationships between them. To overcome this drawback, we propose a concept-based search mechanism that searches distributed web databases using domain-specific ontologies. A domain-specific ontology provides rich information about the semantic relationships between concepts in a specific topic domain. This information is used to generate concept-related resource descriptions of web databases, to disambiguate queries, and to perform concept-based query matching during database selection.
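
As an illustration of the term-specific search stage only, the following is a minimal sketch of ranking databases with a smoothed term-based language model. The abstract does not state which smoothing method the thesis uses, so linear interpolation with a background model (Jelinek-Mercer smoothing) is assumed here. The function names, the `lam` parameter, and the whitespace tokenisation are hypothetical choices made for this example and are not taken from the thesis.

```python
import math
from collections import Counter


def build_model(documents):
    """Count term frequencies over all documents of one database."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())  # naive whitespace tokenisation (assumption)
    return counts


def score_database(query, db_counts, background_counts, lam=0.5):
    """Log-likelihood of the query under a smoothed database language model.

    Assumed smoothing: lam * P(term | database) + (1 - lam) * P(term | background).
    lam must be below 1 so that unseen terms keep a non-zero probability.
    """
    db_total = sum(db_counts.values())
    bg_total = sum(background_counts.values())
    vocab_size = len(background_counts)
    score = 0.0
    for term in query.lower().split():
        p_db = db_counts[term] / db_total if db_total else 0.0
        # Add-one estimate keeps the background probability strictly positive.
        p_bg = (background_counts[term] + 1) / (bg_total + vocab_size + 1)
        score += math.log(lam * p_db + (1 - lam) * p_bg)
    return score


def rank_databases(query, databases, lam=0.5):
    """Return (database name, score) pairs, best match first."""
    models = {name: build_model(docs) for name, docs in databases.items()}
    background = Counter()
    for counts in models.values():
        background.update(counts)  # background model pools all databases
    scored = [
        (name, score_database(query, model, background, lam))
        for name, model in models.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In a two-stage setting such as the one described above, a function like `rank_databases` would be applied only to the candidate databases that survive the category-specific stage. That first stage, the association-rule refinement, and the ontology-based matching are not modelled in this sketch.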
