Doctor of Philosophy
School of Computer Science and Software Engineering
Zhang, Shujia, Impact sensitive ranking of structured documents, Doctor of Philosophy thesis, School of Computer Science and Software Engineering, University of Wollongong, 2011. http://ro.uow.edu.au/theses/3343
Ranking is an algorithm that defines an ordering among objects. This thesis will address the challenge of ranking structured documents. Traditional ranking methods involve link analysis, relevancy measure, or manual assessment; the aim is to rank the “important” document higher than the less “important” documents. The meaning of the term “important” would need to be defined. For example, one may rank the documents according to their popularity, by considering the number of documents which are linked to it. One may consider the “quality” of the documents, and rank the documents according to this “quality” criterion. Or one may consider the “relevancy” of the documents, and rank documents higher the more relevant they are to a given situation.
However, most of these methods are not taking into consideration the contents of the document. The ”popularity” and the contents of the document are actually inseparable in the study of document ranking according to an ”important” criterion. The content of document as well as its structural properties can be extracted and the structural properties can be expressed as a graph representation, involving nodes (representing word tokens), and links (links among the word tokens). Note that in this representation, the word tokens are merely “symbols”; they are not endowed with meanings, thus avoiding the tedious task of providing word tokens with meanings. The contextual information is inferred from the statistical properties of words which co-occurred in the same order, and in similar context within the given text corpus. This graph representation is called a concept link graph. In this thesis, we propose a novel representation called a graph of graphs (GoGs) representation to describe the structured documents, which may be connected by hyperlinks or citations. A GoGs is a hierarchy of graphs, the top level graph is the concept link graph, and the nodes in this top level concept link graph can be described by other concept link graphs which represent the documents which linked into the top level document, and so on. Such a graph of graphs representation can be very useful to model the inter-linking of documents by encoding the structural information at different levels, such as the relationships among documents, relationships among paragraphs within a document, relationships among words within a paragraph of a document, etc. This representation allows us to include document properties and contents to different depth, so that it can be used in different machine learning applications.
We then propose a novel machine learning method (graph neural networks for graph of graphs) GNN2 which is extended from a supervised learning method, viz., a graph neural network (GNN) . This model is capable of encoding GoGs without the need to pre-process the graph representation into a set of vectorial inputs first, and is applicable to both classification and regression problems.
This thesis will consider the ranking of large scale (hundreds of thousands) corpus of structured documents using the popularity criterion, but with consideration of content information, by deploying the GNN2 methods. The main questions which are studied include: what features and structures can be extracted from the documents; how to combine the extractions to encode the structures; what are the processing methods for such structural documents; and how these methods perform on different classification and clustering problems. Through investigating these questions, we conclude that the GNN2 and GoGs representation are a good combination for solving the ranking problem on a set of inter-linked documents.
This thesis presents a number of findings:
• A novel unsupervised machine learning method called probability mapping graph self organizing map (PMGraphSOM); this is capable of taking into consideration the structure of the documents, in a very similar manner to graph self organizing map technique, and this method is applied to a clustering task. This produced state-of-the-art results on a benchmark document clustering problem.
• The development of a novel approach based on Hidden Markov Models for the encoding of sequences of graphs data structures. The approach allows the encoding of a time series of structured objects.
• Application of the GNN2 algorithm to a classification task involving a large set of documents represented as GoGs.
• A combined system containing both the PMGraphSOM,MLP and GNN applied to a Web spam detection problem. This produced state-of-the-art results on a benchmark problem in Web spam detections.
• GNN2 is applied to encoding the temporal–spatial graphs with an aim to produce ranking on the documents in the original CiteSeer database. The temporal–spatial graph is an instance of GoGs which encodes three levels of relationships:
– Time-sequence: the status of the domain over time.
– Citation links: the referencing relationship between documents.
– Concept link: the contextual relationship among different concepts extracted from the documents.
This is a very useful description of the documents in the structured domain, hence can be trained by GNN2 to infer the underlying ranking function and to predict the ranking of documents with yet unknown ranks. This methodology is applied to the ranking of the documents in the original CiteSeer database. As there are no prior results in this area, it is not possible to assess if our results are good or bad.
In order to show that the proposed methodology is useful to the modeling of temporal–spatial problems involving graphs, we simulate a policemen database, in which we know the temporal–spatial variations. We applied the proposed methodology to this simulated dataset, and found that the results are highly accurate. Thus, this shows that the proposedmethodology can be applied to model temporal– spatial problems involving structures.
This thesis also provides some potential directions for future research:
• The PMGraphSOM is a state-of-the-art unsupervised learning algorithm for structured domains. This can be recommended for problems which require classification of structured representation of objects, e.g., documents, images, videos.
• The graph of graphs representation is a general temporal–spatial representation of structured domains. This can encode inter-linked documents, a set of images, or videos.
• The combined unsupervised and supervised approach can be recommended as a strategy for handling problems of deep learning, e.g., documents, handwritten character recognition, face recognition, human activity recognition, etc. In our case, we recommend a particular combination, viz., the PMGraphSOM and GNN combination. This combination has proven effective in the case of Web spam detection problems.
• The GNN2 algorithm can be applied for supervised learning involving graph of graph situations. If the problem is relatively noise free, in other words, if the noise contamination of the inputs is not high, this approach will yield good results.
This thesis also provide some potential directions for future research:
• Refine the PMGraphSOM for dealing with sparse graphs.
• Extend PMGraphSOM learning algorithm for clustering GoGs.
• Redevelop the code of GNN2 based on a clear software design specification, and to provide an associated user manual.
• Apply the proposed model on benchmark problems which provides ground truth information for the evaluation of impact sensitive ranking in GoGs situations.
• Apply the proposed methodology to a wider range of problems, e.g. video information retrieval problems, image retrieval problems, chemical informatics, etc.
• Develop different training algorithm for GNN2, e.g. map graphs to sequences, map graphs to graphs, etc.