Year

2012

Degree Name

Doctor of Philosophy

Department

School of Mathematics and Applied Statistics

Abstract

As the volume of data in biological databases grows at an increasing rate, researchers have more opportunities to draw new findings from large amounts of complex biological information. To make use of this information, this thesis adopts, develops and applies statistical and computational methods to investigate the connections between gene structure and gene function. The lengths of coding and non-coding regions (with and without introns), together with gene sequences, serve as the main representatives of gene structure. Gene expression level and protein function are treated as functions of gene length and sequence. The investigation addresses two aspects of non-coding regions in different organisms: (i) characterization of the lengths of non-coding regions (the 5’UTR and 3’UTR) and (ii) characterization of the sequences of non-coding DNA (intron sequences).

Highly skewed distributions are observed for both gene length and gene expression level. This thesis proposes a mixture model for the length distribution of coding and non-coding regions. The mixture model yields appropriate parameter estimates for the marginal distribution of the length of gene regions and adequately represents gene length. In addition, quantile regression (QR), a robust tool chosen because of the heavy-tailed distributions, is used to investigate how gene length influences gene expression level. QR provides additional information by estimating a whole family of quantile functions rather than the conditional mean alone. The results reveal a relationship between the lengths of the untranslated regions (3’UTR and 5’UTR) and gene expression level. The quantile characterization shows that 3’UTR length has a larger impact on gene expression level than 5’UTR length.
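As a rough illustration of why quantile-based modeling suits heavy-tailed length data, the sketch below implements the check (pinball) loss that quantile regression minimizes and shows that, for an intercept-only model, the minimizer is a sample quantile. It is not taken from the thesis: the function names and the length values are invented for illustration, and a real analysis would fit covariates with a dedicated QR routine.

```python
def pinball_loss(residual, tau):
    """Check (pinball) loss at quantile level tau for one residual."""
    # Positive residuals are weighted by tau, negative ones by (1 - tau).
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def total_loss(values, q, tau):
    """Sum of pinball losses when every value is predicted by the constant q."""
    return sum(pinball_loss(v - q, tau) for v in values)


def best_constant(values, tau):
    """Intercept-only quantile fit: the loss is piecewise linear in q,
    so a minimizer is always attained at one of the sample points."""
    return min(sorted(values), key=lambda q: total_loss(values, q, tau))


# Hypothetical, right-skewed region lengths (made-up numbers, in nucleotides).
lengths = [120, 150, 180, 200, 260, 310, 450, 900, 2400]

median_fit = best_constant(lengths, 0.5)
upper_fit = best_constant(lengths, 0.75)
print(median_fit, upper_fit)
```

The two fitted values barely react to the single extreme length, whereas the sample mean would be pulled strongly toward it. This is the robustness property that motivates QR for skewed data.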

For gene sequences, this thesis introduces a new heuristic method, the “multi procedure” (MP), for identifying change points in gene sequences modeled as a generalized Bernoulli process. The MP method consists of eight procedures and is implemented in R. Simulations and applications to real data indicate that the proposed method identifies change points more efficiently and gives more accurate estimates of both the number and the positions of change points. The MP method is also applied to intron sequences to identify introns with structural changes. Such structural changes may be relevant to the functions of introns.
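To make the change-point setting concrete, the following sketch scans a nucleotide sequence for a single change point under a generalized Bernoulli (categorical) model, choosing the split that maximizes the segment-wise log-likelihood. It is a plain likelihood scan and not the eight-procedure MP method, whose details the abstract does not give. The function names and the toy sequence are assumptions made for illustration.

```python
from collections import Counter
from math import log


def segment_loglik(counts, n):
    """Maximized categorical log-likelihood of a segment of length n,
    given its symbol counts (symbols with zero count contribute nothing)."""
    return sum(c * log(c / n) for c in counts.values() if c > 0)


def best_single_change_point(seq):
    """Return (k, gain): the split seq[:k] | seq[k:] with the highest
    two-segment log-likelihood, and its gain over the no-change model."""
    n = len(seq)
    left, right = Counter(), Counter(seq)
    best_k, best_ll = None, float("-inf")
    for k in range(1, n):
        symbol = seq[k - 1]
        left[symbol] += 1    # symbol moves into the left segment
        right[symbol] -= 1   # and out of the right segment
        ll = segment_loglik(left, k) + segment_loglik(right, n - k)
        if ll > best_ll:
            best_k, best_ll = k, ll
    no_change_ll = segment_loglik(Counter(seq), n)
    return best_k, best_ll - no_change_ll


# Toy sequence: an A/T-only stretch followed by a G/C-only stretch.
toy_sequence = "AT" * 10 + "GC" * 10
k, gain = best_single_change_point(toy_sequence)
print(k, round(gain, 2))
```

In practice the gain would have to be compared against a threshold or a penalty before declaring a change, and multiple change points require repeated or recursive searches.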
