Doctor of Philosophy
School of Mathematics and Applied Statistics
Samart, Klairung, Analysis of probabilistically linked data, Doctor of Philosophy thesis, School of Mathematics and Applied Statistics, University of Wollongong, 2011. http://ro.uow.edu.au/theses/3513
Probabilistic matching of records from different data sets is often used to create linked data sets for research in health, epidemiology, economics, demography and sociology. This type of matching can produce linkage errors, which in turn can lead to bias and increased variability when standard statistical estimation techniques are applied to the linked data. Recently, an inferential framework for statistical modelling using probabilistically linked data has been defined and used to develop modified estimation methods for regression models, based on the assumption that the correctly linked records are mutually uncorrelated. In practice, however, measurements are usually made on clusters of correlated statistical units, such as people in a family, patients in a hospital or students in a school, and linear mixed models are often used to analyse such data.
In this thesis we show how this inferential framework can be used to obtain unbiased estimates of the regression parameters when fitting a linear mixed model to probabilistically linked data. Furthermore, since estimation of variance components is also an important objective when fitting a mixed model, we develop modifications to standard methods of variance components estimation that account for linkage error. In particular, we focus on three widely used methods of variance components estimation: analysis of variance (ANOVA), maximum likelihood (ML) and restricted maximum likelihood (REML). A simulation study investigates the bias and variability of the parameter estimates obtained by the methods developed in this work. The simulation results indicate that all of these methods perform reasonably well.
We also investigate an application to longitudinal modelling, focusing on fitting linear mixed models to linked longitudinal registers. In this situation more than two registers are linked, and linkage errors can occur across all of them. The results of a simulation study illustrate the performance of this approach and show that, although it is more efficient than the naive method, which ignores linkage errors, some issues still need further investigation and improvement.
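As a rough illustration of the bias that linkage errors can introduce, the sketch below simulates a simple linear regression in which a fraction of the responses are attached to the wrong records, then compares the naive least-squares fit with a ratio-type adjustment. It is not the method developed in the thesis: it treats the records as uncorrelated, assumes a single block with an exchangeable linkage error model, and assumes the probability of a correct link (`lam`) is known. The function name and all parameter values are hypothetical.

```python
import numpy as np


def simulate_linked_regression(n=5000, lam=0.8, beta=(1.0, 2.0), seed=0):
    """Compare naive and linkage-adjusted OLS on simulated linked data.

    Assumptions (illustrative only): one block of n records, each record
    correctly linked with known probability lam, and mislinked responses
    swapped among the mislinked records.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    y = X @ np.asarray(beta) + rng.normal(scale=0.5, size=n)

    # Pick the records to mislink and rotate their responses by one position
    # in a random order, so every selected record receives another record's y.
    wrong = np.flatnonzero(rng.random(n) > lam)
    rng.shuffle(wrong)
    y_star = y.copy()
    if wrong.size > 1:
        y_star[wrong] = y[np.roll(wrong, 1)]

    # Naive fit: ordinary least squares that ignores the linkage errors.
    naive = np.linalg.lstsq(X, y_star, rcond=None)[0]

    # Adjusted fit: under the exchangeable model E[y_star | X] = T X beta,
    # where T has lam on the diagonal and (1 - lam) / (n - 1) elsewhere.
    # T @ X is computed without forming the n-by-n matrix T.
    off = (1.0 - lam) / (n - 1)
    TX = (lam - off) * X + off * X.sum(axis=0)
    adjusted = np.linalg.solve(X.T @ TX, X.T @ y_star)
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = simulate_linked_regression()
    print("naive    (intercept, slope):", naive)
    print("adjusted (intercept, slope):", adjusted)
```

Under these assumptions the naive slope is pulled towards zero by roughly the factor `lam`, while the adjusted slope stays close to the value used to generate the data. Extending the idea to clustered data and variance components requires the mixed-model methods the abstract describes.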