University of Wollongong

Analysis of probabilistically linked data

thesis
posted on 2024-11-11, 22:23 authored by Klairung Samart
Probabilistic matching of records from different data sets is often used to create linked data sets for research in health, epidemiology, economics, demography and sociology. This type of matching can introduce linkage errors, which in turn can lead to bias and increased variability when standard statistical estimation techniques are applied to the linked data. Recently, an inferential framework for statistical modelling using probabilistically linked data has been defined and then used to develop modified estimation methods for regression models, based on the assumption that the correctly linked records are mutually uncorrelated. In real life, however, measurements are usually made on clusters of correlated statistical units, such as people in a family, patients in a hospital or students in a school, and linear mixed models are often used to analyse such data. In this thesis we show how this inferential framework can be used to develop unbiased regression parameter estimates when fitting a linear mixed model to probabilistically linked data. Furthermore, since estimation of variance components is also an important objective when fitting a mixed model, we develop appropriate modifications to standard methods of variance components estimation in order to account for linkage error. In particular, we focus on three widely used methods of variance components estimation: analysis of variance (ANOVA), maximum likelihood (ML) and restricted maximum likelihood (REML). A simulation study investigates the bias and variability of the parameter estimates obtained by the methods developed in this work. The simulation results indicate that all methods developed here perform reasonably well. An application to longitudinal modelling is then investigated, focusing on fitting linear mixed models to linked longitudinal registers. That is, more than two registers are linked, and linkage errors occur across all of the registers. The results from a simulation study illustrate the performance of this approach and show that, although efficiency improves compared with the naive method, which ignores the linkage errors, some issues still need further investigation and improvement.
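
The abstract's claim that linkage errors bias standard estimators can be illustrated with a small simulation. The sketch below is not the thesis's method or its simulation design. It assumes a hypothetical linkage-error mechanism in which a random fraction of records have their response values shuffled among themselves, and it uses simple linear regression rather than a linear mixed model. The names `ols_slope`, `simulate`, `beta` and `error_rate` are illustrative only.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=5000, beta=2.0, error_rate=0.2, seed=1):
    """Return (slope with correct links, naive slope with linkage errors)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [1.0 + beta * xi + rng.gauss(0, 1) for xi in x]

    # Assumed (hypothetical) error mechanism: each record is mislinked with
    # probability error_rate, and the mislinked records swap y values
    # among themselves at random.
    wrong = [i for i in range(n) if rng.random() < error_rate]
    shuffled = wrong[:]
    rng.shuffle(shuffled)

    y_linked = y[:]
    for i, j in zip(wrong, shuffled):
        y_linked[i] = y[j]

    return ols_slope(x, y), ols_slope(x, y_linked)


slope_true_links, slope_naive = simulate()
print("slope with correct links:", round(slope_true_links, 3))
print("naive slope with linkage errors:", round(slope_naive, 3))
```

Under this assumed mechanism, a mislinked response carries essentially no information about its paired covariate, so the naive slope should shrink toward roughly `beta * (1 - error_rate)`. This is the kind of bias that modified estimators for linked data aim to correct.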

History

Year

2011

Thesis type

  • Doctoral thesis

Faculty/School

School of Mathematics and Applied Statistics

Language

English

Disclaimer

Unless otherwise indicated, the views expressed in this thesis are those of the author and do not necessarily represent the views of the University of Wollongong.
