## Year

2013

## Degree Name

Doctor of Philosophy

## Department

School of Mathematics and Applied Statistics

## Recommended Citation

Hindmarsh, Diane M., Small area estimation for health surveys, Doctor of Philosophy thesis, School of Mathematics and Applied Statistics, University of Wollongong, 2013. https://ro.uow.edu.au/theses/3746

## Abstract

This thesis develops and evaluates Small Area Estimation (SAE) methods to provide estimates of prevalence rates of health risk factors for Local Government Areas (LGAs) in NSW using data from the NSW Population Health Survey. All outcome variables considered are dichotomous. The aim is to produce estimates that are an improvement over direct survey estimates based on a single year of data as well as over direct estimates based on data aggregated over seven years. Modified direct estimators, conventional synthetic and composite estimators, Empirical Best Linear Unbiased Predictors (EBLUP), complex synthetic estimators using a linear model, Empirical Best Predictors (EBP) and associated synthetic estimators based on the logistic model are assessed initially for the outcome variable ‘Current Smoking’ using 2006 survey data. All estimates are produced using SAS Version 9.2.

Model-based SAE methods using regression models and area level random effects are found to be the most effective approach to create unbiased LGA-level estimates for ‘Current Smoking’, and are successful in creating estimates with face-validity when based on a single year of data. Of the other methods assessed, neither LGA-based weighting nor generalised regression (GREG) estimates are shown to improve the direct LGA-level estimates sufficiently for them to be more useful than the current direct estimates. Conventional synthetic and composite estimators produce over-smoothed LGA-level estimates. In addition, the naïve estimates of the mean square error (MSE) of these estimators underestimate the bias, and estimation of the root mean square error (RMSE) is difficult.
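As a rough illustration of why composite estimators can over-smooth, the sketch below uses the textbook form of a composite estimator: a weighted average of an area's direct estimate and its synthetic estimate. The function name, the weight and the prevalence values are hypothetical and are not taken from the thesis, which produced its estimates in SAS.

```python
def composite_estimate(direct, synthetic, weight):
    """Weighted average of a direct and a synthetic estimate.

    weight is the share given to the direct estimate (0 to 1).
    A small weight pulls every area towards its synthetic value,
    which is one way over-smoothing can arise.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return weight * direct + (1.0 - weight) * synthetic


# Hypothetical smoking prevalence for one LGA (illustrative only).
direct = 0.30      # direct survey estimate
synthetic = 0.20   # synthetic estimate borrowed from a larger domain
estimate = composite_estimate(direct, synthetic, weight=0.25)
```

With a weight of 0.25 the composite sits much closer to the synthetic value than to the direct estimate, so differences between areas shrink.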

The EBLUP and EBP estimates and their associated synthetic counterparts are created and evaluated for four key outcome variables (‘Current Smoking’, ‘Risk Alcohol Consumption’, ‘Overweight or obese’ and ‘Have difficulties getting health care when needed’), by sex, for survey years 2006, 2007 and 2008. These outcome variables differ in their overall prevalence rate and level of intraclass correlation. Included in the evaluation process is an assessment of the effect of covariate specification. The model-building process used to create specific and more general covariate specifications is discussed as part of the model development process, with six covariate specifications assessed for each sex-outcome-year model. The four outcomes differ in the most appropriate covariate specification. Estimates of root mean square error (RMSE) using output from the relevant SAS procedures are also compared with estimates of RMSE using parametric bootstrapping.

Logistic models are recommended for estimation purposes because, although the logistic and linear estimates are very similar, for outcomes with a prevalence of less than 30% the linear model underestimates the RMSE by up to 50%. Including the LGA level random effect in the model does not affect the estimates markedly but avoids overstating the precision of the modelled estimates. Bootstrapped estimates of RMSE avoid the underestimation seen in the SAS-based RMSE for out-of-sample areas; for the remaining areas, the bootstrapped estimates are relatively similar to those output from the SAS procedure.

The resultant model-based estimates are assessed for bias against design-unbiased direct estimates based on the same year of data. The RMSE and relative root mean square error (RRMSE) are compared against the standard error and relative standard error respectively of direct estimates based on seven years of data, as well as single years of data. Other comparisons include aggregating model-based estimates to the Health Area level and to the quintiles of socioeconomic disadvantage and comparing with direct estimates at these levels. Most of the EBP estimates have an estimated RRMSE of less than 25% and an RMSE of less than 10%, and those that do not still show considerable improvement over direct estimates based on a single year of data. They are also an improvement over the estimates based on seven years of data and have the advantage of being based on the current year of data rather than an average over an extended period of time. Hence the EBP estimates based on a single year of data can provide useful estimates at the LGA level.
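The precision criteria quoted above can be illustrated with a minimal sketch. It assumes the usual definition of RRMSE as the RMSE divided by the point estimate, and it reads the "RMSE of less than 10%" threshold as 0.10 on the proportion scale. The function names and the example values are hypothetical and do not come from the thesis.

```python
def rrmse(estimate, rmse):
    """Relative root mean square error: RMSE divided by the estimate."""
    if estimate <= 0:
        raise ValueError("estimate must be positive")
    return rmse / estimate


def meets_thresholds(estimate, rmse, max_rrmse=0.25, max_rmse=0.10):
    """Check an estimate against the RRMSE < 25% and RMSE < 10% criteria."""
    return rrmse(estimate, rmse) < max_rrmse and rmse < max_rmse


# Hypothetical (prevalence estimate, RMSE) pairs on the proportion scale.
lga_results = {
    "LGA A": (0.20, 0.03),  # RRMSE of about 15%
    "LGA B": (0.10, 0.04),  # RRMSE of about 40%
}

flags = {name: meets_thresholds(est, err) for name, (est, err) in lga_results.items()}
```

In this made-up example the first area passes both criteria. The second fails on RRMSE even though its RMSE is small in absolute terms, which is why low-prevalence outcomes are harder to estimate to a given relative precision.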

## FoR codes (2008)

010401 Applied Statistics, 010402 Biostatistics, 111706 Epidemiology, 111711 Health Information Systems (incl. Surveillance)

**Unless otherwise indicated, the views expressed in this thesis are those of the author and do not necessarily represent the views of the University of Wollongong.**