Centre for Statistical & Survey Methodology Working Paper Series
Copyright (c) 2016 University of Wollongong. All rights reserved.
http://ro.uow.edu.au/cssmwp
Recent documents in Centre for Statistical & Survey Methodology Working Paper Series
Mon, 02 May 2016 15:06:19 PDT

Environmental informatics: uncertainty quantification in the environmental sciences
http://ro.uow.edu.au/cssmwp/124
Thu, 03 Oct 2013 19:59:49 PDT
This exposition of environmental informatics is an attempt to bring current thinking about uncertainty quantification to the environmental sciences. Environmental informatics is a term that I first heard being used by Bronwyn Harch of Australia's Commonwealth Scientific and Industrial Research Organisation to describe a research theme within her organisation. Just as bioinformatics has grown and includes biostatistics as a sub-discipline, environmental informatics, or EI, has the potential to be much broader than classical environmental statistics.
Noel Cressie

Multi-species SIR models from a dynamical Bayesian perspective
http://ro.uow.edu.au/cssmwp/123
Thu, 03 Oct 2013 19:59:47 PDT
Multi-species compartment epidemic models, such as the multi-species SIR (susceptible-infectious-recovered) model, are extensions of classic SIR models, used to explore the transient dynamics of pathogens that infect multiple hosts in a large population. In this article, we propose a dynamical Bayesian hierarchical SIR (HSIR) model to capture the stochastic or random nature of an epidemic process in a multi-species SIR (with recovered becoming susceptible again) dynamical setting, under hidden mass-balance constraints. We call this an MSIRB model. Different from a classic multi-species SIR model (which we call MSIRc), our approach imposes mass balance on the underlying hidden process.
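The hidden mass-balance constraint mentioned above can be illustrated with a minimal deterministic multi-species SIRS sketch. This is not the authors' Bayesian hierarchical MSIRB model; the two-species setup and all rate values are hypothetical.

```python
import numpy as np

def simulate_msirs(beta, gamma, xi, S0, I0, R0, steps, dt=0.1):
    """Deterministic multi-species SIRS: beta[i, j] is the rate at which
    species j infects species i; gamma and xi are per-species recovery
    and loss-of-immunity rates (recovered become susceptible again)."""
    S, I, R = (np.array(x, dtype=float) for x in (S0, I0, R0))
    for _ in range(steps):
        force = beta @ I          # cross-species force of infection
        new_inf = force * S
        dS = -new_inf + xi * R
        dI = new_inf - gamma * I
        dR = gamma * I - xi * R
        S, I, R = S + dt * dS, I + dt * dI, R + dt * dR
    return S, I, R

# Two hypothetical species with asymmetric cross-infection rates.
beta = np.array([[0.5, 0.1], [0.2, 0.4]])
S, I, R = simulate_msirs(beta, gamma=np.array([0.2, 0.3]),
                         xi=np.array([0.05, 0.05]),
                         S0=[0.99, 0.99], I0=[0.01, 0.01], R0=[0.0, 0.0],
                         steps=1000)
```

Because dS + dI + dR = 0 at every step, the compartments of each species always sum to one; this is the mass-balance property that the hierarchical model imposes on the latent epidemic process.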
Lili Zhuang

A Bayesian multivariate analysis of children’s exposure to pesticides
http://ro.uow.edu.au/cssmwp/122
Thu, 03 Oct 2013 19:59:45 PDT
In this article, we present a multivariate Bayesian analysis of the relationships, in preschool children, between environmental pathways of exposure to a non-persistent pesticide, chlorpyrifos (CPF), and its corresponding biomarker in urine, trichloropyridinol (TCP). The analysis uses the three years of data from the Pesticide Exposures of Preschool Children Over Time (PEPCOT) study. Hierarchical Bayesian analysis of pathways of exposure has gained popularity in recent years, where missing and censored data are modeled, and measurement and regression errors are accounted for in a single hierarchical statistical model. Here we consider multivariate pathways, where CPF and its metabolite TCP are modeled jointly in the environmental media. In this article, we analyze each of the three years of the study, focusing on the within-year multivariate nature of the PEPCOT data set. We present the results in a way that allows for an easy comparison of the fitted parameters over time.
Noel Cressie

Hierarchical statistical modeling of big spatial datasets using the exponential family of distributions
http://ro.uow.edu.au/cssmwp/121
Thu, 03 Oct 2013 19:59:42 PDT
Big spatial datasets are very common in scientific problems, such as those involving remote sensing of the earth by satellites, climate-model output, small-area samples from national surveys, and so forth. In this article, our interest lies primarily in very large, non-Gaussian datasets. We consider a hierarchical statistical model consisting of a conditional exponential family model for the data and an underlying (hidden) geostatistical process for some transformation of the (conditional) mean of the data model. Within this hierarchical model, dimension reduction is achieved by modeling the geostatistical process as a linear combination of a fixed number of spatial basis functions, which results in substantial computational speedups. These models do not rely on specifying a spatial-weights matrix, and no assumptions of homogeneity, stationarity, or isotropy are made. Our approach to inference using these models is empirical-Bayesian in nature. We develop maximum likelihood (ML) estimates of the unknown parameters using Laplace approximations in an expectation-maximization (EM) algorithm. We illustrate the performance of the resulting empirical hierarchical model using a simulation study. We also apply our methodology to analyze a remote sensing dataset of aerosol optical depth.
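The dimension-reduction step described above can be sketched in a few lines: represent the hidden process as a linear combination of a fixed number r of spatial basis functions, so the implied covariance matrix has rank at most r regardless of the number of observations. This is an illustrative one-dimensional sketch using bisquare basis functions (a common choice for such low-rank models), not the authors' implementation; all sizes and parameters are invented.

```python
import numpy as np

def bisquare_basis(locs, centers, radius):
    """Evaluate r bisquare basis functions at n locations: n x r matrix."""
    d = np.abs(locs[:, None] - centers[None, :])
    return np.where(d < radius, (1.0 - (d / radius) ** 2) ** 2, 0.0)

rng = np.random.default_rng(0)
n, r = 2000, 15                      # many data points, few basis functions
locs = np.sort(rng.uniform(0, 1, n))
centers = np.linspace(0, 1, r)
Phi = bisquare_basis(locs, centers, radius=0.15)

K = np.eye(r)                        # r x r covariance of the basis weights
eta = rng.multivariate_normal(np.zeros(r), K)
latent = Phi @ eta                   # hidden process on a transformed mean scale
y = rng.poisson(np.exp(latent))      # conditional exponential-family data model

# The implied n x n covariance Phi K Phi' inherits the rank of Phi,
# which is at most r, so computations scale with r rather than with n.
rank = np.linalg.matrix_rank(Phi)
```

Prediction and likelihood evaluation then require inverting only r x r matrices, which is the source of the computational speedups the abstract describes.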
Aritra Sengupta

Spatial Fay-Herriot models for small area estimation with functional covariates
http://ro.uow.edu.au/cssmwp/120
Thu, 03 Oct 2013 19:59:40 PDT
The Fay-Herriot (FH) model is widely used in small area estimation and uses auxiliary information to reduce estimation variance at undersampled locations. We extend the type of covariate information used in the FH model to include functional covariates, such as social-media search loads or remote-sensing images (e.g., in crop-yield surveys). The inclusion of these functional covariates is facilitated through a two-stage dimension-reduction approach: a Karhunen-Loève expansion followed by stochastic search variable selection. Additionally, the importance of modeling spatial autocorrelation has recently been recognized in the FH model; our model utilizes the conditional autoregressive class of spatial models in addition to functional covariates. We demonstrate the effectiveness of our approach through simulation and through the analysis of American Community Survey data. We use Google Trends search curves as functional covariates to analyze changes in the rate of Spanish-speaking households in the eastern half of the United States.
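Stage one of the two-stage reduction, the empirical Karhunen-Loève expansion, amounts to a functional principal-components decomposition of the observed curves. The sketch below is illustrative only: the curves and dimensions are simulated, and stage two (stochastic search variable selection) is merely indicated.

```python
import numpy as np

rng = np.random.default_rng(1)
m, T = 300, 52            # m areas, weekly search-load curves of length T
t = np.linspace(0, 1, T)
# Hypothetical functional covariates: two smooth components plus noise.
scores_true = rng.normal(size=(m, 2))
curves = (scores_true[:, :1] * np.sin(2 * np.pi * t)
          + scores_true[:, 1:] * np.cos(2 * np.pi * t)
          + 0.1 * rng.normal(size=(m, T)))

# Stage 1: empirical Karhunen-Loeve expansion via eigendecomposition
# of the sample covariance of the centred curves.
mean_curve = curves.mean(axis=0)
centered = curves - mean_curve
cov = centered.T @ centered / m
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# Keep the leading components explaining 95% of the variance.
k = np.searchsorted(np.cumsum(evals) / evals.sum(), 0.95) + 1
kl_scores = centered @ evecs[:, :k]   # low-dimensional covariates for stage 2
# Stage 2 (not shown) would run stochastic search variable selection
# over these k scores inside the spatial Fay-Herriot model.
```

Each curve of length T is thereby replaced by a handful of scores, which is what makes functional covariates usable in an area-level model.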
Aaron Porter

Local spatial-predictor selection
http://ro.uow.edu.au/cssmwp/119
Thu, 03 Oct 2013 19:59:38 PDT
Consider the problem of spatial prediction of a random process from a spatial dataset. Global spatial-predictor selection provides a way to choose a single spatial predictor from a number of competing predictors. Instead, we consider local spatial-predictor selection at each spatial location in the domain of interest. This results in a hybrid predictor that could be considered global, since it takes the form of a combination of local predictors; we call this the locally selected spatial predictor. We pursue this idea here using the (empirical) deviance information as our criterion for (global and local) predictor selection. A small simulation study assesses the performance of this combined predictor relative to the individual predictors.
Jonathan Bradley

Statistical modeling of MODIS cloud data using the spatial random effects model
http://ro.uow.edu.au/cssmwp/118
Thu, 03 Oct 2013 19:59:36 PDT
Remote sensing of the earth by satellites yields datasets that can be massive in size. To overcome computational challenges, we make use of the reduced-rank Spatial Random Effects (SRE) model in our statistical analysis of cloud mask data from NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) instrument on board NASA’s Terra satellite, launched in December 1999. A set of retrieval algorithms has been developed by members of the MODIS atmospheric team for detecting clouds. Clouds play an important role in climate studies, and hence an accurate quantification of the spatial distribution of clouds is necessary. In this paper, we build a statistical model for the underlying clear-sky-probability (or conversely, the cloud-probability) process, and we quantify the uncertainty in our predictions. We consider a hierarchical statistical model for analyzing the cloud data, where we postulate a hidden process for the probability of clear sky that makes use of the SRE model. Its advantages are considerable: It can represent many types of spatial behavior, it permits fast computations when datasets are very large, and it has attractive change-of-support properties.
Aritra Sengupta

Potential gains from using unit level cost information in a model-assisted framework
http://ro.uow.edu.au/cssmwp/117
Thu, 03 Oct 2013 19:59:34 PDT
In developing the sample design for a survey we attempt to produce a good design for the funds available. Information on costs can be used to develop sample designs that minimise the sampling variance of an estimator of total for fixed costs. Improvements in survey management systems mean that it is now sometimes possible to estimate the cost of including each unit in the sample. This paper develops relatively simple approaches to determine whether the potential gains arising from using this unit level cost information are likely to be of practical use. It is shown that the key factor is the coefficient of variation of the costs relative to the coefficient of variation of the relative error on the estimated cost coefficients.
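For the stratum-level version of this problem, the classical result allocates sample in proportion to N_h S_h / sqrt(c_h), and the payoff from using cost information can be gauged by comparing estimator variances under a fixed budget. The sketch below uses this textbook allocation with invented stratum sizes, standard deviations, and unit costs; it is not the paper's unit-level method.

```python
import numpy as np

def cost_optimal_allocation(N, S, c, budget):
    """Stratified allocation minimising the variance of the estimated
    total subject to sum(n_h * c_h) = budget; the classical solution
    takes n_h proportional to N_h * S_h / sqrt(c_h)."""
    w = N * S / np.sqrt(c)
    return budget * w / np.sum(N * S * np.sqrt(c))

def variance_of_total(N, S, n):
    """Variance of the stratified estimator of a total
    (with-replacement approximation, no finite-population correction)."""
    return np.sum(N ** 2 * S ** 2 / n)

N = np.array([5000.0, 3000.0, 2000.0])   # stratum sizes (hypothetical)
S = np.array([10.0, 25.0, 40.0])         # stratum standard deviations
c = np.array([1.0, 4.0, 9.0])            # per-unit data-collection costs

budget = 2000.0
n_opt = cost_optimal_allocation(N, S, c, budget)
# Baseline for comparison: spend an equal share of the budget per stratum.
n_eq = (budget / len(N)) / c

gain = variance_of_total(N, S, n_eq) / variance_of_total(N, S, n_opt)
# gain > 1 quantifies the payoff from using the cost information; it
# grows with the coefficient of variation of the costs across strata.
```

Both allocations exhaust the same budget, so the variance ratio isolates the effect of the cost information alone.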
David Steel

Bootstrap p-values for Cochran's Q, Stuart and Bowker tests
http://ro.uow.edu.au/cssmwp/116
Thu, 03 Oct 2013 19:59:33 PDT
Cochran’s Q assesses treatment differences in randomized block designs with binary data. We suggest using bootstrap p-values rather than p-values based on the chi-squared distribution for tests based on Q. These chi-squared p-values for Q are the only ones usually given in statistical software and can be inaccurate. The same approach allows improved p-values to be given for sparse two-way cross-classification data.
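A simple resampling scheme that enforces the null of no treatment differences is to permute treatment labels independently within each block. The sketch below computes Cochran's Q and such a resampling p-value; the data are invented, and the paper's bootstrap scheme may differ in detail.

```python
import numpy as np

def cochran_q(x):
    """Cochran's Q for an n x k table of binary responses
    (rows = blocks, columns = treatments)."""
    n, k = x.shape
    col = x.sum(axis=0)
    row = x.sum(axis=1)
    num = k * (k - 1) * np.sum((col - col.mean()) ** 2)
    den = k * row.sum() - np.sum(row ** 2)
    return num / den

def resampling_pvalue(x, n_resamples=2000, seed=0):
    """P-value from permuting treatment labels within each block,
    which preserves row totals and hence the denominator of Q."""
    rng = np.random.default_rng(seed)
    q_obs = cochran_q(x)
    count = sum(
        cochran_q(np.array([rng.permutation(row) for row in x])) >= q_obs
        for _ in range(n_resamples))
    return (count + 1) / (n_resamples + 1)

# Hypothetical data: 8 blocks, 3 treatments, treatment 1 favoured.
x = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 0], [1, 1, 1],
              [1, 0, 0], [0, 0, 0], [1, 1, 0], [1, 0, 0]])
p = resampling_pvalue(x)
```

Unlike the chi-squared approximation, the resampling reference distribution is exact up to Monte Carlo error, which matters for the small or sparse tables the abstract has in mind.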
D Best

Disease mapping via negative binomial M-quantile regression
http://ro.uow.edu.au/cssmwp/115
Thu, 03 Oct 2013 19:59:31 PDT
A new approach to ecological regression for disease mapping is introduced, based on semiparametric M-quantile regression models. In particular, we define a Negative Binomial M-quantile model as an alternative to Empirical Bayes or fully Bayesian approaches to disease mapping. The area-level covariates used in ecological regression are usually measured with error, and the proposed M-quantile modelling approach is easily made robust against outlying data in the model covariates. Differences between the M-quantile model and the usual random effects models are discussed, and these alternative approaches are compared using the well-known Scottish lip cancer data and a simulation experiment. The lip cancer data example shows that the Negative Binomial M-quantile model confirms results obtained by other methods, but also seems to have less shrinkage than the Empirical Bayes method, so reducing the problem of oversmoothing. The simulation experiment suggests that the new model leads to estimates with smaller mean square error. We also show how the Negative Binomial M-quantile model can be extended to account for spatial correlation between areas using a Geographically Weighted Regression strategy.
Ray Chambers

Poisson M-quantile regression for small area estimation
http://ro.uow.edu.au/cssmwp/114
Thu, 03 Oct 2013 19:59:29 PDT
A new approach to model-based small area estimation for count outcomes is proposed and used for estimating the average number of visits to physicians for Health Districts in Central Italy. The proposed small area predictor is based on defining a Poisson M-quantile model by extending the ideas in Cantoni & Ronchetti (2001) and Chambers & Tzavidis (2006). This predictor can be viewed as a semi-parametric outlier robust alternative to the more commonly used plug-in Empirical Best Predictor that is based on a Poisson generalised linear mixed model with Gaussian random effects. Results from the real data application and from a simulation experiment confirm that the proposed small area predictor has good robustness properties and can be more efficient than alternative small area predictors.
Nikos Tzavidis

Characteristics of empirical zoning distributions for small area health data
http://ro.uow.edu.au/cssmwp/113
Thu, 03 Oct 2013 19:59:27 PDT
Many studies of health utilise a multilevel modelling framework; if individual-level data are not available, they use ecological inference to obtain individual-level parameter estimates from area-level data summaries, resulting in biased parameter estimates and increased variance. For these studies, the modifiable areal unit problem means that the scale of the analysis and the zones used to aggregate the data affect the amount and direction of the bias and the increase in variance. To investigate the effects of scale and zoning, in this paper the distribution of the parameter estimates over many sets of zones at the same scale (the zoning distribution) is obtained for parameter estimates from an ecological model at multiple scales of analysis. The distributions are typically symmetrical and unimodal and can be considered to follow a normal distribution. The average parameter estimate (ecological average) displays systematic variation with scale and is related to 1/√M, where M is the number of areas. The variance of the distribution is related to the average number of observations in the areas. The implications of creating and using a zoning distribution are wide ranging, as it allows the estimates for a given set of zones at the same or a different scale to be compared and assessed.
Sandy Burden

Bias reduction for correlated linkage error
http://ro.uow.edu.au/cssmwp/112
Thu, 03 Oct 2013 19:59:25 PDT
Linked data sets are often multi-linked, i.e. they are created by matching records from three or more data sources. In such cases, probability-based methods for record linkage may lead to correlated linkage errors. Furthermore, it is often the case that not all records can be linked, due to the linking procedure not being able to find suitable matches in at least one of the data sources. This can be simply because the data source is a sample, and so does not contain the requisite matching records. More generally, however, the probability algorithm used to create the matches may not be able to find another record that meets the minimum criterion for matching. In this paper we develop methods for carrying out regression analysis using multilinked data that allow for both correlated linkage error as well as unlinked records. We also investigate the role of auxiliary information in this process, focussing on the situation where marginal distribution information from the data sets being linked is available. Our simulation results show that recently published bias reduction methods based on an assumption of independent linkage errors can lead to insufficient bias correction in the correlated case, and that a modified approach which allows for correlated linkage errors is superior. We also show that auxiliary marginal information about the data sets being linked can help further reduce the bias due to both non-linkage and linkage errors.
Gunky Kim

Maximum likelihood logistic regression with auxiliary information for probabilistically linked data
http://ro.uow.edu.au/cssmwp/111
Thu, 03 Oct 2013 19:59:23 PDT
Despite the huge potential benefits, any analysis of probabilistically linked data cannot avoid the problem of linkage errors. These errors occur when probability-based methods are used to link or match records from two or more distinct data sets corresponding to the same target population, and they can lead to biased analytical decisions when they are ignored. Previous studies aimed at resolving this problem have assumed that the analyst has access to all the information used in the data linkage process. In practice, however, most analysts are secondary analysts, with only partial access to information about the linkage error structure. As a consequence, our previous research has focused on using an estimating equations approach to develop bias correction methods for secondary analysis of probabilistically linked data. In this paper we extend this approach to maximum likelihood estimation, using the missing information principle to accommodate the more realistic scenario of dependent linkage errors in both linear and logistic regression settings. We also develop the maximum likelihood solution when population auxiliary information in the form of summary statistics is available, and show that the main advantage of including this information is the correction of small-sample bias.
Gunky Kim

Correction factors for unbiased, efficient estimation and prediction of biomass from log-log allometric models
http://ro.uow.edu.au/cssmwp/110
Thu, 03 Oct 2013 19:59:21 PDT
Allometric relationships are commonly used to estimate average biomass of trees of a particular size and to predict biomass of individual trees based on an easily measured covariate such as stem diameter. They are typically power relationships which, for the purpose of data fitting, are transformed using natural logarithms to convert the model to its linear equivalent. Implementation of these equations to estimate the relationships and to predict biomass of new trees on the natural (i.e., actual) scale requires back-transforming the logarithmic predictions. Because these transformations involve non-linearity, care must be taken during this step to avoid bias. Several correction factors have been proposed in the literature for removing the gross bias in estimates, but their performance as predictors of biomass has not yet been examined. This is an important problem, and here we review nine such correction factors in terms of their abilities to estimate and predict biomass for new trees. We compare their performance by examining their bias and variability based on large datasets of above-ground biomass and stem diameter for eight species of harvested trees and shrubs in the genera Eucalyptus and Acacia (n = 102-365 individuals per species). We found that good estimates of average biomass turned out to be good predictors of biomass for new trees. The linear model fitted has log of the above-ground biomass as the response variable and log of the stem diameter as the covariate. The only exactly unbiased estimate among those considered was the uniform minimum variance unbiased (UMVU) estimate, which involves evaluating a confluent hypergeometric function to obtain its correction factor. Three alternative correction factors that are easy to compute also performed well.
One of these minimises mean squared error and was found to result in low bias, low prediction bias, the lowest mean squared error, and the lowest mean squared prediction error among all correction factors examined.
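The most familiar of the easy-to-compute factors is the lognormal correction exp(s²/2) applied to the naive back-transform, where s² is the residual variance on the log scale. The sketch below uses simulated allometric data with invented power-law parameters; the paper compares nine factors, including the UMVU estimate, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
diameter = rng.uniform(5, 50, n)       # stem diameters (hypothetical units)
sigma = 0.4
# Power-law biomass with lognormal error: B = 0.1 * D**2.4 * exp(eps)
log_b = np.log(0.1) + 2.4 * np.log(diameter) + rng.normal(0, sigma, n)

# Fit the linearised model on the log scale.
X = np.column_stack([np.ones(n), np.log(diameter)])
coef, *_ = np.linalg.lstsq(X, log_b, rcond=None)
resid = log_b - X @ coef
s2 = resid @ resid / (n - 2)           # residual variance estimate

d_new = 30.0
naive = np.exp(coef[0] + coef[1] * np.log(d_new))   # back-transform, biased low
corrected = naive * np.exp(s2 / 2)     # lognormal correction factor
true_mean = 0.1 * d_new ** 2.4 * np.exp(sigma ** 2 / 2)
```

The naive back-transform estimates the median biomass at a given diameter; multiplying by exp(s²/2) shifts it toward the mean, which is the quantity of interest when summing biomass over many trees.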
David Clifford

Model-assisted optimal allocation for planned domains using composite estimation
http://ro.uow.edu.au/cssmwp/109
Thu, 03 Oct 2013 19:59:19 PDT
This paper develops allocation methods for stratified sample surveys where small area estimates are a priority. Small areas are domains of interest with sample sizes too small for traditional direct estimation to be feasible. Composite estimation may then be used to balance between a grand-mean estimate and an area-specific estimate for each small area. In this paper, we assume stratified sampling where small areas are strata. Similar to Longford (2006), we seek efficient allocations where the aim is to minimise a linear combination of the mean squared errors of composite small area estimators and of an estimator of the overall mean. Unlike Longford, we define mean squared error in a model-assisted framework, allowing a more natural interpretation of results using an intra-class correlation parameter. This optimal allocation is only available analytically for a special case, and has the unappealing property that some strata may be allocated no sample. Some alternative allocations, including a power allocation with numerically optimized exponent, are found to perform nearly as well as the optimal allocation, but with better practical properties.
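A power allocation with a numerically chosen exponent can be sketched as follows. The criterion here is a stand-in (a weighted sum of per-area and overall variances rather than the paper's composite-MSE objective), and the stratum sizes are invented.

```python
import numpy as np

def power_allocation(N, n_total, q):
    """Stratum sample sizes proportional to N_h**q: q = 1 gives
    proportional allocation, q = 0 gives equal allocation."""
    w = N.astype(float) ** q
    return n_total * w / w.sum()

def criterion(N, n, sigma2=1.0, lam=0.5):
    """Illustrative objective: a weighted sum of the average variance of
    per-area direct estimators (sigma2/n_h) and the variance of the
    overall mean estimator, standing in for a composite-MSE criterion."""
    per_area = np.mean(sigma2 / n)
    W = N / N.sum()
    overall = np.sum(W ** 2 * sigma2 / n)
    return lam * per_area + (1 - lam) * overall

N = np.array([100, 500, 2000, 8000, 20000], dtype=float)
n_total = 1000.0
grid = np.linspace(0.0, 1.0, 101)
values = [criterion(N, power_allocation(N, n_total, q)) for q in grid]
q_best = grid[int(np.argmin(values))]
```

Because the per-area term is minimised by equal allocation (q = 0) and the overall term by proportional allocation (q = 1), the optimised exponent lands strictly between the two, and every stratum receives a positive sample, avoiding the zero-allocation problem of the analytic optimum.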
Wilford Molefe

A faster and computationally more efficient REML (PX)EM algorithm for linear mixed models
http://ro.uow.edu.au/cssmwp/108
Thu, 03 Oct 2013 18:21:52 PDT
Residual maximum likelihood is the preferred method for estimating variance parameters associated with a linear mixed model. Typically, an iterative algorithm is required for the estimation of these parameters. Two algorithms which can be used for this purpose are the EM algorithm and the PX-EM algorithm. Both require specification of the complete data, which comprises the incomplete (observed) data and the missing data. We consider a new incomplete data specification which is computationally more efficient than alternative specifications. In the example considered, the new incomplete data specification results in the algorithm converging in 30% fewer iterations than the alternative specification. We describe the conditions necessary for this faster rate of convergence to apply in other cases.
Simon Diffey

Incorporating household type in mixed logistic models for people in households
http://ro.uow.edu.au/cssmwp/107
Thu, 03 Oct 2013 18:21:48 PDT
Generalized linear mixed models (GLMMs), particularly the random intercept logistic regression model, are often used to model binary outcomes for people in households. A challenge in fitting these models is that the degree of dependency between co-householders often depends on the type of household, such as households of related people, households of unrelated people, and single person households. The use of a different variance component for each household type is investigated using two representative datasets, on voting behaviour and health risk factors and outcomes, and a simulation study. Variance components are found to be significantly different across household types in the examples. Models which ignore this understate covariate effects for household types with lower variance components, typically single person households.
Robert Graham Clark

What Level of Statistical Model Should We Use in Small Domain Estimation?
http://ro.uow.edu.au/cssmwp/106
Tue, 01 Oct 2013 15:42:49 PDT
If unit-level data are available, Small Area Estimation (SAE) is usually based on models formulated at the unit level, but they are ultimately used to produce estimates at the area level and thus involve area-level inferences. This paper investigates the circumstances when using an area-level model may be more effective. Linear mixed models fitted using different levels of data are applied in SAE to calculate synthetic estimators and Empirical Best Linear Unbiased Predictors (EBLUPs). The performance of area-level models is compared with unit-level models when both individual and aggregate data are available. A key factor is whether there are substantial contextual effects. Ignoring these effects in unit-level working models can cause biased estimates of regression parameters. The contextual effects can be automatically accounted for in the area-level models. Using synthetic and EBLUP techniques, small area estimates based on different levels of linear mixed models are studied in a simulation study.
Mohammad-Reza Namazi-Rad

Multi-phase variety trials using both composite and individual replicate samples: A model-based design approach
http://ro.uow.edu.au/cssmwp/105
Sun, 19 May 2013 18:58:04 PDT
This paper provides an approach for the design and analysis of variety trials that are used to obtain quality trait data. These trials are multi-phase in nature, comprising a field phase followed by one or more laboratory phases. Typically the laboratory phases are costly relative to the field phase, and this necessitates a limit on the number of samples that can be tested. Historically, this has been achieved by sacrificing field replication, either by testing a single replicate plot for each variety or a single composite sample obtained by combining material from several field replicates. An efficient statistical analysis cannot be applied to such data, precluding valid inference and accurate prediction of genetic effects. In this paper we propose an approach in which some varieties are tested using individual field replicate samples and others as composite samples. Replication in the laboratory is achieved by splitting a relatively small number of field samples into sub-samples for separate processing. We show that, if necessary, some of the composite samples may be split for this purpose. We also show that, given a choice of field compositing and laboratory replication strategy, an efficient design for a laboratory phase may be obtained using model-based techniques. The methods are illustrated using two examples.
Alison B. Smith