Centre for Statistical & Survey Methodology Working Paper Series

Comparing and selecting spatial predictors using local criteria

Jonathan R. Bradley et al. — Tue, 27 Sep 2016 23:13:47 PDT

Remote sensing technology for the study of Earth and its environment has led to “Big Data” that, paradoxically, have global extent but may be spatially sparse. Furthermore, the variability in the measurement error and the latent process error may not fit conveniently into the Gaussian linear paradigm. In this paper, we consider the problem of selecting a predictor from a finite collection of spatial predictors of a spatial random process defined on D, a subset of d-dimensional Euclidean space. Critically, we make no statistical distributional assumptions other than additive measurement error. In this nonparametric setting, one could use a criterion based on a validation dataset to select a spatial predictor for all of D. Instead, we propose local criteria based on validation data to select a predictor at each spatial location in D; the result is a hybrid combination of the spatial predictors, which we call a locally selected predictor (LSP). We consider selection from a collection of some of the classical and more recently proposed spatial predictors currently available. In a simulation study, the relative performances of various LSPs, as well as the performance of each of the individual spatial predictors in the collection, are assessed. “Big Data” are always challenging, and here we apply LSP to a very large global spatial dataset of atmospheric CO2 measurements.
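
The local-selection idea can be illustrated with a minimal sketch. It is not the authors' implementation: the local criterion used here (mean squared validation error over the k nearest validation points), the one-dimensional domain, and the names `local_select`, `k`, "A" and "B" are all assumptions made for the example, and the numbers are invented.

```python
def local_select(locations, val_points, val_obs, val_preds, k=3):
    """Pick, at each location, the predictor with the smallest local
    mean squared validation error (an assumed local criterion).

    locations  : coordinates where a predictor must be chosen
    val_points : coordinates of the validation data
    val_obs    : held-out observations at val_points
    val_preds  : dict mapping predictor name -> its predictions at val_points
    """
    names = sorted(val_preds)
    chosen = []
    for s in locations:
        # Indices of the k validation points closest to location s.
        nearest = sorted(range(len(val_points)),
                         key=lambda i: abs(val_points[i] - s))[:k]

        def local_mse(name):
            errs = [(val_obs[i] - val_preds[name][i]) ** 2 for i in nearest]
            return sum(errs) / len(errs)

        chosen.append(min(names, key=local_mse))
    return chosen


# Invented validation data: predictor "A" is accurate on the left of the
# domain, predictor "B" on the right.
val_points = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
val_obs = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]
val_preds = {
    "A": [1.0, 1.0, 1.0, 3.0, 3.0, 3.0],
    "B": [2.0, 2.0, 2.0, 5.0, 5.0, 5.0],
}
selection = local_select([1.0, 9.0], val_points, val_obs, val_preds, k=3)
```

A global criterion would have to commit to one of the two predictors everywhere; the local rule switches between them, which is what makes the resulting predictor a hybrid.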

Constraint choice for spatial microsimulation

Sandy Burden et al. — Tue, 27 Sep 2016 22:49:51 PDT

Spatial microsimulation models are increasingly being used to create realistic microdata for geographical areas, to enable statistical modelling of health, social and economic variables in a wide variety of application areas. The models combine sample records with benchmark data for pre-defined geographic areas, typically by sampling or re-weighting sample records to fit a set of constraints for each area. The choice of constraints is a key factor in producing microdata that reflect the population structure.

This paper introduces the use of within-area homogeneity for selecting categorical constraint variables for spatial microsimulation. The d-statistic is a measure of within-area homogeneity that is equivalent to intra-area correlation for areas with equal population. It can be used to identify the spatial autocorrelation exhibited by the categories of constraint variables, or combinations of categories, an important feature to reproduce when modelling local variation in a variable. It may be used to assess the statistical significance of the within-area homogeneity for a given set of categories and can assist in validating spatial microsimulation models.

Environmental informatics: uncertainty quantification in the environmental sciences

Noel Cressie — Thu, 03 Oct 2013 19:59:49 PDT

This exposition of environmental informatics is an attempt to bring current thinking about uncertainty quantification to the environmental sciences. Environmental informatics is a term that I first heard being used by Bronwyn Harch of Australia's Commonwealth Scientific and Industrial Research Organisation to describe a research theme within her organisation. Just as bioinformatics has grown and includes biostatistics as a sub-discipline, environmental informatics, or EI, has the potential to be much broader than classical environmental statistics.

Multi-species SIR models from a dynamical Bayesian perspective

Lili Zhuang et al. — Thu, 03 Oct 2013 19:59:47 PDT

Multi-species compartment epidemic models, such as the multispecies SIR (susceptible-infectious-recovered) model, are extensions of classic SIR models, used to explore the transient dynamics of pathogens that infect multiple hosts in a large population. In this article, we propose a dynamical Bayesian hierarchical SIR (HSIR) model, to capture the stochastic or random nature of an epidemic process in a multi-species SIR (with recovered becoming susceptible again) dynamical setting, under hidden mass-balance constraints. We call this an MSIRB model. Different from a classic multi-species SIR model (which we call MSIRc), our approach imposes mass balance on the underlying.

A Bayesian multivariate analysis of children’s exposure to pesticides

Noel Cressie et al. — Thu, 03 Oct 2013 19:59:45 PDT

In this article, we present a multivariate Bayesian analysis of the relationships, in preschool children, between environmental pathways of exposure to a non-persistent pesticide, chlorpyrifos (CPF), and its corresponding biomarker in urine, trichloropyridinol (TCP). The analysis uses the three years of data from the Pesticide Exposures of Preschool Children Over Time (PEPCOT) study. Hierarchical Bayesian analysis of pathways of exposure has gained popularity in recent years, where missing and censored data are modeled, and measurement and regression errors are accounted for in a single hierarchical statistical model. Here we consider multivariate pathways, where CPF and its metabolite TCP are modeled jointly in the environmental media. In this article, we analyze each of the three years of the study, focusing on the within-year multivariate nature of the PEPCOT data set. We present the results in a way that allows for an easy comparison of the fitted parameters over time.

Hierarchical statistical modeling of big spatial datasets using the exponential family of distributions

Aritra Sengupta et al. — Thu, 03 Oct 2013 19:59:42 PDT

Big spatial datasets are very common in scientific problems, such as those involving remote sensing of the earth by satellites, climate-model output, small-area samples from national surveys, and so forth. In this article, our interest lies primarily in very large, non-Gaussian datasets. We consider a hierarchical statistical model consisting of a conditional exponential family model for the data and an underlying (hidden) geostatistical process for some transformation of the (conditional) mean of the data model. Within this hierarchical model, dimension reduction is achieved by modeling the geostatistical process as a linear combination of a fixed number of spatial basis functions, which results in substantial computational speedups. These models do not rely on specifying a spatial-weights matrix, and no assumptions of homogeneity, stationarity, or isotropy are made. Our approach to inference using these models is empirical-Bayesian in nature. We develop maximum likelihood (ML) estimates of the unknown parameters using Laplace approximations in an expectation-maximization (EM) algorithm. We illustrate the performance of the resulting empirical hierarchical model using a simulation study. We also apply our methodology to analyze a remote sensing dataset of aerosol optical depth.

Spatial Fay-Herriot models for small area estimation with functional covariates

Aaron Porter et al. — Thu, 03 Oct 2013 19:59:40 PDT

The Fay-Herriot (FH) model is widely used in small area estimation and uses auxiliary information to reduce estimation variance at undersampled locations. We extend the type of covariate information used in the FH model to include functional covariates, such as social-media search loads, or remote-sensing images (e.g., in crop-yield surveys). The inclusion of these functional covariates is facilitated through a two-stage dimension reduction approach that includes a Karhunen-Loeve expansion followed by stochastic search variable selection. Additionally, the importance of modeling spatial autocorrelation has recently been recognized in the FH model; our model utilizes the conditional autoregressive class of spatial models in addition to functional covariates. We demonstrate the effectiveness of our approach through simulation and through the analysis of American Community Survey data. We use Google Trends search curves as functional covariates to analyze changes in rates of household Spanish speaking in the eastern half of the United States.

Local spatial-predictor selection

Jonathan Bradley et al. — Thu, 03 Oct 2013 19:59:38 PDT

Consider the problem of spatial prediction of a random process from a spatial dataset. Global spatial-predictor selection provides a way to choose a single spatial predictor from a number of competing predictors. Instead, we consider local spatial-predictor selection at each spatial location in the domain of interest. This results in a hybrid predictor that could be considered global, since it takes the form of a combination of local predictors; we call this the locally selected spatial predictor. We pursue this idea here using the (empirical) deviance information as our criterion for (global and local) predictor selection. In a small simulation study, the performance of this combined predictor, relative to the individual predictors, is assessed.

Statistical modeling of MODIS cloud data using the spatial random effects model

Aritra Sengupta et al. — Thu, 03 Oct 2013 19:59:36 PDT

Remote sensing of the earth by satellites yields datasets that can be massive in size. To overcome computational challenges, we make use of the reduced-rank Spatial Random Effects (SRE) model in our statistical analysis of cloud mask data from NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) instrument on board NASA’s Terra satellite, launched in December 1999. A set of retrieval algorithms has been developed by members of the MODIS atmospheric team for detecting clouds. Clouds play an important role in climate studies, and hence an accurate quantification of the spatial distribution of clouds is necessary. In this paper, we build a statistical model for the underlying clear-sky-probability (or conversely, the cloud-probability) process, and we quantify the uncertainty in our predictions. We consider a hierarchical statistical model for analyzing the cloud data, where we postulate a hidden process for the probability of clear sky that makes use of the SRE model. Its advantages are considerable: It can represent many types of spatial behavior, it permits fast computations when datasets are very large, and it has attractive change-of-support properties.

Potential gains from using unit level cost information in a model-assisted framework

David Steel et al. — Thu, 03 Oct 2013 19:59:34 PDT

In developing the sample design for a survey we attempt to produce a good design for the funds available. Information on costs can be used to develop sample designs that minimise the sampling variance of an estimator of total for fixed costs. Improvements in survey management systems mean that it is now sometimes possible to estimate the cost of including each unit in the sample. This paper develops relatively simple approaches to determine whether the potential gains arising from using this unit level cost information are likely to be of practical use. It is shown that the key factor is the coefficient of variation of the costs relative to the coefficient of variation of the relative error on the estimated cost coefficients.

Bootstrap p-values for Cochran's Q, Stuart and Bowker tests

D Best et al. — Thu, 03 Oct 2013 19:59:33 PDT

Cochran’s Q assesses treatment differences in randomized block designs with binary data. We suggest using bootstrap p-values rather than p-values based on the chi-squared distribution for tests based on Q. These chi-squared p-values for Q are the only ones usually given in statistical software and can be inaccurate. The same approach allows improved p-values to be given for sparse two-way cross-classification data.
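
As a rough illustration of replacing the chi-squared reference distribution with a resampled one, the sketch below computes Cochran's Q for a blocks-by-treatments table of 0/1 responses and estimates a p-value by resampling. The resampling scheme is an assumption: the abstract does not describe the authors' bootstrap, so this example substitutes a within-block permutation of responses, under which treatments are exchangeable when there is no treatment effect. The function names and the data are invented for the example.

```python
import random


def cochran_q(table):
    """Cochran's Q for a list of blocks (rows) of binary treatment responses."""
    k = len(table[0])                                   # number of treatments
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)                                 # total number of successes
    denom = k * n - sum(r * r for r in row_totals)
    if denom == 0:
        # Every block is all 0s or all 1s: the table carries no information.
        return 0.0
    return (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom


def resampled_p_value(table, n_resamples=2000, seed=0):
    """Monte Carlo p-value for Q, permuting responses within each block."""
    rng = random.Random(seed)
    q_obs = cochran_q(table)
    exceed = 0
    for _ in range(n_resamples):
        shuffled = []
        for row in table:
            new_row = list(row)
            rng.shuffle(new_row)        # keeps the block total, breaks treatment labels
            shuffled.append(new_row)
        if cochran_q(shuffled) >= q_obs:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return q_obs, (exceed + 1) / (n_resamples + 1)


# Invented data: 8 blocks, 3 treatments.
data = [
    [1, 1, 0], [1, 1, 0], [1, 0, 0], [1, 1, 0],
    [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 0, 0],
]
q_obs, p_value = resampled_p_value(data)
```

With so few blocks the chi-squared approximation on k - 1 degrees of freedom can be poor, which is the situation where a resampled p-value is most useful.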

Disease mapping via negative binomial M-quantile regression

Ray Chambers et al. — Thu, 03 Oct 2013 19:59:31 PDT

A new approach to ecological regression for disease mapping is introduced, based on semi-parametric M-quantile regression models. In particular, we define a Negative Binomial M-quantile model as an alternative to Empirical Bayes or fully Bayesian approaches to disease mapping. The area-level covariates used in ecological regression are usually measured with error, and the proposed M-quantile modelling approach is easily made robust against outlying data in the model covariates. Differences between the M-quantile model and the usual random effects models are discussed, and these alternative approaches are compared using the well-known Scottish lip cancer data and a simulation experiment. The lip cancer data example shows that the Negative Binomial M-quantile model confirms results obtained by other methods, but also seems to have less shrinkage than the Empirical Bayes method, so reducing the problem of oversmoothing. The simulation experiment suggests that the new model leads to estimates with smaller mean square error. We also show how the Negative Binomial M-quantile can be extended to account for spatial correlation between areas using a Geographically Weighted Regression strategy.

Poisson M-quantile regression for small area estimation

Nikos Tzavidis et al. — Thu, 03 Oct 2013 19:59:29 PDT

A new approach to model-based small area estimation for count outcomes is proposed and used for estimating the average number of visits to physicians for Health Districts in Central Italy. The proposed small area predictor is based on defining a Poisson M-quantile model by extending the ideas in Cantoni & Ronchetti (2001) and Chambers & Tzavidis (2006). This predictor can be viewed as a semi-parametric outlier robust alternative to the more commonly used plug-in Empirical Best Predictor that is based on a Poisson generalised linear mixed model with Gaussian random effects. Results from the real data application and from a simulation experiment confirm that the proposed small area predictor has good robustness properties and can be more efficient than alternative small area predictors.

Characteristics of empirical zoning distributions for small area health data

Sandy Burden et al. — Thu, 03 Oct 2013 19:59:27 PDT

Many studies of health utilise a multilevel modelling framework and, if individual-level data are not available, use ecological inference to obtain individual-level parameter estimates from area-level data summaries, resulting in biased parameter estimates and increased variance. For these studies, the modifiable areal unit problem means that the scale of the analysis and the zones used to aggregate the data affect the amount and direction of the bias and the increase in variance. To investigate the effects of scale and zoning, in this paper the distribution of the parameter estimates over many sets of zones at the same scale (the zoning distribution) is obtained for parameter estimates from an ecological model at multiple scales of analysis. The distributions are typically symmetrical and unimodal and can be considered to follow a normal distribution. The estimated average parameter estimate (ecological average) displays systematic variation with scale and is related to √M - 1. The variance of the distribution is related to the average number of observations in the areas. The implications of creating and using zoning distributions are wide ranging, as they allow the estimates for a given set of zones at the same or different scales to be compared and assessed.

Bias reduction for correlated linkage error

Gunky Kim et al. — Thu, 03 Oct 2013 19:59:25 PDT

Linked data sets are often multi-linked, i.e. they are created by matching records from three or more data sources. In such cases, probability-based methods for record linkage may lead to correlated linkage errors. Furthermore, it is often the case that not all records can be linked, due to the linking procedure not being able to find suitable matches in at least one of the data sources. This can be simply because the data source is a sample, and so does not contain the requisite matching records. More generally, however, the probability algorithm used to create the matches may not be able to find another record that meets the minimum criterion for matching. In this paper we develop methods for carrying out regression analysis using multi-linked data that allow for both correlated linkage error and unlinked records. We also investigate the role of auxiliary information in this process, focussing on the situation where marginal distribution information from the data sets being linked is available. Our simulation results show that recently published bias reduction methods based on an assumption of independent linkage errors can lead to insufficient bias correction in the correlated case, and that a modified approach which allows for correlated linkage errors is superior. We also show that auxiliary marginal information about the data sets being linked can help further reduce the bias due to both non-linkage and linkage errors.

Maximum likelihood logistic regression with auxiliary information for probabilistically linked data

Gunky Kim et al. — Thu, 03 Oct 2013 19:59:23 PDT

Despite the huge potential benefits, any analysis of probabilistically linked data cannot avoid the problem of linkage errors. These errors occur when probability-based methods are used to link or match records from two or more distinct data sets corresponding to the same target population, and they can lead to biased analytical decisions when they are ignored. Previous studies aimed at resolving this problem have assumed that the analyst has access to all the information used in the data linkage process. In practice, however, most analysts are secondary analysts, with only partial access to information about the linkage error structure. As a consequence, our previous research has focused on using an estimating equations approach to develop bias correction methods for secondary analysis of probabilistically linked data. In this paper we extend this approach to maximum likelihood estimation, using the missing information principle to accommodate the more realistic scenario of dependent linkage errors in both linear and logistic regression settings. We also develop the maximum likelihood solution when population auxiliary information in the form of population summary statistics is available. We also show that the main advantage from inclusion of population summary information is to correct small sample bias.

Correction factors for unbiased, efficient estimation and prediction of biomass from log-log allometric models

David Clifford et al. — Thu, 03 Oct 2013 19:59:21 PDT

Allometric relationships are commonly used to estimate average biomass of trees of a particular size and to predict biomass of individual trees based on an easily measured covariate variable such as stem diameter. They are typically power relationships which, for the purpose of data fitting, are transformed using natural logarithms to convert the model to its linear equivalent. Implementation of these equations to estimate the relationships and to predict biomass of new trees on the natural (i.e., actual) scale requires back-transforming the logarithmic predictions. Because these transformations involve non-linearity, care must be taken during this step to avoid bias. Several correction factors have been proposed in the literature for removing the gross bias in estimates, but their performance as predictors of biomass has not yet been examined. This is a very important problem, and here we review nine such correction factors in terms of their abilities to estimate biomass and predict biomass for new trees. We compare their performance by examining their bias and variability based on large datasets of above-ground biomass and stem diameter for eight species of harvested trees and shrubs in the genera Eucalyptus and Acacia (n = 102-365 individuals per species). We found that good estimates of average biomass turned out to be good predictors of biomass for new trees. The linear model fitted has log of the above-ground biomass as the response variable and log of the stem diameter as the covariate. The only exactly unbiased estimate among those considered was the uniform minimum variance unbiased (UMVU) estimate, which involves evaluating a confluent hypergeometric function to obtain its correction factor. Three alternative correction factors that are easy to compute also performed well. One of these minimises mean squared error and was found to result in low bias, low prediction bias, the lowest mean squared error, and the lowest mean squared prediction error among all correction factors examined.
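
The back-transformation step can be made concrete with a small sketch. It fits the log-log linear model by ordinary least squares and applies the familiar lognormal correction factor exp(s²/2), where s² is the residual variance on the log scale. This is only one commonly used correction; the abstract does not list the nine factors compared, so whether and how this one appears in the paper is an assumption. The function names and the measurements are invented for the example.

```python
import math


def fit_loglog(diameters, biomasses):
    """OLS fit of log(biomass) = intercept + slope * log(diameter)."""
    x = [math.log(d) for d in diameters]
    y = [math.log(m) for m in biomasses]
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)          # residual variance on the log scale
    return intercept, slope, s2


def predict_biomass(diameter, intercept, slope, s2, corrected=True):
    """Back-transform a log-scale prediction to the natural scale."""
    naive = math.exp(intercept + slope * math.log(diameter))
    # exp(s2 / 2) offsets the downward bias of exponentiating a log-scale mean.
    return naive * math.exp(s2 / 2) if corrected else naive


# Invented measurements (stem diameter, above-ground biomass).
diameters = [5.0, 8.0, 12.0, 15.0, 20.0, 25.0]
biomasses = [4.1, 13.5, 38.0, 61.0, 140.0, 230.0]
a, b, s2 = fit_loglog(diameters, biomasses)
naive_pred = predict_biomass(10.0, a, b, s2, corrected=False)
corrected_pred = predict_biomass(10.0, a, b, s2, corrected=True)
```

The corrected prediction is never smaller than the naive one, and the two coincide only when the log-scale fit is exact (s² = 0).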

Model-assisted optimal allocation for planned domains using composite estimation

Wilford Molefe et al. — Thu, 03 Oct 2013 19:59:19 PDT

This paper develops allocation methods for stratified sample surveys where small area estimates are a priority. Small areas are domains of interest with sample sizes too small to allow traditional direct estimation to be feasible. Composite estimation may then be used, to balance between using a grand mean estimate and an area-specific estimate for each small area. In this paper, we assume stratified sampling where small areas are strata. Similar to Longford (2006), we seek efficient allocations where the aim is to minimise a linear combination of the mean squared errors of composite small area estimators and of an estimator of the overall mean. Unlike Longford, we define mean-squared error in a model-assisted framework, allowing a more natural interpretation of results using an intra-class correlation parameter. This optimal allocation is only available analytically for a special case, and has the unappealing property that some strata may be allocated no sample. Some alternative allocations, including a power allocation with numerically optimized exponent, are found to perform nearly as well as the optimal allocation, but with better practical properties.
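
To give a feel for what a power allocation looks like, the sketch below allocates a total sample across strata in proportion to a power q of the stratum population size. This simple form is an assumption made for illustration; the abstract does not state the exact allocation rule or how the exponent is optimised. The function name and the stratum sizes are invented, and rounding to whole sample units is left out.

```python
def power_allocation(stratum_sizes, total_sample, q):
    """Allocate total_sample so that n_h is proportional to N_h ** q.

    q = 1 gives proportional allocation, q = 0 gives equal allocation,
    and intermediate values trade off between the two.
    """
    weights = [size ** q for size in stratum_sizes]
    total_weight = sum(weights)
    return [total_sample * w / total_weight for w in weights]


sizes = [100, 400, 1600]            # invented stratum population sizes
proportional = power_allocation(sizes, 210, q=1.0)
equal = power_allocation(sizes, 210, q=0.0)
compromise = power_allocation(sizes, 210, q=0.5)
```

Unlike an allocation that can give some strata no sample at all, any finite q assigns every stratum a positive share, which is one reason such compromises can be attractive when small area estimates matter.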

A faster and computationally more efficient REML (PX)EM algorithm for linear mixed models

Simon Diffey et al. — Thu, 03 Oct 2013 18:21:52 PDT

Residual maximum likelihood is the preferred method for estimating variance parameters associated with a linear mixed model. Typically an iterative algorithm is required for the estimation of these parameters. Two algorithms which can be used for this purpose are the EM algorithm and the PX-EM algorithm. Both require specification of the complete data which comprises the incomplete and missing data. We consider a new incomplete data specification which is computationally more efficient than alternative specifications. In the example considered the new incomplete data specification results in the algorithm converging in 30% fewer iterations than the alternative specification. We describe the conditions necessary for this faster rate of convergence to apply in other cases.

Incorporating household type in mixed logistic models for people in households

Robert Graham Clark — Thu, 03 Oct 2013 18:21:48 PDT

Generalized linear mixed models (GLMMs), particularly the random intercept logistic regression model, are often used to model binary outcomes for people in households. A challenge in fitting these models is that the degree of dependency between co-householders often depends on the type of household, such as households of related people, households of unrelated people, and single person households. The use of a different variance component for each household type is investigated using two representative datasets, on voting behaviour and health risk factors and outcomes, and a simulation study. Variance components are found to be significantly different across household types in the examples. Models which ignore this understate covariate effects for household types with lower variance components, typically single person households.