EMLasso: Logistic lasso with missing data
In clinical settings, missing data in the covariates occur frequently. For example, some markers are expensive or hard to measure. When this sort of data is used for model selection, the missingness is often resolved through a complete case analysis or a form of single imputation. An alternative sometimes comes in the form of leaving the most damaged covariates out. All these strategies jeopardise the goal of model selection.In earlier work, we have applied the logistic Lasso in combination with multiple imputation to obtain results in such settings, but we only provided heuristic arguments to advocate the method. In this paper, we propose an improved method that builds on firm statistical arguments and that is developed along the lines of the stochastic expectation-maximisation algorithm. We show that our method can be used to handle missing data in both categorical and continuous predictors, as well as in a nonpenalised regression. We demonstrate the method by applying it to data of 273 lung cancer patients. The objective is to select a model for the prediction of acute dysphagia, starting from a large set of potential predictors, including clinical and treatment covariates as well as a set of single-nucleotide polymorphisms.