Data mining: potential applications in research on nutrition and health
Aim: Data mining enables further insights from nutrition-related research, but caution is required. The aim of this analysis was to demonstrate and compare the utility of data mining methods in classifying a categorical outcome derived from a nutrition-related intervention.
Methods: Baseline data (23 variables, 8 categorical) on participants (n = 295) in an intervention trial were used to classify participants in terms of meeting the criteria of achieving 10 000 steps per day. Results from classification and regression trees (CARTs), random forests, adaptive boosting, logistic regression, support vector machines and neural networks were compared using area under the curve (AUC) and error assessments.
Results: The CART produced the best model when considering the AUC (0.703), overall error (18%) and within class error (28%). Logistic regression also performed reasonably well compared to the other models (AUC 0.675, overall error 23%, within class error 36%). All the methods gave different rankings of variables' importance. CART found that body fat, quality of life using the SF-12 Physical Component Summary (PCS) and the cholesterol: HDL ratio were the most important predictors of meeting the 10 000 steps criteria, while logistic regression showed the SF-12PCS, glucose levels and level of education to be the most significant predictors (P≤0.01).
Conclusions: Differing outcomes suggest caution is required with a single data mining method, particularly in a dataset with nonlinear relationships and outliers and when exploring relationships that were not the primary outcomes of the research.