Bolstering heuristics for statistical validation of prediction algorithms
Machine learning research in image-based computer-aided diagnosis is a field characterised by rich models and relatively small datasets. In this regime, conventional statistical tests for cross-validation results may no longer be optimal, owing to variability in training set quality. We present a principle by which existing statistical tests can be conservatively extended to exploit arbitrary numbers of repeated experiments. We apply this principle to interval estimation and pairwise comparison of classification accuracy, and evaluate the resulting procedures on real and synthetic classification tasks. On the synthetic task, interval coverage is notably improved, and the comparison test achieves both increased power and reduced type I error. Experiments on the ADNI dataset show that the low replicability of split-half-based tests can be dramatically improved.