We propose a new technique to prepare statistically robust benchmarking data for evaluating chemical transport model meteorology and air quality parameters within the urban boundary layer. The approach employs atmospheric class-typing, using nocturnal radon measurements to assign atmospheric mixing classes, and can be applied temporally (across the diurnal cycle) or spatially (to create angular distributions of pollutants as a top-down constraint on emissions inventories). In this study only a short (< 1 month) campaign is used, but the grouping of relative mixing classes based on nocturnal mean radon concentrations can be adjusted according to dataset length (i.e., number of days per category) or the desired range of within-class variability. Calculating hourly distributions of observed and simulated values across diurnal composites of each class-type helps to: (i) bridge the gap between scales of simulation and observation, (ii) represent the variability associated with spatial and temporal heterogeneity of sources and meteorology without being confounded by it, and (iii) provide an objective way to group results over whole diurnal cycles that separates 'natural complicating factors' (synoptic non-stationarity, rainfall, mesoscale motions, extreme stability, etc.) from problems related to parameterizations or between-model differences. We demonstrate the utility of this technique using output from a suite of seven contemporary regional forecast and chemical transport models. Meteorological model skill varied across the diurnal cycle for all models, with an additional dependence on the atmospheric mixing class that differed between models. From an air quality perspective, model skill regarding the duration and magnitude of morning and evening "rush hour" pollution events varied strongly as a function of mixing class. Model skill was typically lowest when public exposure would have been highest, which has important implications for assessing potential health risks in new and rapidly evolving urban regions, and also for prioritizing areas of model improvement for future applications.
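The class-typing procedure summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. The nocturnal window (`NOCTURNAL_HOURS`), the equal-count split into `n_classes` groups, the use of the median as the hourly summary, and all function names are assumptions made for illustration only. Days are ranked by nocturnal mean radon, assigned to relative mixing classes, and a diurnal composite of any observed or simulated quantity is then built per class and hour.

```python
from collections import defaultdict
from statistics import mean, median

# Assumed nocturnal window (hours of day); the source does not specify one.
NOCTURNAL_HOURS = range(0, 6)


def nocturnal_mean_radon(records):
    """Mean radon per day over the assumed nocturnal window.

    records: iterable of (day, hour, radon, value) tuples, where `value`
    is any observed or simulated quantity to be composited later.
    """
    by_day = defaultdict(list)
    for day, hour, radon, _ in records:
        if hour in NOCTURNAL_HOURS:
            by_day[day].append(radon)
    return {day: mean(vals) for day, vals in by_day.items()}


def assign_mixing_classes(night_means, n_classes=4):
    """Rank days by nocturnal mean radon and split them into equal-count classes.

    Class 0 holds the lowest-radon nights and the last class the
    highest-radon nights. The equal-count split is an assumption;
    n_classes can be reduced for short datasets.
    """
    ranked = sorted(night_means, key=night_means.get)
    n = len(ranked)
    return {day: i * n_classes // n for i, day in enumerate(ranked)}


def diurnal_composite(records, classes):
    """Median of `value` for each (class, hour) over all days in that class."""
    groups = defaultdict(list)
    for day, hour, _, value in records:
        if day in classes:
            groups[(classes[day], hour)].append(value)
    return {key: median(vals) for key, vals in groups.items()}
```

Applying `diurnal_composite` separately to observations and to each model's output, using the same `classes` mapping, would give like-for-like hourly composites per mixing class. Other percentiles could replace the median where the full hourly distribution is of interest.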