Degree Name

Doctor of Philosophy


School of Computing and Information Technology


Train delays are among the events that draw the most complaints from the public in urban areas. Train delay prediction is critical for advanced traveler information systems (ATIS), which provide valuable information for enhancing the efficiency and effectiveness of intelligent transportation systems (ITS). However, the train delay prediction problem cannot be easily solved by modeling historical or static data from a single data source. In the big data era, a large amount of data is collected from sensor devices across cyber-physical networks. Multimodal transport management systems offer greater availability of various open data sources, such as General Transit Feed Specification (GTFS) static and real-time feeds. With the development of advanced machine learning techniques, a growing number of open data sources are playing increasingly critical roles in the planning and operation of transportation services. However, very few existing ‘big data’ methods meet the specific needs of railways.

This thesis focuses on open traffic data modeling, analysis, and application for train delay prediction. More specifically, GTFS, a standard open data format with both static and real-time feeds, is widely used in public transport planning and operation management. However, compared to other extensively studied data sources such as smart card data and GPS trajectory data, GTFS data has not yet been properly investigated. Using GTFS data is challenging for both transport planners and researchers because the raw data is difficult and complex to understand, process, and leverage. First, this thesis proposes a GTFS data acquisition and processing framework that offers an efficient and effective benchmark tool for converting and fusing GTFS data into a ready-to-use format. This framework opens up potential for wider applications and deeper research. Second, we demonstrate a novel data-driven Primary Delay Prediction System (PDPS) framework, which combines GTFS, Critical Point Search (CPS), and deep learning models to leverage data fusion. Unlike existing research, we present a hybrid deep learning solution for predicting multi-step train delays. Our solution uses Long Short-Term Memory (LSTM) to generate forecasts of train delays based on the delay causes, run-time delay, and dwell time delay. The LSTM handles long-term prediction of running time and dwell time with univariate and multivariate time series data, respectively. We present the performance of the standard LSTM and its variants applied in a novel architecture. Experimental results indicate that the proposed method has superior accuracy for long-term delay prediction.
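To make the idea of fusing static and real-time GTFS data concrete, the following is a minimal sketch, not the thesis framework itself. It assumes the scheduled times come from a GTFS `stop_times.txt` file (whose `trip_id`, `stop_sequence`, `arrival_time`, and `departure_time` fields are part of the GTFS specification) and that observed arrival and departure times have already been extracted from a real-time feed into seconds after the start of the service day. The function names `parse_gtfs_time`, `load_schedule`, and `compute_delays` are hypothetical.

```python
import csv
import io


def parse_gtfs_time(value):
    """Convert a GTFS HH:MM:SS string to seconds after the service-day start.

    GTFS allows hours of 24 or more for trips that run past midnight,
    so the value is not parsed as a clock time.
    """
    hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def load_schedule(stop_times_csv):
    """Read stop_times.txt content into {(trip_id, stop_sequence): (arr, dep)}."""
    schedule = {}
    for row in csv.DictReader(io.StringIO(stop_times_csv)):
        key = (row["trip_id"], int(row["stop_sequence"]))
        schedule[key] = (
            parse_gtfs_time(row["arrival_time"]),
            parse_gtfs_time(row["departure_time"]),
        )
    return schedule


def compute_delays(schedule, observed):
    """Compare observed times against the schedule.

    `observed` maps (trip_id, stop_sequence) to (arrival_s, departure_s).
    This layout is an assumption for illustration, not a GTFS-realtime structure.
    Returns arrival delay and dwell time delay in seconds for each matched stop.
    """
    delays = {}
    for key, (obs_arr, obs_dep) in observed.items():
        if key not in schedule:
            continue  # no scheduled counterpart, so no delay can be computed
        sch_arr, sch_dep = schedule[key]
        delays[key] = {
            "arrival_delay": obs_arr - sch_arr,
            "dwell_delay": (obs_dep - obs_arr) - (sch_dep - sch_arr),
        }
    return delays
```

A series of such per-stop delays, ordered by time, is the kind of univariate or multivariate input a sequence model such as an LSTM could consume.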

Lastly, in what we believe is the first work of its kind, we apply real entropy to measure time series regularity and estimate the potential predictability of train delays. Unlike existing train delay studies that have strived to explore sophisticated algorithms, this study focuses on finding the bound of improvement in predicting multi-scenario train delays with different machine learning methods. Motivated by the observation that deep learning methods fail to improve prediction performance when delays occur rarely, we present a novel augmented machine learning approach to further improve the overall prediction accuracy. Our solution proposes a rule-driven automation (RDA) method, which includes a delay status labeling (DSL) algorithm and the resilience of section (RSE) and resilience of station (RST) indicators, to generate forecasts of train delays. The experimental results demonstrate that the Random Forest based implementation of our RDA method (RF-RDA) can significantly improve the generalization ability of multivariate multi-step forecast models for multi-scenario train delay prediction. The proposed solution surpasses state-of-the-art baselines on real-world traffic datasets, which treat various real-time delays differently. Even when the performance of conventional deep learning methods degrades, our method remains accurate enough for practical use.
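The abstract does not state which estimator is used for the real entropy. As an illustration only, the sketch below assumes the Lempel-Ziv based estimator and the Fano-inequality bound that are common in the predictability literature, applied to a delay series that has already been discretized into symbols (for example, delay status labels). The function names and the discretization step are assumptions, not the thesis implementation.

```python
from math import log2


def _contains(haystack, needle):
    """Return True if `needle` occurs as a contiguous run inside `haystack`."""
    m = len(needle)
    return any(haystack[j:j + m] == needle for j in range(len(haystack) - m + 1))


def real_entropy(symbols):
    """Lempel-Ziv style estimate of the entropy rate, in bits per symbol.

    For each position i, lambda_i is the length of the shortest run starting
    at i that has not appeared anywhere in symbols[:i]. The estimate is
    n * log2(n) / sum(lambda_i). It is biased for short sequences.
    """
    seq = list(symbols)
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two observations")
    total = 0
    for i in range(n):
        k = 1
        while i + k <= n and _contains(seq[:i], seq[i:i + k]):
            k += 1
        total += k
    return n * log2(n) / total


def max_predictability(entropy, n_states):
    """Solve H_b(p) + (1 - p) * log2(N - 1) = entropy for p by bisection.

    The result is an upper bound on the accuracy of any predictor for a
    process with the given entropy rate over N distinct states.
    """
    if n_states < 2:
        raise ValueError("need at least two distinct states")
    if entropy >= log2(n_states):
        return 1.0 / n_states
    if entropy <= 0:
        return 1.0

    def fano(p):
        h = sum(-q * log2(q) for q in (p, 1.0 - p) if q > 0)
        return h + (1.0 - p) * log2(n_states - 1)

    lo, hi = 1.0 / n_states, 1.0  # fano() decreases from log2(N) to 0 here
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if fano(mid) > entropy:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Under these assumptions, a lower entropy estimate yields a higher predictability bound, which gives a ceiling against which the accuracy of different forecasting models can be compared.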

FoR codes (2008)




Unless otherwise indicated, the views expressed in this thesis are those of the author and do not necessarily represent the views of the University of Wollongong.