Publication Details

Chen, H., Wang, L., Chi, C. & Shen, J. (2019). Leveraging SMOTE in A Two-Layer Model for Prediction of Protein-Protein Interactions. 2019 Seventh International Conference on Advanced Cloud and Big Data (CBD) (pp. 133-138). United States: IEEE.


Research into the mechanisms of infectious diseases between hosts and pathogens remains a hot topic. It draws on host-pathogen interaction data, including proteins and genomes, to facilitate the discovery and prediction of underlying mechanisms. However, incomplete protein-protein interaction data impede advances in this exploration and call for wet-lab experiments to examine and verify latent interactions. Although numerous studies have tried to leverage computational models, especially machine learning models, the performance of these models has not been good enough to produce high-fidelity candidate interactions, owing to the nature of protein-protein interaction data. In this paper, we propose a two-layer model for predicting host-pathogen protein-protein interactions that tackles the challenges associated with feature representation algorithms and imbalanced data. The two-layer model consists of two essential modules: XGBoost, to reduce the imbalance ratio of the data, and SVM, to improve performance. SMOTE is incorporated as a key component of our model to alleviate the bias caused by the imbalance ratio. In this study, we carefully collected protein interaction data from public databases and built a dataset following a protocol consistent with the consensus in the literature. A variety of models, including traditional models, models from the major literature and our model, are evaluated on the datasets. Results demonstrate that our model significantly improves performance compared with other state-of-the-art models.
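
For readers unfamiliar with SMOTE, the oversampling step mentioned in the abstract can be sketched roughly as follows. This is only a minimal illustration of the general SMOTE idea (interpolating between a minority-class sample and one of its nearest minority-class neighbours), not the authors' implementation. The abstract does not say how SMOTE is wired into the XGBoost and SVM layers, so that part is left out. The function name `smote`, its parameters, and the toy feature vectors are all assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by linear interpolation.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy, made-up feature vectors standing in for the minority (interacting) class.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
new_points = smote(minority, n_new=6, k=3)
```

Each synthetic point lies on a line segment between two real minority samples, so oversampling this way fills in the minority region rather than duplicating existing points.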



Link to publisher version (DOI)