Integrated statistical analysis of three data sources for the detection of chemical fingerprint features that are simultaneously associated with bio-assay and transcriptomics data
Modern drug discovery pipelines collect many kinds of data, including three important sources: (1) chemical properties of the compounds under investigation, (2) bio-assay data for targets of interest, and, more recently, (3) microarray gene expression data. The main objective is to identify compound fingerprint features that are associated with bio-assay outcomes, gene expression, or both. A particular challenge is the integrated statistical analysis of these data sources. Such an analysis can reveal patterns or features of interest that may go undetected when the data sources are analyzed two at a time. However, no established methods exist for this type of data integration. We investigated the applicability of both an existing and a novel statistical methodology for this purpose. With sparse Canonical Correlation Analysis (Witten & Tibshirani, 2009) we could identify sets of compound features and genes that cluster together. Using the bio-assay data as an additional supervising variable allowed us to identify, within a single statistical framework, patterns of interest that had previously been detected only by jointly interpreting a series of pairwise analyses of the data sets. We also developed a new method that "boosts" the p-values for the association between chemistry and bio-assay data by incorporating the gene expression data obtained on the same compounds. This increases the power to detect compound fingerprint features that are associated with the bio-assay outcome when the pathway from feature to outcome runs through the transcriptome.
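To make the sparse CCA step more concrete, the following is a minimal sketch of a rank-one sparse canonical correlation fit between a compound-by-feature matrix and a compound-by-gene matrix. It is not the implementation used in this work. It is a simplified alternating soft-thresholding scheme in the spirit of the penalized matrix decomposition underlying Witten & Tibshirani (2009). The function names, the toy data, and the rule of setting the threshold to a fraction of the largest absolute entry (instead of choosing it to satisfy an explicit L1 bound) are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(a, frac):
    """Shrink entries of `a` toward zero by frac * max|a| (illustrative rule)."""
    lam = frac * np.max(np.abs(a))
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def unit(w):
    """Scale `w` to unit Euclidean length; leave an all-zero vector as is."""
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w


def sparse_cca_rank1(X, Y, frac_u=0.3, frac_v=0.3, n_iter=100, tol=1e-10):
    """Return sparse weight vectors (u, v) for the columns of X and Y.

    X: compounds x fingerprint features, Y: compounds x genes (same row order).
    """
    X = X - X.mean(axis=0)          # center each feature
    Y = Y - Y.mean(axis=0)          # center each gene
    K = X.T @ Y                     # cross-product matrix, features x genes

    # Start from the leading right singular vector of K.
    _, _, vt = np.linalg.svd(K, full_matrices=False)
    v = vt[0]
    u = np.zeros(K.shape[0])

    for _ in range(n_iter):
        u_new = unit(soft_threshold(K @ v, frac_u))        # update feature weights
        v_new = unit(soft_threshold(K.T @ u_new, frac_v))  # update gene weights
        done = (np.linalg.norm(u_new - u) < tol
                and np.linalg.norm(v_new - v) < tol)
        u, v = u_new, v_new
        if done:
            break
    return u, v


def make_toy_data(seed=0, n=200, p=20, q=30):
    """Simulated data: one latent factor drives 3 features and 4 genes."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    X = rng.standard_normal((n, p))
    Y = rng.standard_normal((n, q))
    X[:, :3] += 3.0 * z[:, None]
    Y[:, :4] += 3.0 * z[:, None]
    return X, Y


X, Y = make_toy_data()
u, v = sparse_cca_rank1(X, Y)
print("selected features:", np.flatnonzero(u).tolist())
print("selected genes:", np.flatnonzero(v).tolist())
```

The nonzero entries of u and v mark the fingerprint features and genes that the fit groups together. In a supervised variant, one plausible approach would be to first screen features and genes by their association with the bio-assay outcome and then run the same fit on the retained columns.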