July 19, 2019
For each cancer, we first defined the set of genes most strongly associated with it according to the literature. To project each patient into this semantic space, we next determined whether each patient had a mutation in the genes that make up the disease semantic vector (e.g. NSCLC). For TCGA data we counted only non-Silent mutations and represented them as a binary number for each gene, i.e. 0 if the patient had no mutations in that gene and 1 if the patient had a non-Silent mutation in the gene. Each patient was then encoded as a binary vector, with each element corresponding to a gene from the disease semantic vector. For example, lung adenocarcinoma was associated with 1,367 genes in our semantic space. A lung adenocarcinoma patient’s vector would then be composed of 1,367 binary numbers indicating whether each gene is mutated in that patient. We then multiplied each binary vector element by element with the semantic disease vector to obtain the patient’s projection in the disease space, which in effect replaces each binary number with the Semantic Association Score between the gene and the disease. In contrast to clustering patient samples by their mutation or structural variant data alone, our projected patient vectors grouped patients into 22 clusters with high patient-to-patient similarity. These clusters recapitulate canonical knowledge about the disease, e.g. lung adenocarcinoma patients form clusters that include EGFR-driven and KRAS-driven cohorts. We also see novel groups of patients driven by genes such as MET, STK11 and MALAT1. These clusters can be further stratified by survival status and other clinical features. We validated our approach with a non-TCGA mesothelioma cohort, revealing similarities in patient stratification regardless of the data source.
This approach represents a dramatic shift in patient segmentation, delivering real-time grouping of patients and biomarker identification, which can accelerate clinical trial design and therapeutic development strategy.
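
The encoding and projection steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four-gene list, the score values, and the function names (encode_patient, project_patient, cosine_similarity) are hypothetical and chosen only for illustration. The use of cosine similarity as the patient-to-patient measure is likewise an assumption, since the text does not name the metric.

```python
import math
from typing import List, Set

# Hypothetical disease semantic vector: genes and made-up association scores.
# The real lung adenocarcinoma vector in the text has 1,367 genes.
DISEASE_GENES: List[str] = ["EGFR", "KRAS", "MET", "STK11"]
DISEASE_SCORES: List[float] = [0.9, 0.8, 0.4, 0.5]


def encode_patient(mutated_genes: Set[str], disease_genes: List[str]) -> List[int]:
    """Binary encoding: 1 if the patient has a non-Silent mutation in the gene, else 0."""
    return [1 if gene in mutated_genes else 0 for gene in disease_genes]


def project_patient(binary_vector: List[int], disease_scores: List[float]) -> List[float]:
    """Element-wise product: each 1 is replaced by that gene's association score."""
    return [bit * score for bit, score in zip(binary_vector, disease_scores)]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Assumed similarity measure between two projected patients (0.0 if either is all zeros)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Two toy patients, described by the genes carrying non-Silent mutations.
patient_a = encode_patient({"EGFR", "STK11"}, DISEASE_GENES)   # [1, 0, 0, 1]
patient_b = encode_patient({"KRAS", "STK11"}, DISEASE_GENES)   # [0, 1, 0, 1]

projected_a = project_patient(patient_a, DISEASE_SCORES)       # [0.9, 0.0, 0.0, 0.5]
projected_b = project_patient(patient_b, DISEASE_SCORES)       # [0.0, 0.8, 0.0, 0.5]

similarity_ab = cosine_similarity(projected_a, projected_b)
```

In this sketch the projected vectors, rather than the raw binary vectors, would be passed to whatever clustering method is used, so that genes with a stronger literature association to the disease contribute more to the similarity between patients.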