Patient segmentation using machine-learning based literature and genomic data synthesis uncovers novel cohorts of NSCLC and mesothelioma patients

March 26, 2021

July 19, 2019

Abstract: Current unbiased approaches to mining the large amounts of patient-level data on mutations, structural variations and gene expression yield an unwieldy number of interactions and correlations, which cannot be parsed to identify disease drivers. Here we present an approach to encode mutational and structural variant data at the patient level in a semantic association space. This approach transforms the presence of a mutation (or other feature) in each patient into the quantitative semantic association score between the corresponding gene and the phenotype of interest, which we have trained on all publicly available literature using word-embedding neural networks. Using data from The Cancer Genome Atlas (TCGA), we encoded the mutation or structural variant status (including copy number, fusion and chromothripsis) of all patients in the Lung Adenocarcinoma and Mesothelioma cohorts into our semantic space.

For each cancer, we first defined the set of genes most strongly associated with it in the literature. To project each patient into this semantic space, we next determined whether the patient had a mutation in each of the genes making up the disease semantic vector (e.g. NSCLC). For TCGA data we counted only non-Silent mutations and represented them as a binary value for each gene: 0 if the patient had no mutations in that gene and 1 if the patient had a non-Silent mutation in it. Each patient was then encoded as a binary vector, with each element corresponding to a gene from the disease semantic vector. For example, lung adenocarcinoma was associated with 1,367 genes in our semantic space, so a lung adenocarcinoma patient’s vector would be composed of 1,367 binary values indicating whether each gene is mutated in that patient. We then multiplied these binary vectors by the semantic disease vector to obtain the patient’s projection in the disease space, which in effect replaces each binary value of 1 with the Semantic Association Score between the gene and the disease. In contrast to clustering patient samples by their mutation or structural variant data alone, clustering the projected patient vectors placed patients into 22 groups with high patient-to-patient similarity. These clusters recapitulate canonical knowledge about the disease, e.g. Lung Adenocarcinoma patients form clusters that include EGFR-driven and KRAS-driven cohorts. We also see novel groups of patients driven by genes such as MET, STK11 and MALAT1. These clusters can be further stratified by survival status and other clinical features. We validated our approach with a non-TCGA Mesothelioma cohort, revealing similarities in patient stratification regardless of the data source.
This approach represents a dramatic shift in patient segmentation, delivering real-time grouping of patients and biomarker identification, which can accelerate clinical trial design and therapeutic development strategy.
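
The projection step described above can be illustrated with a minimal sketch. It assumes the multiplication is element-wise (consistent with the statement that each binary value is in effect replaced by the gene's Semantic Association Score). The four-gene panel, the score values, the toy mutation matrix and the function names are all hypothetical and chosen only for illustration. Cosine similarity is used here as one plausible measure of patient-to-patient similarity; the abstract does not state which similarity measure or clustering algorithm was used.

```python
import numpy as np

# Hypothetical toy panel: a real disease semantic vector would hold
# many more genes (1,367 for lung adenocarcinoma in the abstract).
genes = ["EGFR", "KRAS", "STK11", "MET"]

# Made-up gene-disease association scores, for illustration only.
association_scores = np.array([0.9, 0.8, 0.5, 0.4])

# Rows are patients, columns are genes.
# 1 = patient has a non-Silent mutation in that gene, 0 = no mutation.
mutation_matrix = np.array([
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])


def project_patients(binary, scores):
    """Replace each 1 in the binary matrix with the gene's association score."""
    binary = np.asarray(binary, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if binary.ndim != 2 or binary.shape[1] != scores.shape[0]:
        raise ValueError("binary must be (patients x genes) matching scores")
    # Broadcasting multiplies every patient row by the score vector.
    return binary * scores


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity; all-zero rows get similarity 0."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    safe_norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    unit = vectors / safe_norms
    return unit @ unit.T


projected = project_patients(mutation_matrix, association_scores)
similarity = cosine_similarity_matrix(projected)
```

In this sketch, `projected` holds one weighted vector per patient and `similarity` could be passed to any standard clustering routine. Patients with no mutations in the panel project to the zero vector and therefore share no similarity with any other patient under this measure.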

Enrique Garcia-Rivera1, Aaron S. Mansfield1, Karthik Murugadoss1, Murali Aravamudan1
1nference, Cambridge, MA 02142, USA
2Mayo Clinic, Rochester, MN 55905, USA
Correspondence: Enrique Garcia-Rivera (

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.