Dataset Download
Access Note: Some cohorts require formal data access approval from their original repositories (EGA or dbGaP). For these, download raw data from the original repository and process it using our mRNA TPM processing pipeline.
Open-access cohorts can be downloaded directly using the links below. To obtain pre-processed gene expression data (pre-treatment mRNA TPM), complete our data request form and send it as a formal request to marinka@hms.harvard.edu (cc: wanxiang_shen@u.nus.edu).
Cohort | Cancer Type | Patients (R/NR) | Group | Reference | Accession ID | Download |
---|---|---|---|---|---|---|
IMmotion150 | KIRC | 165 (48/117) | Large cohort | McDermott et al. Nat Med, 2018 | EGA: EGAS00001002928 | Request from repository |
IMvigor210 | BLCA | 298 (68/230) | Large cohort | IMvigor210 Study Group. Lancet, 2017 | EGA: EGAS00001002556 | Request from repository |
Miao et al. | KIRC | 17 (5/12) | Small cohort | Miao et al. Science, 2018 | dbGaP: phs001493.v1.p1 | Request from repository |
Ravi et al. (SU2C-MARK) | NSCLC | 102 (38/64) LUAD; 25 (8/17) LUSC | Large & Small cohorts | Ravi et al. Nat Genet, 2023 | dbGaP: phs002822.v1.p1 | Request from repository |
Liu et al. | SKCM | 107 (41/66) | Large cohort | Liu et al. Nat Med, 2019 | dbGaP: phs000452.v3.p1 | Request from repository |
Van Allen et al. | SKCM | 39 (13/26) | Medium cohort | Van Allen et al. Science, 2015 | dbGaP: phs000452.v3.p1 | Request from repository |
Freeman et al. (MGH) | SKCM | 34 (12/22) | Medium cohort | Freeman et al. Cell Rep. Med, 2022 | dbGaP: phs002683.v1.p1 | Request from repository |
Zhao et al. | GBM | 25 (11/14) | Small cohort | Zhao et al. Nat Med, 2019 | SRA: PRJNA482620 | Clinical data, mRNA data |
Kim et al. | STAD | 45 (12/33) | Medium cohort | Kim et al. Nat Med, 2018 | ENA: PRJEB25780 | Clinical data, mRNA data |
Gide et al. | SKCM | 73 (40/33) | Medium cohort | Gide et al. Cancer Cell, 2019 | ENA: PRJEB23709 | Clinical data, mRNA data |
Riaz et al. | SKCM | 51 (10/41) | Medium cohort | Riaz et al. Cell, 2017 | BioProject: PRJNA356761 | Clinical data, mRNA data |
Hugo et al. | SKCM | 26 (14/12) | Small cohort | Hugo et al. Cell, 2016 | GEO: GSE78220 | Clinical data, mRNA data |
Rose et al. | BLCA | 89 (16/73) | Medium cohort | Rose et al. BJC, 2021 | GEO: GSE176307 | Clinical data, mRNA data |
Snyder et al. | BLCA | 21 (7/14) | Small cohort | Snyder et al. PLoS Med, 2017 | Zenodo: 10.5281/zenodo.546110 | Clinical data, mRNA data |
Choueiri et al. | KIRC | 16 (3/13) | Small cohort | Choueiri et al. Clin Cancer Res, 2016 | CRI iAtlas (Open) | Clinical data, mRNA data |
Additionally, we provide paired pre- and post-ICI-treatment mRNA expression data. Because some patients have multiple post-treatment samples, the dataset contains 86 pre-post treatment pairs from 78 patients across three cohorts: Riaz (n=43), Freeman (n=27), and Gide (n=16). By treatment, the pairs break down into PD-1 (n=71), CTLA-4 + PD-1 (n=9), and CTLA-4 (n=6). All n values here count pairs, which is why each breakdown sums to 86 rather than 78.
Cohort | Cancer Type | Pairs (R/NR) | Group | Pre-treatment data | Post-treatment data |
---|---|---|---|---|---|
Riaz (n=43), Freeman (n=27), Gide (n=16) | SKCM | 86 (22/64) | Pre-Post treatment Pairs | Clinical data, mRNA data | Clinical data, mRNA data |
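The pair count exceeds the patient count because a patient with several post-treatment samples contributes one pair per sample. The sketch below illustrates that bookkeeping on made-up records. The tuple layout `(patient_id, timepoint, sample_id)` and the sample names are hypothetical and are not taken from the released files.

```python
from collections import defaultdict

def build_pairs(samples):
    """Pair each patient's pre-treatment sample with every post-treatment sample.

    `samples` is an iterable of (patient_id, timepoint, sample_id) tuples with
    timepoint equal to "pre" or "post". This layout is a hypothetical one.
    Assumes at most one pre-treatment sample per patient.
    """
    pre = {}
    post = defaultdict(list)
    for patient_id, timepoint, sample_id in samples:
        if timepoint == "pre":
            pre[patient_id] = sample_id
        else:
            post[patient_id].append(sample_id)
    # One pair per post-treatment sample; patients missing either side are skipped.
    return [(pid, pre[pid], s) for pid in pre for s in post.get(pid, [])]

# Made-up records for illustration only.
samples = [
    ("P1", "pre", "P1_pre"),
    ("P1", "post", "P1_post_a"),
    ("P1", "post", "P1_post_b"),
    ("P2", "pre", "P2_pre"),
    ("P2", "post", "P2_post"),
    ("P3", "pre", "P3_pre"),  # no post-treatment sample, so no pair
]
pairs = build_pairs(samples)
n_patients = len({pid for pid, _, _ in pairs})
print(f"{len(pairs)} pairs from {n_patients} patients")  # 3 pairs from 2 patients
```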
Model Download
We provide pre-trained and fine-tuned Compass models for specific use cases. Click the links below to download.
No. | Model | Description | Download |
---|---|---|---|
1 | PT Model | Base model pre-trained on pan-cancer TCGA transcriptomic datasets (33 cancer types), used for concept feature extraction. | Download |
2 | PFT Model | Partially fine-tuned (PFT) model trained on all ICI patients (n = 1,133) for response prediction. | Download |
3 | LFT Model | Linear-probing fine-tuned (LFT) model trained on all ICI patients (n = 1,133) for response prediction. | Download |
4 | Atezo Model | Multi-stage fine-tuned model (PFT->PFT) developed on bladder cancer patients (n = 354) for Atezolizumab response prediction. | Download |
5 | Ipi Model | Multi-stage fine-tuned model (PFT->LFT) developed on melanoma patients (n = 57) for Ipilimumab response prediction. | Download |
6 | Nivo Model | Multi-stage fine-tuned model (PFT->PFT) developed on melanoma patients (n = 105) for Nivolumab response prediction. | Download |
7 | Pembro Model | Multi-stage fine-tuned model (PFT->PFT) developed on melanoma patients (n = 120) for Pembrolizumab response prediction. | Download |
8 | Leave-Choueiri | PFT Model trained on 1,117 patients excluding the Choueiri cohort (16 patients). | Download |
9 | Leave-Miao | PFT Model trained on 1,116 patients excluding the Miao cohort (17 patients). | Download |
10 | Leave-Snyder | PFT Model trained on 1,112 patients excluding the Snyder cohort (21 patients). | Download |
11 | Leave-Zhao | PFT Model trained on 1,108 patients excluding the Zhao cohort (25 patients). | Download |
12 | Leave-SU2CLC2 | PFT Model trained on 1,108 patients excluding the Ravi-2 cohort (SU2C-MARK LUSC, 25 patients). | Download |
13 | Leave-Hugo | PFT Model trained on 1,107 patients excluding the Hugo cohort (26 patients). | Download |
14 | Leave-Allen | PFT Model trained on 1,094 patients excluding the Van Allen cohort (39 patients). | Download |
15 | Leave-MGH | PFT Model trained on 1,099 patients excluding the Freeman (MGH) cohort (34 patients). | Download |
16 | Leave-Kim | PFT Model trained on 1,088 patients excluding the Kim cohort (45 patients). | Download |
17 | Leave-Riaz | PFT Model trained on 1,082 patients excluding the Riaz cohort (51 patients). | Download |
18 | Leave-Rose | PFT Model trained on 1,044 patients excluding the Rose cohort (89 patients). | Download |
19 | Leave-Gide | PFT Model trained on 1,060 patients excluding the Gide cohort (73 patients). | Download |
20 | Leave-SU2CLC1 | PFT Model trained on 1,031 patients excluding the Ravi-1 cohort (SU2C-MARK LUAD, 102 patients). | Download |
21 | Leave-Liu | PFT Model trained on 1,026 patients excluding the Liu cohort (107 patients). | Download |
22 | Leave-IMmotion150 | PFT Model trained on 968 patients excluding the IMmotion150 cohort (165 patients). | Download |
23 | Leave-IMVigor210 | PFT Model trained on 835 patients excluding the IMvigor210 cohort (298 patients). | Download |
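The training-set size of each leave-one-cohort-out model is the full ICI total (1,133) minus the size of the held-out cohort. The sketch below reproduces that arithmetic from the cohort sizes listed in the dataset table. The dictionary keys mirror the model names above, and the function name is our own, not part of the Compass code.

```python
# Cohort sizes copied from the dataset table; Ravi et al. is split into its
# two subsets (SU2CLC1 = 102 patients, SU2CLC2 = 25 patients).
COHORT_SIZES = {
    "Choueiri": 16, "Miao": 17, "Snyder": 21, "Zhao": 25, "SU2CLC2": 25,
    "Hugo": 26, "MGH": 34, "Allen": 39, "Kim": 45, "Riaz": 51,
    "Gide": 73, "Rose": 89, "SU2CLC1": 102, "Liu": 107,
    "IMmotion150": 165, "IMVigor210": 298,
}

def leave_one_out_sizes(sizes):
    """Map each held-out cohort to (training patients, held-out patients)."""
    total = sum(sizes.values())
    return {name: (total - n, n) for name, n in sizes.items()}

splits = leave_one_out_sizes(COHORT_SIZES)
for name, (n_train, n_held_out) in splits.items():
    print(f"Leave-{name}: train on {n_train}, hold out {n_held_out}")
```

Running it gives 1,133 as the sum of all cohorts and, for example, 835 training patients when IMvigor210 is held out, matching the table.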
Other Materials
Below is a list of additional datasets, including gene ID mapping, cancer type encoding, input examples, high-level concepts, and more.
Data | Description | Download |
---|---|---|
Cancer Code | Encoding for 33 cancer types. | Download |
Gene Code | Encoding for 15,672 genes. | Download |
Concepts | Details of 44 high-level concepts, including their corresponding gene sets, genes, and references. | Download |
Gene ID Map | A comprehensive gene ID mapping file, including ENS IDs, gene names, gene types, and Entrez gene IDs. | Download |
Input TPM Example | An example dataset from the Gide cohort, formatted as Compass input for response prediction. The first column holds the cancer type; the remaining columns hold gene expression TPM values. | Download |
Input Clinical Data | Clinical information for the Gide cohort samples in the Input TPM Example. | Download |
PT-Training Example | Sample dataset for Compass model pre-training. | Download |
PT-Test Example | Sample dataset for testing model performance during pre-training. | Download |
Toy Raw Counts | Example raw count data, used for illustrating the conversion from raw counts to TPM values. | Download |
Toy TPM | Example TPM data derived from raw counts, used for illustrating raw count to TPM value conversion. | Download |
Gencode v36 Annotation | The version 36 Gencode annotation file. | Download |
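For readers unfamiliar with the raw count to TPM conversion that the toy files illustrate, here is a minimal sketch of the standard TPM definition: divide each count by gene length in kilobases, then rescale so the sample sums to one million. It is not the project's pipeline. The gene names, counts, and lengths are made up, and how gene lengths are derived from the annotation (for example, total exon length versus effective length) is an assumption left open here.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts for one sample to TPM.

    counts:     dict mapping gene -> raw read count
    lengths_bp: dict mapping gene -> gene length in base pairs
    """
    # Reads per kilobase: normalise each count by gene length in kb.
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    # Per-million scaling factor for this sample.
    scale = sum(rpk.values()) / 1e6
    if scale == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: value / scale for gene, value in rpk.items()}

# Toy values, made up for illustration only.
counts = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 0}
lengths = {"GENE_A": 1000, "GENE_B": 3000, "GENE_C": 2000}

tpm = counts_to_tpm(counts, lengths)
for gene, value in tpm.items():
    print(f"{gene}: {value:.1f}")
```

In this toy case GENE_A and GENE_B have the same count per kilobase, so they split the one million evenly despite GENE_B having three times the raw count.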
Code Download
Access our code from GitHub repositories: