Antigen presentation and tumor immunogenicity in cancer immunotherapy response prediction

Dependencies
Data download and preprocessing
TCGA Pan-cancer analyses
Immunotherapy datasets analyses
- TIGS and pan-cancer objective response rate to PD-1 inhibition
- Performance comparison on immunotherapy clinical response prediction and survival benefit
Supplementary analyses

Please note this work under Apache License v2 license, copyright belongs to Shixiang Wang and Xue-Song Liu. And the study has been applied for a national patent in China.

This document is compiled from an Rmarkdown file which contains all code or description necessary to reproduce the analysis for the accompanying project. Each section below describes a different component of the analysis and all numbers and figures are generated directly from the underlying data on compilation.

Dependencies

The preprocessing step is dependent on R software and some R packages, if you have not used R yet or R is not installed, please see https://cran.r-project.org/.

R packages:

UCSCXenaTools - download data from UCSC Xena
GEOquery - download data from NCBI GEO database
tidyverse - operate data, plot
data.table - operate data
survival - built in R, used to do survival analysis
metafor, metawho - meta-analysis
forestmodel - generate forestplot for meta-analysis model
forestplot - plot forestplot
survminer - plot survival fit
pROC - ROC analysis and visualization
TCGAmutations - download TCGA mutation data
DT - show data table as a table in html
GSVA - GSVA algorithm implementation
ggstatsplot - plot scatter with linear fit
corrplot - plot correlation
knitr, rmdformats - used to compile this file
readxl - read xlsx data
Some other dependent packages

These R packages are easily searched by internet and installed from either CRAN or Bioconductor, they have no strict version requirements to reproduce the following analyses.

Data download and preprocessing

This part will clearly describe how to process raw data but not provide an easy script for re-running all preprocess steps just by a click due to the complexity of data preprocessing.

TCGA Pan-cancer data download

TCGA Pan-cancer data (Version 2017-10-13), including datasets of clinical informaiton, gene expression, are downloaded from UCSC Xena via R package UCSCXenaTools.

UCSCXenaTools is developed by Shixiang and it is an R package to robustly access data from UCSC Xena data hubs. More please see paper: Wang et al., (2019). The UCSCXenaTools R package: a toolkit for accessing genomics data from UCSC Xena platform, from cancer multi-omics to single-cell RNA-seq. Journal of Open Source Software, 4(40), 1627, https://doi.org/10.21105/joss.01627

If you have not installed this package yet, please run following command in R.

install.packages("UCSCXenaTools")

Load the package with:

library(UCSCXenaTools)

# also load dplyr package
library(dplyr)

Obtain datasets information available at TCGA data hubs of UCSC Xena:

xe <- XenaGenerate(subset = XenaHostNames == "tcgaHub")
xe
#> class: XenaHub 
#> hosts():
#>   https://tcga.xenahubs.net
#> cohorts() (38 total):
#>   TCGA Ovarian Cancer (OV)
#>   TCGA Kidney Clear Cell Carcinoma (KIRC)
#>   TCGA Lower Grade Glioma (LGG)
#>   ...
#>   TCGA Colon Cancer (COAD)
#>   TCGA Formalin Fixed Paraffin-Embedded Pilot Phase II (FPPP)
#> datasets() (879 total):
#>   TCGA.OV.sampleMap/HumanMethylation27
#>   TCGA.OV.sampleMap/HumanMethylation450
#>   TCGA.OV.sampleMap/Gistic2_CopyNumber_Gistic2_all_data_by_genes
#>   ...
#>   TCGA.FPPP.sampleMap/miRNA_HiSeq_gene
#>   TCGA.FPPP.sampleMap/FPPP_clinicalMatrix

Obtain clinical datasets of TCGA:

xe %>% XenaFilter(filterDatasets = "clinical") -> xe_clinical
xe_clinical
#> class: XenaHub 
#> hosts():
#>   https://tcga.xenahubs.net
#> cohorts() (37 total):
#>   TCGA Ovarian Cancer (OV)
#>   TCGA Kidney Clear Cell Carcinoma (KIRC)
#>   TCGA Lower Grade Glioma (LGG)
#>   ...
#>   TCGA Colon Cancer (COAD)
#>   TCGA Formalin Fixed Paraffin-Embedded Pilot Phase II (FPPP)
#> datasets() (37 total):
#>   TCGA.OV.sampleMap/OV_clinicalMatrix
#>   TCGA.KIRC.sampleMap/KIRC_clinicalMatrix
#>   TCGA.LGG.sampleMap/LGG_clinicalMatrix
#>   ...
#>   TCGA.COAD.sampleMap/COAD_clinicalMatrix
#>   TCGA.FPPP.sampleMap/FPPP_clinicalMatrix

Obtain gene expression datasets of TCGA:

xe %>% XenaFilter(filterDatasets = "HiSeqV2_PANCAN$") -> xe_rna_pancan
xe_rna_pancan
#> class: XenaHub 
#> hosts():
#>   https://tcga.xenahubs.net
#> cohorts() (36 total):
#>   TCGA Ovarian Cancer (OV)
#>   TCGA Kidney Clear Cell Carcinoma (KIRC)
#>   TCGA Lower Grade Glioma (LGG)
#>   ...
#>   TCGA Uterine Carcinosarcoma (UCS)
#>   TCGA Colon Cancer (COAD)
#> datasets() (36 total):
#>   TCGA.OV.sampleMap/HiSeqV2_PANCAN
#>   TCGA.KIRC.sampleMap/HiSeqV2_PANCAN
#>   TCGA.LGG.sampleMap/HiSeqV2_PANCAN
#>   ...
#>   TCGA.UCS.sampleMap/HiSeqV2_PANCAN
#>   TCGA.COAD.sampleMap/HiSeqV2_PANCAN

Create data queries and download them:

xe_clinical.query <- XenaQuery(xe_clinical)
xe_clinical.download <- XenaDownload(xe_clinical.query,
  destdir = "UCSC_Xena/TCGA/Clinical", trans_slash = TRUE, force = TRUE
)

xe_rna_pancan.query <- XenaQuery(xe_rna_pancan)
xe_rna_pancan.download <- XenaDownload(xe_rna_pancan.query,
  destdir = "UCSC_Xena/TCGA/RNAseq_Pancan", trans_slash = TRUE
)

The RNASeq data we downloaded are pancan normalized.

For comparing data within independent cohort (like TCGA-LUAD), we recommend to use the “gene expression RNAseq” dataset. For questions regarding the gene expression of this particular cohort in relation to other types tumors, you can use the pancan normalized version of the “gene expression RNAseq” data. For comparing with data outside TCGA, we recommend using the percentile version if the non-TCGA data is normalized by percentile ranking. For more information, please see our Data FAQ: here.

These datasets are downloaded to local machine, we need to load them into R. However, whether filenames of datasets or contents in datasets all look messy, next we need to clean them before real analysis.

TCGA pan-cancer data clean

Clean filenames

First, clean filenames.

# set data directory where TCGA clinical and pancan RNAseq data stored
TCGA_DIR <- "UCSC_Xena/TCGA"
# obtain filenames of rna-seq data
dir(paste0(TCGA_DIR, "/RNAseq_Pancan")) -> RNAseq_filelist
# obtain tcga project code
sub("TCGA\\.(.*)\\.sampleMap.*", "\\1", RNAseq_filelist) -> project_code
# obtain filenames of clinical datasets
dir(paste0(TCGA_DIR, "/Clinical")) -> Clinical_filelist

Check.

head(RNAseq_filelist)
#> [1] "TCGA.ACC.sampleMap__HiSeqV2_PANCAN.gz" 
#> [2] "TCGA.BLCA.sampleMap__HiSeqV2_PANCAN.gz"
#> [3] "TCGA.BRCA.sampleMap__HiSeqV2_PANCAN.gz"
#> [4] "TCGA.CESC.sampleMap__HiSeqV2_PANCAN.gz"
#> [5] "TCGA.CHOL.sampleMap__HiSeqV2_PANCAN.gz"
#> [6] "TCGA.COAD.sampleMap__HiSeqV2_PANCAN.gz"

head(project_code)
#> [1] "ACC"  "BLCA" "BRCA" "CESC" "CHOL" "COAD"

head(Clinical_filelist)
#> [1] "TCGA.ACC.sampleMap__ACC_clinicalMatrix.gz"  
#> [2] "TCGA.BLCA.sampleMap__BLCA_clinicalMatrix.gz"
#> [3] "TCGA.BRCA.sampleMap__BRCA_clinicalMatrix.gz"
#> [4] "TCGA.CESC.sampleMap__CESC_clinicalMatrix.gz"
#> [5] "TCGA.CHOL.sampleMap__CHOL_clinicalMatrix.gz"
#> [6] "TCGA.COAD.sampleMap__COAD_clinicalMatrix.gz"

Obtain TCGA project abbreviations from https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations and read as a data.frame in R.

library(readr)
TCGA_Study <- read_tsv("
                      LAML  Acute Myeloid Leukemia
                      ACC   Adrenocortical carcinoma
                      BLCA  Bladder Urothelial Carcinoma
                      LGG   Brain Lower Grade Glioma
                      BRCA  Breast invasive carcinoma 
                      CESC  Cervical squamous cell carcinoma and endocervical adenocarcinoma
                      CHOL  Cholangiocarcinoma
                      LCML  Chronic Myelogenous Leukemia
                      COAD  Colon adenocarcinoma
                      CNTL  Controls
                      ESCA  Esophageal carcinoma
                      FPPP  FFPE Pilot Phase II
                      GBM   Glioblastoma multiforme
                      HNSC  Head and Neck squamous cell carcinoma
                      KICH  Kidney Chromophobe
                      KIRC  Kidney renal clear cell carcinoma
                      KIRP  Kidney renal papillary cell carcinoma
                      LIHC  Liver hepatocellular carcinoma
                      LUAD  Lung adenocarcinoma
                      LUSC  Lung squamous cell carcinoma
                      DLBC  Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
                      MESO  Mesothelioma
                      MISC  Miscellaneous
                      OV    Ovarian serous cystadenocarcinoma
                      PAAD  Pancreatic adenocarcinoma
                      PCPG  Pheochromocytoma and Paraganglioma
                      PRAD  Prostate adenocarcinoma
                      READ  Rectum adenocarcinoma
                      SARC  Sarcoma
                      SKCM  Skin Cutaneous Melanoma
                      STAD  Stomach adenocarcinoma
                      TGCT  Testicular Germ Cell Tumors
                      THYM  Thymoma
                      THCA  Thyroid carcinoma
                      UCS   Uterine Carcinosarcoma
                      UCEC  Uterine Corpus Endometrial Carcinoma
                      UVM   Uveal Melanoma", col_names = FALSE)
colnames(TCGA_Study) <- c("StudyAbbreviation", "StudyName")

Compare difference.

intersect(project_code, TCGA_Study$StudyAbbreviation) -> project_code
Clinical_filelist[rowSums(sapply(paste0("__", project_code, "_"), function(x) grepl(x, Clinical_filelist))) > 0] -> Clinical_filelist2

setdiff(Clinical_filelist, Clinical_filelist2)
#> [1] "TCGA.COADREAD.sampleMap__COADREAD_clinicalMatrix.gz"
#> [2] "TCGA.FPPP.sampleMap__FPPP_clinicalMatrix.gz"        
#> [3] "TCGA.GBMLGG.sampleMap__GBMLGG_clinicalMatrix.gz"    
#> [4] "TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"

Remove TCGA.FPPP.sampleMap__FPPP_clinicalMatrix.gz which has no RNAseq data. Remove other 3 datasets which merge more than one TCGA study, only keep individual study from TCGA.

Clean clinical datasets

Now, we read TCGA clinical datasets and clean them.

library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
#> ✔ tibble  2.1.3     ✔ stringr 1.4.0
#> ✔ tidyr   1.0.0     ✔ forcats 0.4.0
#> ── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
# keep valuable columns of clinical datasets
select_cols <- c(
  "sampleID", "OS", "OS.time", "OS.unit", "RFS", "RFS.time", "RFS.unit",
  "age_at_initial_pathologic_diagnosis", "gender", "tobacco_smoking_history",
  "tobacco_smoking_history_indicator", "sample_type", "pathologic_M",
  "pathologic_N", "pathologic_T", "pathologic_stage"
)

# read clinical files
Cli_DIR <- paste0(TCGA_DIR, "/Clinical")

#----------------------------------
# Load and preprocess clinical data
#----------------------------------
Clinical_List <- XenaPrepare(paste0(Cli_DIR, "/", Clinical_filelist2))
names(Clinical_List)
#>  [1] "TCGA.ACC.sampleMap__ACC_clinicalMatrix.gz"  
#>  [2] "TCGA.BLCA.sampleMap__BLCA_clinicalMatrix.gz"
#>  [3] "TCGA.BRCA.sampleMap__BRCA_clinicalMatrix.gz"
#>  [4] "TCGA.CESC.sampleMap__CESC_clinicalMatrix.gz"
#>  [5] "TCGA.CHOL.sampleMap__CHOL_clinicalMatrix.gz"
#>  [6] "TCGA.COAD.sampleMap__COAD_clinicalMatrix.gz"
#>  [7] "TCGA.DLBC.sampleMap__DLBC_clinicalMatrix.gz"
#>  [8] "TCGA.ESCA.sampleMap__ESCA_clinicalMatrix.gz"
#>  [9] "TCGA.GBM.sampleMap__GBM_clinicalMatrix.gz"  
#> [10] "TCGA.HNSC.sampleMap__HNSC_clinicalMatrix.gz"
#> [11] "TCGA.KICH.sampleMap__KICH_clinicalMatrix.gz"
#> [12] "TCGA.KIRC.sampleMap__KIRC_clinicalMatrix.gz"
#> [13] "TCGA.KIRP.sampleMap__KIRP_clinicalMatrix.gz"
#> [14] "TCGA.LAML.sampleMap__LAML_clinicalMatrix.gz"
#> [15] "TCGA.LGG.sampleMap__LGG_clinicalMatrix.gz"  
#> [16] "TCGA.LIHC.sampleMap__LIHC_clinicalMatrix.gz"
#> [17] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
#> [18] "TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz"
#> [19] "TCGA.MESO.sampleMap__MESO_clinicalMatrix.gz"
#> [20] "TCGA.OV.sampleMap__OV_clinicalMatrix.gz"    
#> [21] "TCGA.PAAD.sampleMap__PAAD_clinicalMatrix.gz"
#> [22] "TCGA.PCPG.sampleMap__PCPG_clinicalMatrix.gz"
#> [23] "TCGA.PRAD.sampleMap__PRAD_clinicalMatrix.gz"
#> [24] "TCGA.READ.sampleMap__READ_clinicalMatrix.gz"
#> [25] "TCGA.SARC.sampleMap__SARC_clinicalMatrix.gz"
#> [26] "TCGA.SKCM.sampleMap__SKCM_clinicalMatrix.gz"
#> [27] "TCGA.STAD.sampleMap__STAD_clinicalMatrix.gz"
#> [28] "TCGA.TGCT.sampleMap__TGCT_clinicalMatrix.gz"
#> [29] "TCGA.THCA.sampleMap__THCA_clinicalMatrix.gz"
#> [30] "TCGA.THYM.sampleMap__THYM_clinicalMatrix.gz"
#> [31] "TCGA.UCEC.sampleMap__UCEC_clinicalMatrix.gz"
#> [32] "TCGA.UCS.sampleMap__UCS_clinicalMatrix.gz"  
#> [33] "TCGA.UVM.sampleMap__UVM_clinicalMatrix.gz"
sub("TCGA\\.(.*)\\.sampleMap.*", "\\1", names(Clinical_List)) -> project_code

TCGA_Clinical <- tibble()
# for loop
for (i in 1:length(project_code)) {
  clinical <- names(Clinical_List)[i]
  project <- project_code[i]
  df <- Clinical_List[[clinical]]
  col_exist <- select_cols %in% colnames(df)
  res <- tibble()
  if (!all(col_exist)) {
    res <- df[, select_cols[col_exist]]
    res[, select_cols[!col_exist]] <- NA
  } else {
    res <- df[, select_cols]
  }
  res$Project <- project
  res %>% select(Project, select_cols) -> res
  TCGA_Clinical <- bind_rows(TCGA_Clinical, res)
}

rm(res, df, i, clinical, project, col_exist) # remove temp variables

View the clinical data, check variables.

DT::datatable(TCGA_Clinical,
  options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE
)

All unit are same in days, remove two column with redundant information.

TCGA_Clinical <- TCGA_Clinical %>% select(-c(OS.unit, RFS.unit))

Continue to tidy clinical datasets: rename variables, filter 14 samples with unusual sample types, rename variable value and . After cleanning, save the result object TCGA_Clinical.tidy for following analysis.

# create a new tidy dataframe
TCGA_Clinical.tidy <- TCGA_Clinical %>%
  rename(
    Age = age_at_initial_pathologic_diagnosis, Gender = gender,
    Smoking_history = tobacco_smoking_history,
    Smoking_indicator = tobacco_smoking_history_indicator,
    Tumor_Sample_Barcode = sampleID
  ) %>%
  filter(sample_type %in% c(
    "Solid Tissue Normal", "Primary Tumor", "Metastatic", "Recurrent Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood"
  )) %>% # Additional - New Primary, Additional Metastatic, FFPE Scrolls total 14 sample removed
  mutate(Gender = case_when(
    Gender == "FEMALE" ~ "Female",
    Gender == "MALE" ~ "Male",
    TRUE ~ NA_character_
  ), Tumor_stage = case_when(
    pathologic_stage == "Stage 0" ~ "0",
    pathologic_stage %in% c("Stage I", "Stage IA", "Stage IB") ~ "I",
    pathologic_stage %in% c("Stage II", "Stage IIA", "Stage IIB", "Stage IIC") ~ "II",
    pathologic_stage %in% c("Stage IIIA", "Stage IIIB", "Stage IIIC") ~ "III",
    pathologic_stage %in% c("Stage IV", "Stage IVA", "Stage IVB", "Stage IVC") ~ "IV",
    pathologic_stage == "Stage X" ~ "X",
    TRUE ~ NA_character_
  )) %>%
  mutate(
    Gender = factor(Gender, levels = c("Male", "Female")),
    Tumor_stage = factor(Tumor_stage, levels = c("0", "I", "II", "III", "IV", "X"))
  )

if (!file.exists("results/TCGA_tidy_Clinical.RData")) {
  dir.create("results", showWarnings = FALSE)
  save(TCGA_Clinical.tidy, file = "results/TCGA_tidy_Clinical.RData")
}

Clean RNASeq datasets

Now, we read TCGA pan-cancer RNASeq datasets and clean them.

dir(paste0(TCGA_DIR, "/RNAseq_Pancan")) -> RNAseq_filelist.Pancan
RNAseq_filelist.Pancan[rowSums(sapply(paste0("TCGA.", project_code, ".sampleMap"), function(x) grepl(x, RNAseq_filelist.Pancan))) > 0] -> RNAseq_filelist.Pancan2

RNASeqDIR.pancan <- paste0(TCGA_DIR, "/RNAseq_Pancan")
RNASeq_List.pancan <- XenaPrepare(paste0(RNASeqDIR.pancan, "/", RNAseq_filelist.Pancan2))
names(RNASeq_List.pancan)
sapply(RNASeq_List.pancan, function(x) nrow(x))

names(RNASeq_List.pancan) <- sub("TCGA\\.(.*)\\.sampleMap.*", "\\1", names(RNASeq_List.pancan))

RNASeq_pancan <- purrr::reduce(RNASeq_List.pancan, full_join)
if (!file.exists("results/TCGA_RNASeq_PanCancer.RData")) {
  save(RNASeq_pancan, file = "results/TCGA_RNASeq_PanCancer.RData")
}


# class(RNASeq_pancan)
rm(RNASeq_List.pancan)

The result object RNASeq_pancan is very huge, it needs about 1.5GB space to store. We will not view it as a table in html page.

Selection of APM genes and marker genes of immune cell type

Marker genes for immune cell types were obtained from Senbabaoglu, Y. et al, selection of genes involved in processing and presentation of antigen (APM) was also inspired by this paper and finally 18 genes were selected based on literature review of APM and some facts that mutations or deletions affecting genes encoding APM (HLA, beta2-microglobulin, TAP1/2 etc.) may lead to reduced presentation of neoantigens by a cancer cell.

APM_genes <- read_csv("../data/APM.csv", skip = 1)

immune_cellType <- read_csv("../data/Immune_Cell_type_List.csv", skip = 1)
immune_cellType <- immune_cellType %>% filter(inRNAseq == "YES")

# merge two gene list
merged_geneList <- bind_rows(
  immune_cellType,
  tibble(
    Cell_type = "APM", Symbol = APM_genes$Gene_Name,
    Name = APM_genes$Protein_Name, inRNAseq = "YES"
  )
)

# save data
if (!file.exists("results/merged_geneList.RData")) {
  save(merged_geneList, file = "results/merged_geneList.RData")
}

View gene list.

DT::datatable(merged_geneList,
  options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE
)

Remove all variables.

rm(list = ls())

Calculation of APS, TIS and IIS

The GSVA package allows one to perform a change in coordinate systems of molecular measurements, transforming the data from a gene by sample matrix to a gene-set by sample matrix, thereby allowing the evaluation of pathway enrichment for each sample. This new matrix of GSVA enrichment scores facilitates applying standard analytical methods like functional enrichment, survival analysis, clustering, CNV-pathway analysis or cross-tissue pathway analysis, in a pathway-centric manner.

APM score (APS) and score of each immune cell type are directly calculated from GSVA method. Aggregate TIS (T cell infiltration score) and IIS (Immune cell infiltration score) are calculated using following method, their effectiveness have been validated by many studies.

The TIS was defined as the mean of the standardized values for the following T cell subsets: CD8 T, T helper, T, T central and effector memory, Th1, Th2, Th17, and Treg cells. The immune infiltration score (IIS) for a sample was defined as the mean of standardized values for macrophages, DC subsets (total, plasmacytoid, immature, activated), B cells, cytotoxic cells, eosinophils, mast cells, neutrophils, NK cell subsets (total, CD56 bright, CD56 dim), and all T cell subsets (CD8 T, T helper, T central and effector memory, Th1, Th2, Th17, and Treg cells). (Senbabaoglu, Y. et al)

Load function used to apply GSVA method.

source("../code/functions.R")

Apply GSVA to TCGA data. This step should run on a machine with big memory.

# load data
load("results/merged_geneList.RData")
load("results/TCGA_RNASeq_PanCancer.RData")


applyGSVA(merged_geneList, group_col = "Cell_type", gene_col = "Symbol", ExprMatList = list(RNASeq_pancan), method = "gsva") -> res_pancan.GSVA

save(res_pancan.GSVA, file = "results/res_pancan.GSVA.RData")

Calculate TIS and IIS.

if (!exists("res_pancan.GSVA")) {
  load("results/res_pancan.GSVA.RData")
}

res_pancan.GSVA <- res_pancan.GSVA[[1]]

calc_TisIIs <- function(df) {
  df %>%
    mutate(
      TIS = (`CD8 T cells` + `T helper cells` + `T cells` + `Tcm cells` + `Tem cells` + `Th1 cells` + `Th2 cells` + `Th17 cells` + `Treg cells`) / 9,
      IIS = (`CD8 T cells` + `T helper cells` + `T cells` + `Tcm cells` + `Tem cells` + `Th1 cells` + `Th2 cells` + `Th17 cells` + `Treg cells` + aDC + `B cells` + `Cytotoxic cells` + DC + Eosinophils + iDC + Macrophages + `Mast cells` + Neutrophils + `NK CD56bright cells` + `NK CD56dim cells` + `NK cells` + pDC) / 22
    ) -> df
}

# tidy res_pancan.GSVA
res_pancan.GSVA <- rownames_to_column(res_pancan.GSVA, var = "tsb")

# calculate TIS and IIS
gsva.pac <- calc_TisIIs(res_pancan.GSVA)

View this table (head 100 rows).

DT::datatable(head(gsva.pac, 100),
  options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE
)

save GSVA scores of APM and immune cell types, TIS and IIS.

if (!file.exists("results/gsva_tcga_pancan.RData")) {
  save(gsva.pac, file = "results/gsva_tcga_pancan.RData")
}

rm(list = ls())

Calculation of APS7

Here we update a new analysis about calculating APS based on 7 APM gene signature from Senbabaoglu, Y. et al. We name this score as ‘APS7’.

source("../code/functions.R")
load("results/merged_geneList.RData")
APS7.genes <- c(
  "B2M", "HLA-A", "HLA-B", "HLA-C",
  "TAP1", "TAP2", "TAPBP"
)

APS7 <- merged_geneList %>%
  dplyr::filter(Symbol %in% APS7.genes)

load("results/TCGA_RNASeq_PanCancer.RData")

res_APS7.GSVA <- applyGSVA(APS7, group_col = "Cell_type", gene_col = "Symbol", ExprMatList = list(RNASeq_pancan), method = "gsva")

res_APS7.GSVA <- res_APS7.GSVA[[1]]

save(res_APS7.GSVA, file = "results/res_APS7.GSVA.RData")

TIMER data download and clean

TIMER is a method that can accurately resolve relative fractions of diverse cell types based on gene expression profiles from complex tissues. We downloaded TIMER result for TCGA samples from Li, B. et al..

Load TIMER result and clean it for use.

timer <- read_tsv("../data/TCGA_sixcell_population_TIMER.txt")
timer$sample %>% length()
#> [1] 10009
timer$sample %>%
  unique() %>%
  length()
#> [1] 9990
timer$sample %>%
  substr(start = 1, stop = 12) %>%
  unique() %>%
  length()
#> [1] 9262

# keep only tumor samples
timer %>%
  filter(as.numeric(substr(sample, 14, 15)) %in% seq(1, 9)) %>%
  mutate(sample = substr(sample, 1, 15)) -> timer_clean
timer_clean %>%
  arrange(desc(T_cell.CD8)) %>%
  distinct(sample, .keep_all = TRUE) -> timer_clean

# save
if (!file.exists("results/timer.RData")) {
  save(timer_clean, file = "results/timer.RData")
}

rm(list = ls())

TCGA mutation download and clean

We downloaded TCGA somatic mutations (data source: TCGA MC3) via TCGAmutations package.

TCGAmutations is an R data package containing somatic mutations from TCGA cohorts. This is particularly useful for those working with mutation data from TCGA studies - where most of the time is spent on searching various databases, downloading, compiling and tidying up the data before even the actual analysis is started. This package tries to mitigate the issue by providing pre-compiled, curated somatic mutations from 33 TCGA cohorts along with relevant clinical information for all sequenced samples.

Load mutations of TCGA studies and merge them into one.

## install package please use following command
# devtools::install_github(repo = "PoisonAlien/TCGAmutations")

# load package
require(TCGAmutations)


study_list <- tcga_available()$Study_Abbreviation[-34]
cohorts <- system.file("extdata", "cohorts.txt", package = "TCGAmutations")
cohorts <- data.table::fread(input = cohorts)

# calculate TMB
lapply(study_list, function(study) {
  require(maftools)
  TCGAmutations::tcga_load(study)
  maf <- eval(as.symbol(tolower(paste0("TCGA_", study, "_mc3"))))
  maf.silent <- maf@maf.silent
  sample.silent <- maf.silent[, .N, .(Tumor_Sample_Barcode)]
  sample.nonsilent <- getSampleSummary(maf)
  res <- dplyr::full_join(sample.silent, sample.nonsilent, by = "Tumor_Sample_Barcode")
  res <- res %>%
    dplyr::mutate(
      TMB_Total = ifelse(!is.na(N), N + total, total),
      TMB_NonsynSNP = Missense_Mutation + Nonsense_Mutation,
      TMB_NonsynVariants = total
    ) %>%
    dplyr::select(TMB_Total:TMB_NonsynVariants, Tumor_Sample_Barcode)
  res
}) -> tcga_tmb
names(tcga_tmb) <- study_list

# 33 study available, merge them
TCGA_TMB <- purrr::reduce(tcga_tmb, bind_rows)

if (!file.exists("results/TCGA_TMB.RData")) {
  save(TCGA_TMB, file = "results/TCGA_TMB.RData")
}

# rm(list = grep("tcga_*", ls(), value = TRUE))
rm(list = ls())

GEO datasets download and processing

We also tried our best to collect gene expression data to extend APM score (APS), IIS etc. calculation for more cancer types by searching NCBI GEO databases. We only focus on looking for cancer types which showed in list of our immunotherapy clinical studies and not showed in TCGA studies. Finally, we found 5 GEO datasets which can extract gene expression data of Merkel Cell Carcinoma, Small Cell Lung Cancer and Cutaneous Squamous Carcinoma. Download of these datasets need R package GEOquery is installed. Following code showed how we downloaded and carefully processed them.

#----------------------------------------------------------------
# Purpose:
# Add APM data from GEO for some tumor types
# Include Merkel Cell Carcinoma and Small Cell Lung Cancer
#         and Cutaneous Squamous Carcinoma
#----------------------------------------------------------------
library(GEOquery)
library(tidyverse)

if (!dir.exists("../data/GEOdata")) {
  dir.create("../data/GEOdata")
}

geo_dir <- "../data/GEOdata"

GSE_39612 <- getGEO("GSE39612", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)
GSE_22396 <- getGEO("GSE22396", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)
GSE_36150 <- getGEO("GSE36150", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)
GSE_50451 <- getGEO("GSE50451", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)
GSE_99316 <- getGEO("GSE99316", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)

exprs(GSE_22396$GSE22396_series_matrix.txt.gz) %>% nrow()
pData(GSE_22396$GSE22396_series_matrix.txt.gz)
fData(GSE_22396$GSE22396_series_matrix.txt.gz) %>% nrow()

#---------------------------------------------------------
# process GSE_39612
gset1 <- GSE_39612$GSE39612_series_matrix.txt.gz
View(pData(gset1))

gset1_mcc <- gset1[, grep("MCC", gset1$title)]
gset1_scc <- gset1[, grep("SCC", gset1$title)]

rm(gset1)
# process GSE_22396
gset2 <- GSE_22396$GSE22396_series_matrix.txt.gz
View(pData(gset2))

# process GSE_36150
gset3 <- GSE_36150$GSE36150_series_matrix.txt.gz
View(pData(gset3))

# process GSE_50451
gset4_1 <- GSE_50451$`GSE50451-GPL570_series_matrix.txt.gz`
gset4_2 <- GSE_50451$`GSE50451-GPL571_series_matrix.txt.gz`

View(pData(gset4_1))
View(pData(gset4_2))
# gset4_1 all cell_lines, not use it
#
gset4_mcc <- gset4_2[, grep("MCC tumor", gset4_2$source_name_ch1)]
gset4_sclc <- gset4_2[, grep("SCLC tumor", gset4_2$source_name_ch1)]

rm(gset4_1, gset4_2)
# process GSE_99316
gset5_1 <- GSE_99316$`GSE99316-GPL10999_series_matrix.txt.gz`
gset5_2 <- GSE_99316$`GSE99316-GPL570_series_matrix.txt.gz`
gset5_3 <- GSE_99316$`GSE99316-GPL96_series_matrix.txt.gz`
gset5_4 <- GSE_99316$`GSE99316-GPL97_series_matrix.txt.gz`

View(pData(gset5_1))
View(pData(gset5_2))
View(pData(gset5_3))
View(pData(gset5_4))

# only gset5_2 has data we need
gset5_sclc <- gset5_2[, grep("SCLC", gset5_2$title)]

rm(gset5_1, gset5_2, gset5_3, gset5_4)

#-------------------------------------------------
# Now summary data
merkel_data <- list(gset1_mcc, gset2, gset3, gset4_mcc)
scc_data <- list(gset1_scc)
sclc_data <- list(gset4_sclc, gset5_sclc)

#--------------------
# apply GSVA method
load("results/merged_geneList.RData")

# transform exprset to tibble
genTibbleList <- function(gsetList) {
  stopifnot(is.list(gsetList), require("Biobase"), require("tidyverse"), class(gsetList[[1]]) == "ExpressionSet")

  res <- list()
  i <- 1
  for (gset in gsetList) {
    eset <- exprs(gset)
    # find gene symbol column
    fdata <- fData(gset)
    symbol_col <- grep("^gene.?symbol", colnames(fdata), value = TRUE, ignore.case = TRUE)

    if (length(symbol_col) == 0) {
      message("Find nothing about gene symbol in fData, try search it...")
      symbol_col2 <- grep("^gene_assignment", colnames(fdata), value = TRUE, ignore.case = TRUE)
      message("find ", symbol_col2)

      message("processing...")
      strlist <- strsplit(fdata[, symbol_col2], split = " // ")
      rowname <- sapply(strlist, function(x) trimws(x[2]))
      rownames(eset) <- rowname

      # stop("Something wrong with your fData of input List, please check it")
    }
    if (length(symbol_col) > 1) {
      warning("Multiple columns of fData match gene symbol, only use the first one")
      symbol_col <- symbol_col[1]
      rownames(eset) <- fdata[, symbol_col]
    } else if (length(symbol_col) == 1){
      rownames(eset) <- fdata[, symbol_col]
    }

    # remove duplicate rows, keep the one with biggest mean value
    eset %>%
      as.data.frame() %>%
      rownames_to_column() %>%
      mutate(
        mean_expr = rowMeans(.[, -1], na.rm = TRUE),
        rowname = sub("^(\\w+)\\..*", "\\1", rowname)
      ) %>%
      arrange(rowname, desc(mean_expr)) %>%
      distinct(rowname, .keep_all = TRUE) %>%
      select(-mean_expr) %>%
      as.tibble() -> res[[i]]

    i <- i + 1
  }
  return(res)
}

# apply GSVA method
applyGSVA <- function(group_df, group_col, gene_col, ExprMatList,
                      method = c("ssgsea", "gsva", "zscore", "plage"),
                      kcdf = c("Gaussian", "Poisson")) {
  stopifnot(inherits(group_df, "tbl_df") &
    inherits(group_col, "character") &
    inherits(gene_col, "character") &
    inherits(ExprMatList, "list"))
  if (!require(GSVA)) {
    stop("GSVA package need to be installed!")
  }

  method <- match.arg(method)
  kcdf <- match.arg(kcdf)

  require(dplyr)

  i <- 1
  resList <- list()
  groups <- names(table(group_df[, group_col]))
  gset_list <- lapply(groups, function(x) {
    group_df[group_df[, group_col] == x, gene_col] %>%
      unlist() %>%
      as.character()
  })

  names(gset_list) <- groups

  for (expr_mat in ExprMatList) {
    if (!inherits(expr_mat, "tbl_df")) {
      stop("element of ExprMatList should be tibble!")
    }
    expr_mat <- as.data.frame(expr_mat)
    rownames(expr_mat) <- expr_mat[, 1]
    expr_mat <- expr_mat[, -1] %>% as.matrix()

    res <- gsva(expr = expr_mat, gset.idx.list = gset_list, method = method, kcdf = kcdf)
    res <- as.data.frame(t(res))
    colnames(res)[1] <- "tsb"
    resList[[i]] <- res
    names(resList)[i] <- names(ExprMatList)[i]
    i <- i + 1
  }
  return(resList)
}


## scc
tibble.scc <- genTibbleList(scc_data)
gsva.scc <- applyGSVA(merged_geneList,
  group_col = "Cell_type",
  gene_col = "Symbol", ExprMatList = tibble.scc, method = "gsva"
)
## sclc
tibble.sclc <- genTibbleList(sclc_data)
gsva.sclc <- applyGSVA(merged_geneList,
  group_col = "Cell_type",
  gene_col = "Symbol", ExprMatList = tibble.sclc, method = "gsva"
)

## merkel1
tibble.merkel <- genTibbleList(merkel_data)

gsva.merkel <- applyGSVA(merged_geneList,
  group_col = "Cell_type",
  gene_col = "Symbol", ExprMatList = tibble.merkel, method = "gsva"
)

save(gsva.scc, gsva.sclc, gsva.merkel, file = "results/Add_gsva_scc_sclc_merkel.RData")

Immunotherapy clinical studies and genomics datasets

Immunotherapy clinical studies

Collecting response rate of immunotherapy clinical studies was inspired by the paper Tumor Mutational Burden and Response Rate to PD-1 inhibition which published on journal NEJM, this paper showed data from about 50 stuides. However, detail values of response rate the authors collected were not published. We followed their search strategy and tried our best to find all clinical studies which recorded response rate. Totally, we reviewed abstract of over 100 clinical studies, collected response rate values, then carefully filtered them based on standard we set, selected the most respresenting data (from about 60 studies) for downstream analysis. More detail of this procedure please see Method section of our manuscript. The data we collected for analysi are double checked and open to readers as a supplementary table.

Immunotherapy genomics datasets

To evaluate the predictive power of TIGS in ICI clinical response prediction, we searched PubMed for ICI clinical studies with available individual patient’s TMB and gene transcriptome information. In total three datasets are identified after this search. Van Allen et al 2015 dataset was downloaded from supplementary files of reference. This dataset studied CTLA-4 blockade in metastatic melanoma, and “clinical benefit” was defined using a composite end point of complete response or partial response to CTLA-4 blockade by RECIST criteria or stable disease by RECIST criteria with overall survival greater than 1 year, “no clinical benefit” was defined as progressive disease by RECIST criteria or stable disease with overall survival less than 1 year. Hugo et al 2016 dataset was downloaded from supplementary files of reference. This dataset studied Anti-PD-1therapy in metastatic melanoma, responding tumors were derived from patients who have complete or partial responses or stable disease in response to anti-PD-1 therapy, non-responding tumors were derived from patients who had progressive disease. Snyder et al dataset was downloaded from https://github.com/hammerlab/multi-omic-urothelial-anti-pdl1. This dataset studied PD-L1 blockade in urothelial cancer, and durable clinical benefit was defined as progression-free survival >6 months. RNA-Seq data was used to calculate the APS for each patient. Only patients with both APS and TMB value were used to calculate the TIGS. Median of TMB or TIGS was used as the threshold to separate TMB High/Low group or TIGS High/Low group in Kaplan-Meier overall survival curve analysis.

Detail of how to process these 3 datasets will be described at analysis part.

TCGA Pan-cancer analyses

This part will clearly describe how to analyze TCGA Pan-cancer data. Raw data used for TCGA pancan analyses have been preprocessed, detail please read preprocessing part of this analysis report.

Clean and combine data

Although data have been preprocessed in preprocessing part, they are still necessary to do some cleaning before really analyzing them according to our purpose.

library(tidyverse)

load("results/gsva_tcga_pancan.RData")
load("results/TCGA_tidy_Clinical.RData")

df.gsva <- full_join(TCGA_Clinical.tidy, gsva.pac, by = c("Tumor_Sample_Barcode" = "tsb"))

Only keep tumor samples, and filter sample which sample type is “0” or “X”. Number of samples with these two sample type are very few.

df.gsva.tumor <- df.gsva %>%
  filter(sample_type == "Primary Tumor", !Tumor_stage %in% c("0", "X")) %>%
  mutate(Tumor_stage = factor(Tumor_stage, levels = c("I", "II", "III", "IV")))

Totally, 9095 tumor samples have APS value.

Strong association between APS and immune cell infiltration level

We know that genes used for APS calculation does not overlap with genes of immune cell type, but we don’t know if there are association between APS and them. Besides, we also don’t know if there are association between APS and two aggregate scores: TIS and IIS.

UPDATE: Also, we also want to compare them with APS7 and PHBR scores from MHC-I Genotype Restricts the Oncogenic Mutational Landscape and Evolutionary Pressure against MHC Class II Binding Cancer Mutations. Therefore, we integrate them all before analyzing.

load("results/res_APS7.GSVA.RData")

res_APS7.GSVA <- res_APS7.GSVA %>%
  rownames_to_column(var = "Tumor_Sample_Barcode") %>%
  rename(APS7 = APM)

load(file = "../data/TCGA.PHBR_I.score.RData")
load(file = "../data/TCGA.PHBR_II.score.RData")

PHBR_I.score <- dplyr::tibble(
  Tumor_Sample_Barcode = paste0(names(PHBR_I.score), "-01"),
  PHBR_I = as.numeric(PHBR_I.score)
)

PHBR_II.score <- dplyr::tibble(
  Tumor_Sample_Barcode = paste0(names(PHBR_II.score), "-01"),
  PHBR_II = as.numeric(PHBR_II.score)
)

Combine them.

df.gsva.tumor <- purrr::reduce(
  list(df.gsva.tumor, res_APS7.GSVA, PHBR_I.score, PHBR_II.score),
  left_join
)
#> Joining, by = "Tumor_Sample_Barcode"
#> Joining, by = "Tumor_Sample_Barcode"
#> Joining, by = "Tumor_Sample_Barcode"

Association between GSVA scores and other presentation scores

Here we explore association across APS, APS7, IIS, PHBR_I, PHBR_II.

score_df <- df.gsva.tumor %>%
  select(Project, APM, APS7, IIS, PHBR_I, PHBR_II)

First we check pan-cancer level association.

cor_mat1 <- cor(score_df %>% select(-Project), method = "spearman", use = "pairwise.complete.obs")

Show heatmap.

breaksList <- seq(-1, 1, by = 0.01)
colnames(cor_mat1)[colnames(cor_mat1) == "APM"] <- "APS"
rownames(cor_mat1)[rownames(cor_mat1) == "APM"] <- "APS"

pheatmap::pheatmap(cor_mat1,
  display_numbers = TRUE,
  color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
  breaks = breaksList
)

Second we check cancer type specific level association.

cor_mat2 <- score_df %>%
  group_by(Project) %>%
  nest() %>%
  slice(1) %>%
  mutate(cor = map(data, cor, method = "spearman", use = "pairwise.complete.obs"))

Here we get 32 correlation matrixes.

cor_mat2
#> # A tibble: 32 x 3
#> # Groups:   Project [32]
#>    Project           data cor              
#>    <chr>   <list<df[,5]>> <list>           
#>  1 ACC           [92 × 5] <dbl[,5] [5 × 5]>
#>  2 BLCA         [412 × 5] <dbl[,5] [5 × 5]>
#>  3 BRCA       [1,087 × 5] <dbl[,5] [5 × 5]>
#>  4 CESC         [308 × 5] <dbl[,5] [5 × 5]>
#>  5 CHOL          [36 × 5] <dbl[,5] [5 × 5]>
#>  6 COAD         [462 × 5] <dbl[,5] [5 × 5]>
#>  7 DLBC          [48 × 5] <dbl[,5] [5 × 5]>
#>  8 ESCA         [185 × 5] <dbl[,5] [5 × 5]>
#>  9 GBM          [602 × 5] <dbl[,5] [5 × 5]>
#> 10 HNSC         [528 × 5] <dbl[,5] [5 × 5]>
#> # … with 22 more rows

Output the result pheatmaps to PDF is a good option to observe results.

# old.par <- par(mfrow=c(1, 2))
pdf("Score_correlation_matrix.pdf")

for (i in seq_len(nrow(cor_mat2))) {
  message("Processing ", cor_mat2$Project[i])
  colnames(cor_mat2$cor[[i]])[colnames(cor_mat2$cor[[i]]) == "APM"] <- "APS"
  rownames(cor_mat2$cor[[i]])[rownames(cor_mat2$cor[[i]]) == "APM"] <- "APS"
  print(pheatmap::pheatmap(cor_mat2$cor[[i]],
    display_numbers = TRUE,
    cluster_rows = FALSE, cluster_cols = FALSE,
    main = cor_mat2$Project[i],
    color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
    breaks = breaksList
  ))
}
dev.off()
# par(old.par)

Association between subsets of immune cells and APS

Next we answer the question: which cells (CIBERSORT scores) or subsets of the IIS scores are most associated with the APS scores.

The first step is to calculate correlation using spearman method.

df.gsva.heat <- df.gsva.tumor %>%
  select(-c(Tumor_Sample_Barcode:Tumor_stage, APS7, IIS, TIS, PHBR_I, PHBR_II)) %>%
  filter(!is.na(APM)) %>%
  select(Project, APM, everything())
heat_mat <- sapply(unique(df.gsva.heat$Project), function(x) {
  mat <- filter(df.gsva.heat, Project == x)
  mat <- mat[, -1]
  cor_mat <- cor(mat, method = "spearman", use = "pairwise.complete.obs")
  cor_mat[, 1]
})
heat_mat <- heat_mat[-1, ]

Show plot.

library(pheatmap)
breaksList <- seq(-1, 1, by = 0.01)

pheatmap(heat_mat,
  color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
  breaks = breaksList
)

This shows that there is strong positive/consistent relationship between APM and immune/T cell infiltration level across TCGA cancer types. These spearman correlation coefficient are showed as a table.

DT::datatable(heat_mat,
  options = list(scrollX = TRUE, keys = TRUE), rownames = TRUE
)

Thanks to TCGA, we can obtain CIBERSORT results for TCGA patients from paper The Immune Landscape of Cancer.

cibersort <- data.table::fread("../data/TCGA.Kallisto.fullIDs.cibersort.relative.tsv.gz")

Combine APM (score) and CIBERSORT information, calculate correlation and plot heatmap.

cibersort_df <- left_join(
  df.gsva.tumor %>%
    select(Tumor_Sample_Barcode, Project, APM) %>%
    filter(!is.na(APM)),
  cibersort %>%
    select(-CancerType) %>%
    rename(Tumor_Sample_Barcode = SampleID) %>%
    mutate(Tumor_Sample_Barcode = substr(Tumor_Sample_Barcode, 1, 15)) %>%
    mutate(Tumor_Sample_Barcode = gsub("\\.", "-", Tumor_Sample_Barcode))
) %>% select(-c(Tumor_Sample_Barcode, P.value, Correlation, RMSE))
#> Joining, by = "Tumor_Sample_Barcode"

heat_mat2 <- sapply(unique(cibersort_df$Project), function(x) {
  mat <- filter(cibersort_df, Project == x)
  mat <- mat[, -1]
  cor_mat <- cor(mat, method = "spearman", use = "pairwise.complete.obs")
  cor_mat[, 1]
})
heat_mat2 <- heat_mat2[-1, ]

pheatmap(heat_mat2,
  color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
  breaks = breaksList
)

Indeed, APM score correlate with T cell infiltration.

However, we see PHBR_I, PHBR_II does not have relationship with either APS or IIS, does this score correlate with immune cell infiltration?

Let’s check it.

cibersort_df <- left_join(
  df.gsva.tumor %>%
    select(Tumor_Sample_Barcode, Project, PHBR_I) %>%
    filter(!is.na(PHBR_I)),
  cibersort %>%
    select(-CancerType) %>%
    rename(Tumor_Sample_Barcode = SampleID) %>%
    mutate(Tumor_Sample_Barcode = substr(Tumor_Sample_Barcode, 1, 15)) %>%
    mutate(Tumor_Sample_Barcode = gsub("\\.", "-", Tumor_Sample_Barcode))
) %>% select(-c(Tumor_Sample_Barcode, P.value, Correlation, RMSE))
#> Joining, by = "Tumor_Sample_Barcode"

heat_mat2 <- sapply(unique(cibersort_df$Project), function(x) {
  mat <- filter(cibersort_df, Project == x)
  mat <- mat[, -1]
  cor_mat <- cor(mat, method = "spearman", use = "pairwise.complete.obs")
  cor_mat[, 1]
})
heat_mat2 <- heat_mat2[-1, ]

pheatmap(heat_mat2,
  color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
  breaks = breaksList
)

Do the same analysis for PHBR_II score.

cibersort_df <- left_join(
  df.gsva.tumor %>%
    select(Tumor_Sample_Barcode, Project, PHBR_II) %>%
    filter(!is.na(PHBR_II)),
  cibersort %>%
    select(-CancerType) %>%
    rename(Tumor_Sample_Barcode = SampleID) %>%
    mutate(Tumor_Sample_Barcode = substr(Tumor_Sample_Barcode, 1, 15)) %>%
    mutate(Tumor_Sample_Barcode = gsub("\\.", "-", Tumor_Sample_Barcode))
) %>% select(-c(Tumor_Sample_Barcode, P.value, Correlation, RMSE))
#> Joining, by = "Tumor_Sample_Barcode"

heat_mat2 <- sapply(unique(cibersort_df$Project), function(x) {
  mat <- filter(cibersort_df, Project == x)
  mat <- mat[, -1]
  cor_mat <- cor(mat, method = "spearman", use = "pairwise.complete.obs")
  cor_mat[, 1]
})
heat_mat2 <- heat_mat2[-1, ]

pheatmap(heat_mat2,
  color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
  breaks = breaksList
)

IIS is a good representation of immune cell infiltration

Immune/T cell infiltration is a good indicator for ICB response. Here we have seen that APS have strong association with both TIS and IIS. In indivial level, i.e. 9095 samples, we also observe that TIS and IIS are basically same because their correlation coeficient is about 0.91!

df.gsva.tumor %>%
  filter(!is.na(IIS)) %>%
  summarise(corr = cor(TIS, IIS, method = "spearman"))
#> # A tibble: 1 x 1
#>    corr
#>   <dbl>
#> 1 0.909

Therefore, we should choose one of them for downstream analysis. Finally, we choose IIS not only because it is more comprehensive, but also it is well validated.

In vitro validation with multiplex immunofluorescence, in silico validation using simulated mixing proportions and comparison between CIBERSORT and IIS have been previously described (Senbabaoglu, Y. et al).

TIMER is another method that can accurately resolve relative fractions of diverse cell types based on gene expression profiles from complex tissues. To further validate the calculated IIS, we perform TIMER analysis and find that the result of TIMER is highly correlated with the calculated IIS.

load("results/timer.RData")

df <- dplyr::select(df.gsva.tumor, Project, APM, IIS, APS7, PHBR_I, PHBR_II, Gender, Age, Tumor_stage, OS.time, OS, Tumor_Sample_Barcode)

df_timer <- dplyr::left_join(x = df, y = timer_clean, by = c("Tumor_Sample_Barcode" = "sample"))

library(corrplot)

mat_timer_IIS <- df_timer %>%
  select(-c(APM, APS7, PHBR_I, PHBR_II, Gender:Tumor_Sample_Barcode)) %>%
  filter(!is.na(IIS) & !is.na(T_cell.CD8)) %>%
  filter(T_cell.CD8 != 0)
mat_timer_IIS.heat <- sapply(unique(mat_timer_IIS$Project), function(x) {
  mat <- filter(mat_timer_IIS, Project == x)
  mat <- mat[, -1]
  cor_mat <- cor(mat, method = "spearman", use = "pairwise.complete.obs")
  cor_mat[, 1]
})

mat_timer_IIS.heat <- mat_timer_IIS.heat[-1, -22]

p.mat <- sapply(unique(mat_timer_IIS$Project), function(x) {
  mat <- filter(mat_timer_IIS, Project == x)
  mat <- mat[, -1]
  tryCatch(expr = {
    p_mat <- cor.mtest(mat, method = "spearman", conf.level = .95)
    p_mat[[1]][, 1]
  }, error = function(e) {
    rep(NA, 7)
  })
})

p.mat <- p.mat[-1, -22]
col <- colorRampPalette(c("blue", "white", "red"))(200)

corrplot(mat_timer_IIS.heat,
  method = "color", tl.col = "black", tl.srt = 45,
  col = col, p.mat = p.mat, sig.level = 0.05
)

The color bar in figure shows correlation coefficient values. The “X” marks relationship does not pass significant test (i.e. p>=0.05).

Even Senbabaoglu et al have checked the relationship between IIS and CIBERSORT in clear cell renal cell carcinoma. Here we expand it to all TCGA cancer type with CIBERSORT data and IIS avaiable.

df_cibersort <- left_join(
  df.gsva.tumor %>%
    select(Tumor_Sample_Barcode, Project, IIS) %>%
    filter(!is.na(IIS)),
  cibersort %>%
    select(-CancerType) %>%
    rename(Tumor_Sample_Barcode = SampleID) %>%
    mutate(Tumor_Sample_Barcode = substr(Tumor_Sample_Barcode, 1, 15)) %>%
    mutate(Tumor_Sample_Barcode = gsub("\\.", "-", Tumor_Sample_Barcode))
) %>% select(-c(Tumor_Sample_Barcode, P.value, Correlation, RMSE))
#> Joining, by = "Tumor_Sample_Barcode"


mat_cibersort_IIS.heat <- sapply(unique(df_cibersort$Project), function(x) {
  mat <- filter(df_cibersort, Project == x)
  mat <- mat[, -1]
  cor_mat <- cor(mat, method = "spearman", use = "pairwise.complete.obs")
  cor_mat[, 1]
})

pheatmap(heat_mat2[-1, ],
  color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
  breaks = breaksList
)

Compare with timer, the relationship between IIS and immune cell subsets is more variable. I think this can be explained by several reasons:

CIBERSORT and TIMER target different aspects of tumor immune infiltrates. CIBERSORT infers the relative fractions of immune subsets in the total leukocyte population, while TIMER predicts the abundance of immune cells in the overall tumor microenvironment. Currently both methods are limited by the assumption that transcriptomes of tumor-infiltrating immune cells do not significantly differ from those collected from peripheral blood of healthy donors. (This is from Revisit linear regression-based deconvolution methods for tumor gene expression data)
Many cibersort values are equal to 0 or close to 0, this reduce the power to calculate correlation
CIBERSORT predicts 22 immune cell subsets while TIMER obtains 6 cell types, it is not unexpected to observe a more variable result using CIBERSORT.

Exploration of APS, TMB, TIGS at pan-cancer level

Load TCGA TMB data and merge all necessary datasets.

load("results/TCGA_TMB.RData")

df2 <- df %>%
  mutate(Tumor_Sample_Barcode = substr(Tumor_Sample_Barcode, 1, 12)) %>%
  arrange(APM) %>%
  distinct(Tumor_Sample_Barcode, .keep_all = TRUE)
tcga_all <- full_join(df2, TCGA_TMB, by = "Tumor_Sample_Barcode")

if (!file.exists("results/TCGA_ALL.RData")) {
  save(tcga_all, file = "results/TCGA_ALL.RData")
}

rm(list = ls())

Tumor mutation burden (TMB) is defined as the number of non-synonymous alterations per megabase (Mb) of genome examined. As reported previously, here we use 38 Mb as the estimate of the exome size. For studies reporting mutation number from whole exome sequencing, the normalized TMB = (whole exome non-synonymous mutation)/(38 Mb).

Original APM scores (APS) from GSVA are in the range of -1 to 1. To calculate tumor immunogenicity score (TIGS), original APM score from GSVA implementation is rescaled by minimal and maximal APM score from TCGA Pan-cancer analysis.

We calculate tumor immunogenicity score (TIGS) as following:

\[ TIGS = APS_{normalized} \times log(TMB + 1) \]

Natural logarithm is applied here. Of note, some tumors have TMB level below 1 mutation/ Mb, to avoid minus number in quantifying “tumor antigenicity”, we add number 1 to all normalized TMB.

How we generate this TIGS formula will be described at an individual part. Here we focus on association of APS, TMB and TIGS and their effects.

load("results/TCGA_ALL.RData")

tcga_all <- tcga_all %>%
  mutate(
    nAPM = (APM - min(APM, na.rm = TRUE)) / (max(APM, na.rm = TRUE) - min(APM, na.rm = TRUE)),
    nTMB = TMB_NonsynVariants / 38,
    TIGS = log(nTMB + 1) * nAPM
  ) %>%
  rename(Event = OS, Time = OS.time)


# keep samples with survival information
df_os <- tcga_all %>%
  filter(!is.na(Time), !is.na(Event))


df_os %>%
  filter(!is.na(nAPM)) %>%
  group_by(Project) %>%
  summarise(N = n()) %>%
  arrange(N)
#> # A tibble: 32 x 2
#>    Project     N
#>    <chr>   <int>
#>  1 CHOL       36
#>  2 DLBC       47
#>  3 UCS        56
#>  4 KICH       65
#>  5 ACC        79
#>  6 UVM        80
#>  7 MESO       85
#>  8 READ       92
#>  9 SKCM      102
#> 10 THYM      119
#> # … with 22 more rows

df_os %>%
  filter(!is.na(TIGS)) %>%
  group_by(Project) %>%
  summarise(N = n()) %>%
  arrange(N)
#> # A tibble: 32 x 2
#>    Project     N
#>    <chr>   <int>
#>  1 CHOL       36
#>  2 DLBC       36
#>  3 UCS        56
#>  4 KICH       65
#>  5 ACC        79
#>  6 MESO       80
#>  7 UVM        80
#>  8 READ       89
#>  9 SKCM      102
#> 10 THYM      118
#> # … with 22 more rows

Distribution of APS, TMB and IIS across TCGA studies

Calculate median values to sort distribution.

#------------ TIGS, TMB, APM pancan, sort by value
df_summary <- df_os %>%
  group_by(Project) %>%
  summarise(
    medianAPM = median(APM, na.rm = TRUE),
    medianTMB = median(TMB_NonsynVariants, na.rm = TRUE),
    medianTIGS = median(TIGS, na.rm = TRUE),
    medianAPMn = median(nAPM, na.rm = TRUE),
    medianTMBn = log(median(nTMB, na.rm = TRUE) + 1)
  )

Distribution of APM score (APS) across TCGA studies.

library(scales)
library(ggpubr)

df_os %>%
  filter(!is.na(Project), !is.na(APM)) %>%
  ggboxplot(
    x = "Project", y = "APM", color = "Project", add = "jitter", xlab = "TCGA Projects",
    ylab = "APM Score", add.params = list(size = 0.6),
    legend = "none"
  ) +
  rotate_x_text(angle = 45) +
  geom_hline(yintercept = mean(df_os$APM, na.rm = TRUE), linetype = 2) +
  scale_x_discrete(limits = arrange(df_summary, medianAPM) %>% .$Project) -> p_apm
p_apm

Distribution of TMB across TCGA studies.

df_os %>%
  filter(!is.na(Project), !is.na(TMB_NonsynVariants)) %>%
  ggboxplot(
    x = "Project", y = "TMB_NonsynVariants", color = "Project", add = "jitter", xlab = "TCGA Projects",
    ylab = "No. of Coding Somatic Nonsynonymous Mutation", add.params = list(size = 0.6),
    legend = "none"
  ) +
  rotate_x_text(angle = 45) +
  geom_hline(yintercept = mean(df_os$TMB_NonsynVariants, na.rm = TRUE), linetype = 2) +
  scale_y_log10(breaks = 10^(-1:4), labels = trans_format("log10", math_format(10^.x))) +
  scale_x_discrete(limits = arrange(df_summary, medianTMB) %>% .$Project) -> p_tmb
p_tmb

Distribution of TIGS across TCGA studies.

df_os %>%
  filter(!is.na(Project), !is.na(TIGS)) %>%
  ggboxplot(
    x = "Project", y = "TIGS", color = "Project", add = "jitter", xlab = "TCGA Projects",
    ylab = "TIGS", add.params = list(size = 0.6),
    legend = "none"
  ) +
  rotate_x_text(angle = 45) +
  geom_hline(yintercept = mean(df_os$TIGS, na.rm = TRUE), linetype = 2) +
  scale_x_discrete(limits = arrange(df_summary, medianTIGS) %>% .$Project) -> p_tigs
p_tigs

Correlation between TMB and IIS

df_TMB <- df_os %>%
  filter(!is.na(IIS), !is.na(TMB_NonsynVariants)) %>%
  mutate(logTMB = log(nTMB + 1))

ggstatsplot::ggscatterstats(
  data = df_TMB,
  x = nTMB,
  y = IIS,
  xlab = "No. of Coding Somatic Nonsynonymous Mutation",
  ylab = "IIS Score",
  point.size = 1,
  # title = "Correlation between APM and IIS score in pancancer",
  messages = FALSE, type = "spearman"
)
#> Registered S3 methods overwritten by 'broom.mixed':
#>   method         from 
#>   augment.lme    broom
#>   augment.merMod broom
#>   glance.lme     broom
#>   glance.merMod  broom
#>   glance.stanreg broom
#>   tidy.brmsfit   broom
#>   tidy.gamlss    broom
#>   tidy.lme       broom
#>   tidy.merMod    broom
#>   tidy.rjags     broom
#>   tidy.stanfit   broom
#>   tidy.stanreg   broom
#> Registered S3 methods overwritten by 'survival':
#>   method       from   
#>   nobs.coxph   insight
#>   nobs.survreg insight
#> Registered S3 methods overwritten by 'car':
#>   method                          from
#>   influence.merMod                lme4
#>   cooks.distance.influence.merMod lme4
#>   dfbeta.influence.merMod         lme4
#>   dfbetas.influence.merMod        lme4


ggstatsplot::ggscatterstats(
  data = df_TMB,
  x = logTMB,
  y = IIS,
  xlab = "No. of Coding Somatic Nonsynonymous Mutation (log)",
  ylab = "IIS Score",
  point.size = 1,
  # title = "Correlation between APM and IIS score in pancancer",
  messages = FALSE, type = "spearman"
)


plot_scatter <- function(data, x, y, xlab = "Median APM", ylab = "Median IIS", conf.int = TRUE, method = "spearman", label.x = -0.5, label.y = 0.2, label = "Project", ...) {
  ggscatter(data,
    x = x, y = y,
    xlab = xlab, ylab = ylab,
    shape = 21, size = 3, color = "black",
    add = "reg.line", add.params = list(color = "blue", fill = "lightgray"),
    conf.int = conf.int,
    cor.coef = TRUE,
    cor.coeff.args = list(method = method, label.x = label.x, label.y = label.y, label.sep = "\n"),
    label = label, repel = TRUE, ...
  )
}

df_project <- df_TMB %>%
  group_by(Project) %>%
  summarise(
    TMB_median = median(nTMB, na.rm = TRUE),
    IIS_median = median(IIS, na.rm = TRUE)
  )

mean(df_TMB$nTMB)
#> [1] 4.307116
plot_scatter(df_project,
  x = "TMB_median", y = "IIS_median",
  xlab = "Median TMB", ylab = "Median IIS", label.x = 0.2
) +
  geom_hline(yintercept = 0, linetype = 2) + geom_vline(xintercept = 4.35, linetype = 2)

Significant correlation between APM score and IIS

ggstatsplot::ggscatterstats(
  data = df_os %>% filter(!is.na(APM)),
  x = APM,
  y = IIS,
  xlab = "APM Score",
  ylab = "IIS Score",
  point.size = 1,
  # title = "Correlation between APM and IIS score in pancancer",
  messages = FALSE, type = "spearman"
)


df_project2 <- df_os %>%
  filter(!is.na(APM)) %>%
  group_by(Project) %>%
  summarise(
    APM_median = median(APM),
    IIS_median = median(IIS)
  )

plot_scatter(df_project2,
  x = "APM_median", y = "IIS_median",
  xlab = "Median APM", ylab = "Median IIS"
) +
  geom_hline(yintercept = 0, linetype = 2) + geom_vline(xintercept = 0, linetype = 2)

Survival analysis

Check pan-cancer level influence of APS, TMB and TIGS using unvariable cox model.

library(survival)
fit_APS <- coxph(Surv(Time, Event) ~ nAPM, data = filter(df_os, !is.na(nAPM)))
fit_TMB <- coxph(Surv(Time, Event) ~ log(nTMB), data = filter(df_os, !is.na(nTMB)))
fit_TIGS <- coxph(Surv(Time, Event) ~ TIGS, data = filter(df_os, !is.na(TIGS)))

fit_APS
#> Call:
#> coxph(formula = Surv(Time, Event) ~ nAPM, data = filter(df_os, 
#>     !is.na(nAPM)))
#> 
#>         coef exp(coef) se(coef)     z        p
#> nAPM 0.37300   1.45208  0.07837 4.759 1.94e-06
#> 
#> Likelihood ratio test=22.69  on 1 df, p=1.9e-06
#> n= 8949, number of events= 2589
fit_TMB
#> Call:
#> coxph(formula = Surv(Time, Event) ~ log(nTMB), data = filter(df_os, 
#>     !is.na(nTMB)))
#> 
#>              coef exp(coef) se(coef)     z      p
#> log(nTMB) 0.17413   1.19021  0.01306 13.33 <2e-16
#> 
#> Likelihood ratio test=167.4  on 1 df, p=< 2.2e-16
#> n= 9476, number of events= 2781
fit_TIGS
#> Call:
#> coxph(formula = Surv(Time, Event) ~ TIGS, data = filter(df_os, 
#>     !is.na(TIGS)))
#> 
#>         coef exp(coef) se(coef)     z      p
#> TIGS 0.29631   1.34488  0.02858 10.37 <2e-16
#> 
#> Likelihood ratio test=94.47  on 1 df, p=< 2.2e-16
#> n= 8413, number of events= 2365

dplyr::tibble(
  Variable = c("APS", "TMB", "TIGS"),
  N = c(summary(fit_APS)$n, summary(fit_TMB)$n, summary(fit_TIGS)$n),
  Coef = c(summary(fit_APS)$conf.int[1], summary(fit_TMB)$conf.int[1], summary(fit_TIGS)$conf.int[1]),
  Lower = c(summary(fit_APS)$conf.int[3], summary(fit_TMB)$conf.int[3], summary(fit_TIGS)$conf.int[3]),
  Upper = c(summary(fit_APS)$conf.int[4], summary(fit_TMB)$conf.int[4], summary(fit_TIGS)$conf.int[4]),
  P.value = c(summary(fit_APS)$logtest[3], summary(fit_TMB)$logtest[3], summary(fit_TIGS)$logtest[3])
)
#> # A tibble: 3 x 6
#>   Variable     N  Coef Lower Upper  P.value
#>   <chr>    <int> <dbl> <dbl> <dbl>    <dbl>
#> 1 APS       8949  1.45  1.25  1.69 1.90e- 6
#> 2 TMB       9476  1.19  1.16  1.22 2.78e-38
#> 3 TIGS      8413  1.34  1.27  1.42 2.49e-22

We access survival effects of APS, TMB and TIGS across TCGA studies using unvariable cox model.

Firstly, we implement cox model and then obtain key result values from fit, i.e. p value and corresponding 95% confident interval.

library(survival)
# calculate APM cox model by project
model_APM <- df_os %>%
  filter(!is.na(nAPM)) %>%
  group_by(Project) %>%
  dplyr::do(coxfit = coxph(Surv(time = Time, event = Event) ~ nAPM, data = .)) %>%
  summarise(
    Project = Project,
    Coef = summary(coxfit)$conf.int[1],
    Lower = summary(coxfit)$conf.int[3],
    Upper = summary(coxfit)$conf.int[4],
    Pvalue = summary(coxfit)$logtest[3]
  )

# use nTMB or log(nTMB)
model_TMB <- df_os %>%
  filter(!is.na(nTMB)) %>%
  group_by(Project) %>%
  dplyr::do(coxfit = coxph(Surv(time = Time, event = Event) ~ log(nTMB), data = .)) %>%
  summarise(
    Project = Project,
    Coef = summary(coxfit)$conf.int[1],
    Lower = summary(coxfit)$conf.int[3],
    Upper = summary(coxfit)$conf.int[4],
    Pvalue = summary(coxfit)$logtest[3]
  )

model_TIGS <- df_os %>%
  filter(!is.na(TIGS)) %>%
  group_by(Project) %>%
  dplyr::do(coxfit = coxph(Surv(time = Time, event = Event) ~ TIGS, data = .)) %>%
  summarise(
    Project = Project,
    Coef = summary(coxfit)$conf.int[1],
    Lower = summary(coxfit)$conf.int[3],
    Upper = summary(coxfit)$conf.int[4],
    Pvalue = summary(coxfit)$logtest[3]
  )

N_APM <- df_os %>%
  filter(!is.na(nAPM)) %>%
  group_by(Project) %>%
  summarise(N = n())

N_TMB <- df_os %>%
  filter(!is.na(nTMB)) %>%
  group_by(Project) %>%
  summarise(N = n())

N_TIGS <- df_os %>%
  filter(!is.na(TIGS)) %>%
  group_by(Project) %>%
  summarise(N = n())


cox_APM <- full_join(model_APM, N_APM)
#> Joining, by = "Project"
cox_TMB <- full_join(model_TMB, N_TMB)
#> Joining, by = "Project"
cox_TIGS <- full_join(model_TIGS, N_TIGS)
#> Joining, by = "Project"


save(cox_APM, cox_TMB, cox_TIGS, df_summary, file = "results/unicox.RData")

UPDATE: Adjust p values and order them by median APM/TMB/TIGS.

cox_APM <- cox_APM %>%
  mutate(Pvalue = p.adjust(Pvalue, method = "fdr")) %>%
  arrange(factor(Project, levels = arrange(df_summary, medianAPM) %>% .$Project))

cox_TMB <- cox_TMB %>%
  mutate(Pvalue = p.adjust(Pvalue, method = "fdr")) %>%
  arrange(factor(Project, levels = arrange(df_summary, medianTMB) %>% .$Project))

cox_TIGS <- cox_TIGS %>%
  mutate(Pvalue = p.adjust(Pvalue, method = "fdr")) %>%
  arrange(factor(Project, levels = arrange(df_summary, medianTIGS) %>% .$Project))

Secondly, we generate forest plot.

library(forestplot)
#> Loading required package: grid
#> Loading required package: checkmate

########### Forest plot

#------ APM
options(digits = 2)
forest_APM <- rbind(
  c("Project", NA, NA, NA, "fdr", "No."),
  cox_APM
) %>% as.data.frame()
forest_APM$HR <- c("HR", format(as.numeric(forest_APM$Coef[-1]), digits = 2))
forest_APM$Pvalue <- c("fdr", format(as.numeric(forest_APM$Pvalue[-1]), digits = 2))

forestplot(
  fn.ci_norm = fpDrawCircleCI,
  forest_APM[, c("Project", "N", "HR", "Pvalue")],
  mean = c(NA, log(cox_APM$Coef)), lower = c(NA, log(cox_APM$Lower)), upper = c(NA, log(cox_APM$Upper)),
  is.summary = c(TRUE, rep(FALSE, 32)),
  clip = c(-4, 4), zero = 0,
  col = fpColors(box = "royalblue", line = "black", summary = "royalblue", hrz_lines = "black"),
  vertices = TRUE,
  xticks = c(-4, -2, -1, 0, 1, 2, 4),
  hrzl_lines = list("2" = gpar(lty = 1, col = "black")),
  boxsize = 0.5,
  # graph.pos = 3,
  xlab = "log Hazard Ratio"
)


#pdf("forestplot_APS.pdf", onefile = F)
forestplot(
  fn.ci_norm = fpDrawCircleCI,
  forest_APM[, c("Project", "N", "HR", "Pvalue")],
  mean = c(NA, log(cox_APM$Coef)), lower = c(NA, log(cox_APM$Lower)), upper = c(NA, log(cox_APM$Upper)),
  is.summary = c(TRUE, rep(FALSE, 32)),
  clip = c(-4, 4), zero = 0,
  col = fpColors(box = "royalblue", line = "black", summary = "royalblue", hrz_lines = "black"),
  vertices = TRUE,
  xticks = c(-4, -2, -1, 0, 1, 2, 4),
  hrzl_lines = list("2" = gpar(lty = 1, col = "black")),
  boxsize = 0.5,
  # graph.pos = 3,
  xlab = "log Hazard Ratio"
)

#dev.off()

#----- TMB
forest_TMB <- rbind(
  c("Project", NA, NA, NA, "fdr", "No."),
  cox_TMB
) %>% as.data.frame()
forest_TMB$HR <- c("HR", format(as.numeric(forest_TMB$Coef[-1]), digits = 2))
forest_TMB$Pvalue <- c("fdr", format(as.numeric(forest_TMB$Pvalue[-1]), digits = 2))


#pdf("forestplot_TMB.pdf", onefile = F)
forestplot(
  fn.ci_norm = fpDrawCircleCI,
  forest_TMB[, c("Project", "N", "HR", "Pvalue")],
  mean = c(NA, log(cox_TMB$Coef)), lower = c(NA, log(cox_TMB$Lower)), upper = c(NA, log(cox_TMB$Upper)),
  is.summary = c(TRUE, rep(FALSE, 32)),
  clip = c(-4, 4), zero = 0,
  col = fpColors(box = "royalblue", line = "black", summary = "royalblue", hrz_lines = "black"),
  vertices = TRUE,
  xticks = c(-4, -2, -1, 0, 1, 2, 4),
  hrzl_lines = list("2" = gpar(lty = 1, col = "black")),
  boxsize = 0.5,
  # graph.pos = 3,
  xlab = "log Hazard Ratio"
)

#dev.off()

#---------- TIGS
forest_TIGS <- rbind(
  c("Project", NA, NA, NA, "fdr", "No."),
  cox_TIGS
) %>% as.data.frame()
forest_TIGS$HR <- c("HR", format(as.numeric(forest_TIGS$Coef[-1]), digits = 2))
forest_TIGS$Pvalue <- c("fdr", format(as.numeric(forest_TIGS$Pvalue[-1]), digits = 2))


#pdf("forestplot_TIGS.pdf", onefile = F)
forestplot(
  fn.ci_norm = fpDrawCircleCI,
  forest_TIGS[, c("Project", "N", "HR", "Pvalue")],
  mean = c(NA, log(cox_TIGS$Coef)), lower = c(NA, log(cox_TIGS$Lower)), upper = c(NA, log(cox_TIGS$Upper)),
  is.summary = c(TRUE, rep(FALSE, 32)),
  clip = c(-4, 4), zero = 0,
  col = fpColors(box = "royalblue", line = "black", summary = "royalblue", hrz_lines = "black"),
  vertices = TRUE,
  xticks = c(-4, -2, -1, 0, 1, 2, 4),
  hrzl_lines = list("2" = gpar(lty = 1, col = "black")),
  boxsize = 0.5,
  # graph.pos = 3,
  xlab = "log Hazard Ratio"
)

#dev.off()

APS, TIGS and B2M mutation

Obtain B2M mutation status.

rm(list = ls())

# load package
require(TCGAmutations)


study_list <- tcga_available()$Study_Abbreviation[-34]
cohorts <- system.file("extdata", "cohorts.txt", package = "TCGAmutations")
cohorts <- data.table::fread(input = cohorts)

# calculate TMB
lapply(study_list, function(study) {
  require(maftools)
  TCGAmutations::tcga_load(study)
  maf <- eval(as.symbol(tolower(paste0("TCGA_", study, "_mc3"))))
  maf@data %>%
    group_by(Tumor_Sample_Barcode) %>%
    summarise(B2M = sum(grepl("^B2M$", Hugo_Symbol)))
}) -> tcga_b2m
names(tcga_b2m) <- study_list

# 33 study available, merge them
TCGA_B2M <- purrr::reduce(tcga_b2m, dplyr::bind_rows)

if (!file.exists("results/TCGA_B2M.RData")) {
  save(TCGA_B2M, file = "results/TCGA_B2M.RData")
}

rm(list = grep("tcga_*", ls(), value = TRUE))
rm(list = ls())

rm(list = ls())
load(file = "results/TCGA_B2M.RData")
load(file = "results/TCGA_ALL.RData")

tcga_all <- tcga_all %>%
  mutate(
    nAPM = (APM - min(APM, na.rm = TRUE)) / (max(APM, na.rm = TRUE) - min(APM, na.rm = TRUE)),
    nTMB = TMB_NonsynVariants / 38,
    TIGS = log(nTMB + 1) * nAPM
  )

df_b2m <-
  dplyr::inner_join(tcga_all, TCGA_B2M, by = "Tumor_Sample_Barcode")

table(df_b2m$B2M)
#> 
#>     0     1     2     3     4 
#> 10007    92    27     4     1

df_b2m <- df_b2m %>%
  dplyr::mutate(
    B2M_Status = ifelse(B2M > 0, "Mutant", "WT")
  ) %>%
  dplyr::filter(!is.na(APM), !is.na(TIGS), !is.na(B2M))

table(df_b2m$B2M_Status)
#> 
#> Mutant     WT 
#>     97   8434

ggpubr::ggboxplot(
  df_b2m,
  x = "B2M_Status", y = "APM",
  ylab = "APS", xlab = "B2M mutation status"
) +
  ggpubr::stat_compare_means(method = "t.test", label.x = 1.3)

ggpubr::ggboxplot(
  df_b2m,
  x = "B2M_Status", y = "TIGS",
  xlab = "B2M mutation status"
) +
  ggpubr::stat_compare_means(method = "t.test", label.x = 1.3)

ggpubr::ggboxplot(
  df_b2m,
  x = "B2M_Status", y = "TMB_NonsynVariants",
  ylab = "TMB", xlab = "B2M mutation status"
) +
  ggpubr::stat_compare_means(method = "wilcox.test", label.x = 1.3)

Focus on tructing mutation https://www.nejm.org/doi/10.1056/NEJMoa1604958?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%3dwww.ncbi.nlm.nih.gov

rm(list = ls())

# load package
require(TCGAmutations)


study_list <- tcga_available()$Study_Abbreviation[-34]
cohorts <- system.file("extdata", "cohorts.txt", package = "TCGAmutations")
cohorts <- data.table::fread(input = cohorts)

# calculate TMB
lapply(study_list, function(study) {
  require(maftools)
  TCGAmutations::tcga_load(study)
  maf <- eval(as.symbol(tolower(paste0("TCGA_", study, "_mc3"))))
  rbind(maf@data, maf@maf.silent) %>%
    group_by(Tumor_Sample_Barcode) %>%
    summarise(B2M = sum(grepl("^B2M$", Hugo_Symbol)
    & Variant_Classification %in% c("Nonsense_Mutation")))
}) -> tcga_b2m
names(tcga_b2m) <- study_list

# 33 study available, merge them
TCGA_B2M <- purrr::reduce(tcga_b2m, dplyr::bind_rows)

if (!file.exists("results/TCGA_B2M2.RData")) {
  save(TCGA_B2M, file = "results/TCGA_B2M2.RData")
}

rm(list = grep("tcga_*", ls(), value = TRUE))
rm(list = ls())

rm(list = ls())
load(file = "results/TCGA_B2M2.RData")
load(file = "results/TCGA_ALL.RData")

tcga_all <- tcga_all %>%
  mutate(
    nAPM = (APM - min(APM, na.rm = TRUE)) / (max(APM, na.rm = TRUE) - min(APM, na.rm = TRUE)),
    nTMB = TMB_NonsynVariants / 38,
    TIGS = log(nTMB + 1) * nAPM
  )

df_b2m <-
  dplyr::inner_join(tcga_all, TCGA_B2M, by = "Tumor_Sample_Barcode")

table(df_b2m$B2M)
#> 
#>     0     1     2 
#> 10129    15     3

df_b2m <- df_b2m %>%
  dplyr::mutate(
    B2M_Status = ifelse(B2M > 0, "Nonsense", "Non-nonsense")
  ) %>%
  dplyr::filter(!is.na(APM), !is.na(TIGS), !is.na(B2M))

table(df_b2m$B2M_Status)
#> 
#> Non-nonsense     Nonsense 
#>         8518           13

ggpubr::ggboxplot(
  df_b2m,
  x = "B2M_Status", y = "APM",
  ylab = "APS", xlab = "B2M mutation status"
) +
  ggpubr::stat_compare_means(method = "t.test", label.x = 1.3)

ggpubr::ggboxplot(
  df_b2m,
  x = "B2M_Status", y = "TIGS",
  xlab = "B2M mutation status"
) +
  ggpubr::stat_compare_means(method = "t.test", label.x = 1.3)

ggpubr::ggboxplot(
  df_b2m,
  x = "B2M_Status", y = "TMB_NonsynVariants",
  ylab = "TMB", xlab = "B2M mutation status"
) +
  ggpubr::stat_compare_means(method = "wilcox.test", label.x = 1.3)

Check gene expression in these mutant samples.

load("results/TCGA_RNASeq_PanCancer.RData")
B2M_expr <- RNASeq_pancan %>%
  dplyr::filter(sample == "B2M") %>%
  dplyr::select(-sample) %>%
  as.numeric()

B2M_expr <- dplyr::tibble(
  Tumor_Sample_Barcode = colnames(RNASeq_pancan)[-1],
  B2M_expr = B2M_expr
) %>%
  dplyr::filter(substr(Tumor_Sample_Barcode, 14, 15) == "01") %>%
  dplyr::mutate(Tumor_Sample_Barcode = substr(Tumor_Sample_Barcode, 1, 12))

B2M_expr <- dplyr::left_join(
  df_b2m,
  B2M_expr
)
#> Joining, by = "Tumor_Sample_Barcode"

ggpubr::ggboxplot(
  B2M_expr,
  x = "B2M_Status", y = "B2M_expr",
  xlab = "B2M mutation status"
) +
  ggpubr::stat_compare_means(method = "t.test", label.x = 1.3)

rm(list = ls())

Therefore, APS/TIGS cannot capture whether B2M lose function or not. However, we do observe that B2M mutation upgrade APS/TIGS, this may be explained by compensatory expression of APS gene signature.

Gene sets enriched in patients with high APM score

To identify the specific gene sets/pathway associated with high APS, we firstly run differential gene expression analysis for each TCGA cancer type based on APS status. Patients with APS of first quartile was defined as “APS-High”, patients with APS of the forth quartile was defined as “APS-Low”. Genes with p value <0.01 and FDR <0.05 were ranked by logFC from top to bottom and then inputted into GSEA function of R package clusterProfiler with custom gene sets download from Molecular Signature Database v6.2. Normalized enrichment score (NES) was used to rank the differentially enriched gene sets. In results from hallmark gene sets, several gene signatures (especially interferon alpha/gamma response) were found to be enriched in most TCGA cancer types with high APS, suggesting high APS are strongly associated with interferon alpha/gamma signaling pathway.

The code below is runned on linux server (need much time), thus if reader wanna reproduce it, you may need to download gene sets from Molecular Signature Database v6.2 and modify some file path.

rm(list = ls())
#----------------------------------------
# APM DEG and Pathway Enrichment Analysis
#----------------------------------------
library(tidyverse)

load("results/TCGA_RNASeq_PanCancer.RData")
load("results/gsva_tcga_pancan.RData")
load("results/TCGA_tidy_Clinical.RData")

df.gsva <- full_join(TCGA_Clinical.tidy, gsva.pac, by = c("Tumor_Sample_Barcode" = "tsb"))

tcga_info <- df.gsva
rm(df.gsea, df.gsva)
gc()

tcga_info <- tcga_info %>%
  select(Project:OS.time, Age, Gender, sample_type, Tumor_stage, APM) %>%
  filter(!is.na(APM))

projects <- unique(tcga_info[, "Project"])
samples <- colnames(RNASeq_pancan)[-1]

#-----------------------------------
# Build a workflow to calculate DEGs
#-----------------------------------
findDEGs <- function(info_df = NULL, expr_df = NULL, col_sample = "Tumor_Sample_Barcode", col_group = "APM",
                     col_subset = "Project", method = "limma", threshold = 0.25,
                     save = FALSE, filename = NULL) {
  stopifnot(!is.null(info_df), !is.null(expr_df))
  stopifnot(threshold >= 0 & threshold <= 1)

  if (!require(data.table)) {
    install.packages("data.table", dependencies = TRUE)
  }

  info_df <- setDT(info_df[, c(col_sample, col_group, col_subset)])
  colnames(info_df) <- c("sample", "groupV", "subset")
  info_df <- info_df[sample %in% colnames(expr_df)]
  info_df <- info_df[!is.na(subset)]
  all_sets <- unique(info_df[, subset])


  if ("DEGroup" %in% colnames(info_df)) {
    stop("DEGroup column exists, please rename and re-run.")
  }

  # set threshold
  th1 <- threshold
  th2 <- 1 - th1

  info_df[, DEGroup := ifelse(groupV < quantile(groupV, th1), "Low",
    ifelse(groupV > quantile(groupV, th2), "High", NA)
  ), by = subset]
  info_df <- info_df[!is.na(DEGroup)]
  sets <- unique(info_df[, subset])
  may_del <- setdiff(all_sets, sets)
  if (length(may_del) != 0) {
    message("Following groups has been filtered out because of threshold setting, you better check")
    print(may_del)
  }

  info_list <- lapply(sets, function(x) info_df[subset == x])
  names(info_list) <- sets

  col1 <- colnames(expr_df)[1]


  options(digits = 4)

  doDEG <- function(input, method = NULL) {
    #-- prepare
    exprSet <- expr_df[, c(col1, input$sample)]
    exprSet <- as.data.frame(exprSet)
    exprSet <- na.omit(exprSet)
    input <- as.data.frame(input)
    rownames(exprSet) <- exprSet[, 1]
    exprSet <- exprSet[, -1]

    sample_tb <- table(input$DEGroup)

    #--- make sure have some samples
    if (!length(sample_tb) < 2 & all(sample_tb >= 5)) {
      group_list <- input$DEGroup

      if ("limma" %in% method) {
        suppressMessages(library(limma))
        design <- model.matrix(~ 0 + factor(group_list))
        colnames(design) <- c("High", "Low")
        cont.matrix <- makeContrasts("High-Low", levels = design)

        fit <- lmFit(exprSet, design)
        fit2 <- contrasts.fit(fit, cont.matrix)
        fit2 <- eBayes(fit2)
        topTable(fit2, number = Inf, adjust.method = "BH")
      }
    }
  }

  res <- lapply(info_list, doDEG, method = method)
  names(res) <- sets
  return(res)
}

# threshold = 0.25 is used in manuscript
# DEG_pancan2 = findDEGs(info_df = tcga_info, expr_df = RNASeq_pancan)
# Here we use threshold = 0.5 to do this analysis to respond
# reviewer's comment
DEG_pancan2 <- findDEGs(info_df = tcga_info, expr_df = RNASeq_pancan, threshold = 0.5)
save(DEG_pancan2, file = "results/DEG_pancan.RData")

#----------------------------------------------------
# Pathway enrichment analysis using clusterProfiler
#--------------------------------------------------
load(file = "results/DEG_pancan.RData")
library(clusterProfiler)
library(tidyverse)
library(openxlsx)

#---------------

## - setting
pvalue <- 0.01
adj.pvalue <- 0.05

## - process

#------- Reading GSEA genesets files
hallmark <- read.gmt("/public/data/VM_backup/biodata/MsigDB/h.all.v6.2.symbols.gmt")
c1 <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c1.all.v6.2.symbols.gmt")
c2_kegg <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c2.cp.kegg.v6.2.symbols.gmt")
c2_reactome <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c2.cp.reactome.v6.2.symbols.gmt")
c3 <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c3.all.v6.2.symbols.gmt")
c4 <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c4.all.v6.2.symbols.gmt")
c5_mf <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c5.mf.v6.2.symbols.gmt")
c5_bp <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c5.bp.v6.2.symbols.gmt")
c6 <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c6.all.v6.2.symbols.gmt")
c7 <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c7.all.v6.2.symbols.gmt")
#-=---------

goGSEA <- function(DEG, prefix = NULL, pvalue = 0.01, adj.pvalue = 0.05, destdir = "~/projects/tumor-immunogenicity-score/report/results/GSEA_results") {
  library(clusterProfiler)
  library(openxlsx)
  library(tidyverse)

  filterDEG <- subset(DEG, subset = P.Value < pvalue & adj.P.Val < adj.pvalue)
  filterDEG$SYMBOL <- rownames(filterDEG)
  filterDEG <- filterDEG %>%
    arrange(desc(logFC), adj.P.Val)
  geneList <- filterDEG$logFC
  names(geneList) <- filterDEG$SYMBOL

  res <- list()
  if (base::exists("hallmark")) res$hallmark <- GSEA(geneList, TERM2GENE = hallmark, verbose = FALSE)
  # if (base::exists("c1")) res$c1 = GSEA(geneList, TERM2GENE=c1, verbose=FALSE)
  if (base::exists("c2_kegg")) res$c2_kegg <- GSEA(geneList, TERM2GENE = c2_kegg, verbose = FALSE)
  if (base::exists("c2_reactome")) res$c2_reactome <- GSEA(geneList, TERM2GENE = c2_reactome, verbose = FALSE)
  # if (base::exists("c3")) res$c3 = GSEA(geneList, TERM2GENE=c3, verbose=FALSE)
  # if (base::exists("c4")) res$c4 = GSEA(geneList, TERM2GENE=c4, verbose=FALSE)
  if (base::exists("c5_mf")) res$c5_mf <- GSEA(geneList, TERM2GENE = c5_mf, verbose = FALSE)
  if (base::exists("c5_bp")) res$c5_bp <- GSEA(geneList, TERM2GENE = c5_bp, verbose = FALSE)
  if (base::exists("c6")) res$c6 <- GSEA(geneList, TERM2GENE = c6, verbose = FALSE)
  if (base::exists("c7")) res$c7 <- GSEA(geneList, TERM2GENE = c7, verbose = FALSE)

  getResultList <- lapply(res, function(x) x@result)
  if (!dir.exists(destdir)) dir.create(destdir)
  outpath <- file.path(destdir, prefix)
  write.xlsx(x = getResultList, file = paste0(outpath, ".xlsx"))
  return(res)
}


# remove STAD, CHOL, KICH and DLBC
DEG_pancan2 <- DEG_pancan2[setdiff(names(DEG_pancan2), c("STAD", "CHOL", "KICH", "DLBC"))]
GSEA_list <- Map(goGSEA, DEG_pancan2, names(DEG_pancan2))
save(GSEA_list, file = "results/GSEA_results_rm_notEnriched.RData")

#------- GSEA example plot
# load("results/GSEA_results.RData")
load("results/GSEA_results_rm_notEnriched.RData")
library(clusterProfiler)

gseaplot(GSEA_list$SKCM$hallmark, geneSetID = "HALLMARK_INTERFERON_GAMMA_RESPONSE")

# gseaplot(egmt2, geneSetID = "INTEGRAL_TO_PLASMA_MEMBRANE")
# gseaplot(res$hallmark, geneSetID = "HALLMARK_INTERFERON_GAMMA_RESPONSE")

#------ get main result cloumn
gsea_results <- lapply(GSEA_list, function(x) {
  # x@result %>% select()
  lapply(x, function(gsea) {
    gsea@result %>% dplyr::select(ID, setSize, enrichmentScore, NES, pvalue, qvalues)
  })
})

save(gsea_results, file = "results/GSEA_results_notEnriched_small.RData")
rm(GSEA_list)
gc()

#----- plot
summariseGSEA <- function(pathway = NULL) {
  purrr::reduce(Map(function(x, y, pathway) {
    if (nrow(x[[pathway]]) == 0) {
      res <- NULL
    } else {
      res <- x[[pathway]]
      res$Project <- y
    }
    res
  }, gsea_results, names(gsea_results), pathway), rbind)
}

gsea_summary <- list()
gsea_summary$hallmark <- summariseGSEA(pathway = "hallmark")
# gsea_summary$c2_kegg = summariseGSEA(pathway = "c2_kegg")
gsea_summary$c2_reactome <- summariseGSEA(pathway = "c2_reactome")
gsea_summary$c5_mf <- summariseGSEA(pathway = "c5_mf")
gsea_summary$c5_bp <- summariseGSEA(pathway = "c5_bp")
gsea_summary$c6 <- summariseGSEA(pathway = "c6")
gsea_summary$c7 <- summariseGSEA(pathway = "c7")

save(gsea_summary, file = "results/GSEA_summary_notEnriched.RData")

## plot really
load("../data/df_combine_gsva_clinical.RData")

df.gsva %>%
  filter(!is.na(APM)) %>%
  group_by(Project) %>%
  summarize(MedianAPM = median(APM), N = n()) %>%
  arrange(MedianAPM) -> sortAPM
sortAPM %>% filter(Project != "DLBC") -> sortAPM

library(scales)
gsea_summary$hallmark %>%
  ggplot(mapping = aes(x = Project, y = ID)) +
  geom_tile(aes(fill = NES)) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red") +
  scale_x_discrete(limits = sortAPM$Project) + theme_classic() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

# df_wide = reshape2::dcast(gsea_summary$hallmark, ID ~ Project, value.var="NES", fill = 0)

library(pheatmap)

plotPathway <- function(df, save = FALSE, path = NULL, silent = FALSE) {
  df_wide <- reshape2::dcast(df, ID ~ Project, value.var = "NES", fill = 0)

  breaksList <- seq(-max(df_wide[, -1], na.rm = TRUE), max(df_wide[, -1], na.rm = TRUE), by = 0.01)
  pheatmap(df_wide %>% column_to_rownames(var = "ID"),
    color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
    breaks = breaksList,
    # annotation_col = annotation_col,
    fontsize_row = 8, fontsize_col = 8, silent = silent,
    cellheight = 10, cellwidth = 10, filename = ifelse(save == FALSE, NA, path)
  )
}

gg <- plotPathway(gsea_summary$hallmark, silent = F, path = "results/GSEA_results_plot/Hallmark_Enriched.pdf", save = TRUE)
# gg = plotPathway(gsea_summary$c2_kegg, silent = F, path = "results/GSEA_results_plot/KEGG_Enriched.pdf", save = TRUE)
gg <- plotPathway(gsea_summary$c2_reactome, silent = F, path = "results/GSEA_results_plot/Reactome_Enriched.pdf", save = TRUE)
gg <- plotPathway(gsea_summary$c5_mf, silent = F, path = "results/GSEA_results_plot/GO_MolecuFunction_Enriched.pdf", save = TRUE)
gg <- plotPathway(gsea_summary$c5_bp, silent = F, path = "results/GSEA_results_plot/GO_BiologicalProcess_Enriched.pdf", save = TRUE)
gg <- plotPathway(gsea_summary$c6, silent = F, path = "results/GSEA_results_plot/C6_oncogenicsignatures_Enriched.pdf", save = TRUE)
gg <- plotPathway(gsea_summary$c7, silent = F, path = "results/GSEA_results_plot/C7_immunologicsignatures_Enriched.pdf", save = TRUE)

# gg


# _-------- rank pheatmap by NES value
plotPathway2 <- function(df, save = FALSE, path = NULL, silent = FALSE) {
  df_wide <- reshape2::dcast(df, ID ~ Project, value.var = "NES", fill = 0)
  df_wide <- df_wide[order(rowMeans(df_wide[, -1]), decreasing = TRUE), ]
  rownames(df_wide) <- NULL

  breaksList <- seq(-max(df_wide[, -1], na.rm = TRUE), max(df_wide[, -1], na.rm = TRUE), by = 0.01)

  pheatmap(df_wide %>% column_to_rownames(var = "ID"),
    color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
    breaks = breaksList, cluster_rows = FALSE,
    # annotation_col = annotation_col,
    fontsize_row = 8, fontsize_col = 8, silent = silent,
    cellheight = 10, cellwidth = 10, filename = ifelse(save == FALSE, NA, path)
  )
}

gg <- plotPathway2(gsea_summary$c2_reactome, silent = F, path = "results/GSEA_results_plot/Reactome_Enriched2.pdf", save = TRUE)

Session Info

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS:   /home/public/R/R-base/lib64/R/lib/libRblas.so
#> LAPACK: /home/public/R/R-base/lib64/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] grid      stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] forestplot_1.9      checkmate_1.9.4     survival_3.1-7     
#>  [4] ggpubr_0.2.3        magrittr_1.5        scales_1.0.0       
#>  [7] corrplot_0.84       pheatmap_1.0.12     forcats_0.4.0      
#> [10] stringr_1.4.0       purrr_0.3.3         tidyr_1.0.0        
#> [13] tibble_2.1.3        ggplot2_3.2.1       tidyverse_1.2.1    
#> [16] readr_1.3.1         dplyr_0.8.3         UCSCXenaTools_1.2.7
#> [19] rmdformats_0.3.5    knitr_1.26          pacman_0.5.1       
#> 
#> loaded via a namespace (and not attached):
#>  [1] utf8_1.1.4             questionr_0.7.0        R.utils_2.9.0         
#>  [4] tidyselect_0.2.5       lme4_1.1-21            robust_0.4-18.1       
#>  [7] htmlwidgets_1.5.1      munsell_0.5.0          codetools_0.2-16      
#> [10] DT_0.10                future_1.15.0          miniUI_0.1.1.1        
#> [13] withr_2.1.2            Brobdingnag_1.2-6      metaBMA_0.6.2         
#> [16] colorspace_1.4-1       highr_0.8              rstudioapi_0.10       
#> [19] stats4_3.6.0           DescTools_0.99.30      robustbase_0.93-5     
#> [22] rcompanion_2.3.7       ggsignif_0.6.0         listenv_0.7.0         
#> [25] labeling_0.3           emmeans_1.4.2          rstan_2.19.2          
#> [28] mnormt_1.5-5           MCMCpack_1.4-4         bridgesampling_0.7-2  
#> [31] coda_0.19-3            vctrs_0.2.0            generics_0.0.2        
#> [34] TH.data_1.0-10         metafor_2.1-0          xfun_0.11             
#> [37] R6_2.4.0               BayesFactor_0.9.12-4.2 reshape_0.8.8         
#> [40] logspline_2.1.15       assertthat_0.2.1       promises_1.1.0        
#> [43] multcomp_1.4-10        ggExtra_0.9            gtable_0.3.0          
#> [46] multcompView_0.1-7     globals_0.12.4         processx_3.4.1        
#> [49] mcmc_0.9-6             sandwich_2.5-1         rlang_0.4.1           
#> [52] MatrixModels_0.4-1     EMT_1.1                zeallot_0.1.0         
#> [55] splines_3.6.0          TMB_1.7.15             lazyeval_0.2.2        
#> [58] broom_0.5.2            inline_0.3.15          abind_1.4-5           
#> [61] yaml_2.2.0             reshape2_1.4.3         modelr_0.1.5          
#> [64] crosstalk_1.0.0        backports_1.1.5        httpuv_1.5.2          
#> [67] tools_3.6.0            bookdown_0.15          psych_1.8.12          
#> [70] ellipsis_0.3.0         RColorBrewer_1.1-2     WRS2_1.0-0            
#> [73] ez_4.4-0               Rcpp_1.0.3             plyr_1.8.4            
#>  [ reached getOption("max.print") -- omitted 98 entries ]

Immunotherapy datasets analyses

This part will clearly describe how to analyze APS, TMB and TIGS in immunotherapy datasets, including cancer type level and individual level. Raw data used for these analyses have been preprocessed, detail please read preprocessing part or TCGA pan-cancer analyses of this analysis report.

TIGS and pan-cancer objective response rate to PD-1 inhibition

Previous study has shown that TMB could predict pan-cancer ICI objective response rates (ORR). Here we will evaluate and compare the predictive power of APS, TIGS with TMB in pan-cancer ICI objective response rates prediction. The ORR for anti–PD-1 or anti–PD-L1 therapy will be plotted against the corresponding median APS, TIGS, TMB across multiple cancer types.

Through an extensive literature search, we identified 26 tumor types or subtypes for which data regarding the ORR are available. For each tumor type, we pooled the response data from the largest published studies that evaluated the ORR. We included only studies of anti–PD-1 or anti–PD-L1 monotherapy that enrolled at least 10 patients who were not selected for PD-L1 tumor expression (Identified individual studies and references are available in the manuscript Supplementary Table 3). The median tumor mutational burden for each tumor type was obtained from a validated comprehensive genomic profiling assay performed and provided by Foundation Medicine. The APS information for 23 tumor types were calculated based on TCGA datasets, and the APS for merkel cell carcinoma, cutaneous squamous cell carcinoma and small-cell lung cancer were calculated based on GEO microarray datasets.

Combination of datasets

In code implementation, we firstly combine APM score, TIGS score from TCGA studies and GEO datasets.

library(tidyverse)
## retrieve APM score
load("results/Add_gsva_scc_sclc_merkel.RData")

buildDF <- function(x, tumor_type = NULL) {
  stopifnot(!is.null(tumor_type))
  res <- x %>%
    dplyr::mutate(Project = tumor_type) %>%
    as.data.frame()
  res
}

lapply(gsva.merkel, FUN = buildDF, tumor_type = "Merkel Cell Carcinoma") %>%
  purrr::reduce(dplyr::bind_rows) -> df_merkel
lapply(gsva.scc, FUN = buildDF, tumor_type = "Cutaneous Squamous Cell Carcinoma") %>%
  purrr::reduce(dplyr::bind_rows) -> df_scc
lapply(gsva.sclc, FUN = buildDF, tumor_type = "Small-Cell Lung Cancer") %>%
  purrr::reduce(dplyr::bind_rows) -> df_sclc

## combine datasets
load("results/TCGA_ALL.RData")
df_list <- list(tcga_all %>% filter(!is.na(APM)), df_merkel, df_scc, df_sclc)
purrr::reduce(df_list, bind_rows) -> df_all

if (!file.exists("results/df_all.RData")) {
  save(df_all, file = "results/df_all.RData")
}

Using the combined dataset df_all, we can normalize APM score to [0, 1] region, then we calculate median APM score for each tumor type.

df_all %>%
  mutate(N.APM = (APM - min(APM)) / (max(APM) - min(APM))) %>%
  group_by(Project) %>%
  summarise(MedianAPM = median(N.APM), NumberOfPatient = n()) -> sm_APM

sm_APM
#> # A tibble: 35 x 3
#>    Project                           MedianAPM NumberOfPatient
#>    <chr>                                 <dbl>           <int>
#>  1 ACC                                   0.444              79
#>  2 BLCA                                  0.588             407
#>  3 BRCA                                  0.323            1083
#>  4 CESC                                  0.734             303
#>  5 CHOL                                  0.623              36
#>  6 COAD                                  0.646             286
#>  7 Cutaneous Squamous Cell Carcinoma     0.506               4
#>  8 DLBC                                  0.816              48
#>  9 ESCA                                  0.353             184
#> 10 GBM                                   0.270             154
#> # … with 25 more rows

Lastly, we merge this data to 25 tumor types or subtypes for which data regarding the ORR, TMB are available by hand using Excel. This process has been double checked. With the merged data, we can explore association of ORR and APM score, ORR and TMB etc. in immunotherapy datasets.

Significant correlation between APS, TMB and the ORR

We plot ORR against APS, TMB respectively, and fit their relationship with linear model. We observe that significant correlation between APS, TMB and the ORR.

Load pan-cancer ORR data and show it as a table.

sm_data <- read_csv("../data/summary_data_new_20190411.csv", col_types = cols())

DT::datatable(sm_data,
  options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE
)

DLBCL has high APM and TIGS, yet the response rate in Figure 4 of over 50% is incorrect. This 50%+ response rate is for Hodgkin lymphoma, not DLBCL.

We did mix the DLBC and Hodgkin lymphoma, thus we remove it and update our analyses.

Focus on cancer type with APS, fit data with linear model.

sm_data <- sm_data %>% filter(!is.na(Pool_APM))

lm(Pool_ORR ~ Pool_APM, data = sm_data) -> fit1

lm(Pool_ORR ~ log(Pool_TMB), data = sm_data) -> fit2

summary(fit1)
#> 
#> Call:
#> lm(formula = Pool_ORR ~ Pool_APM, data = sm_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -15.37  -7.05  -1.34   2.81  31.35 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)   -0.213      7.788   -0.03    0.978  
#> Pool_APM      31.931     14.465    2.21    0.038 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 11 on 23 degrees of freedom
#> Multiple R-squared:  0.175,  Adjusted R-squared:  0.139 
#> F-statistic: 4.87 on 1 and 23 DF,  p-value: 0.0375

summary(fit2)
#> 
#> Call:
#> lm(formula = Pool_ORR ~ log(Pool_TMB), data = sm_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -11.39  -5.04  -2.32   3.66  26.69 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)     -0.137      3.795   -0.04     0.97    
#> log(Pool_TMB)   11.356      2.343    4.85  6.8e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8.6 on 23 degrees of freedom
#> Multiple R-squared:  0.505,  Adjusted R-squared:  0.484 
#> F-statistic: 23.5 on 1 and 23 DF,  p-value: 6.84e-05

Of note, here we use log(TMB) will not change the relationshipe between ORR and TMB.

We observe that 17.5% variance of ORR can be explained by APS, and 50.5% variance of ORR can be explained by TMB. The result of TMB is similar to NEJM paper Tumor Mutational Burden and Response Rate to PD-1 inhibition. This result show that TMB and APS are biomarker that predicting ORR of immunotherapy, TMB is better than APS.

Next we plot the linear model.

require(ggrepel)
#> Loading required package: ggrepel
require(scales)

ggplot(sm_data, aes(x = Pool_APM, y = Pool_ORR)) +
  geom_point(aes(color = Patients_APM, size = Patients_ORR)) +
  geom_smooth(method = "lm", se = T) +
  geom_text_repel(aes(label = Cancer_Type), size = 3) +
  labs(
    x = "Median Normalized APM Score ", y = "Objective Response Rate (%)",
    size = "Objective Response Rate\n(no. of patients evaluated)",
    color = "APM Score\n(no. of tumor analyzed)"
  ) +
  scale_size_continuous(breaks = c(50, 100, 500, 1000)) +
  scale_color_gradientn(
    colours = RColorBrewer::brewer.pal(5, name = "OrRd")[-1],
    breaks = c(50, 200, 500, 1000)
  ) +
  theme_bw()



ggplot(sm_data, aes(x = Pool_TMB, y = Pool_ORR)) +
  geom_point(aes(color = Patients_TMB, size = Patients_ORR)) +
  geom_smooth(method = "lm", se = T) +
  geom_text_repel(aes(label = Cancer_Type), size = 3) +
  labs(
    x = "Median No. of Coding Somatic Mutation per MB", y = "Objective Response Rate (%)",
    size = "Objective Response Rate\n(no. of patients evaluated)",
    color = "Tumor Mutational Burden\n(no. of tumor analyzed)"
  ) +
  scale_x_continuous(
    trans = log_trans(),
    breaks = c(2, 10, 20, 30, 40, 50),
    labels = c(2, 10, 20, 30, 40, 50)
  ) +
  scale_size_continuous(breaks = c(50, 100, 500, 1000)) +
  scale_color_gradientn(
    colours = RColorBrewer::brewer.pal(5, name = "OrRd")[-1],
    breaks = c(100, 1000, 5000, 10000)
  ) +
  theme_bw() +
  guides(
    color = guide_colorbar(order = 1)
  )

TIGS definition and performance

From the aspect of biological process, TMB release is independent from processing and presentation of tumor antigens. Now that we have found TMB and APM are biomarkers which can predict object reponse rate of cancer type in immunotherapy, these two factors are independent, and are both required for tumor immunogenicity determination, why not integrate them as a new biomarker which should be better than TMB?

Therefore, theoretically, tumor immunogenicity can be represented as

\[ [“Tumor \quad antigenicity”] \times [“Antigen \quad processing \quad and \quad presenting \quad status”]. \]

We use log(TMB) here because log(TMB) has been proved to have a linear relationship with ORR. So the formula can be written as

\[ TIGS = APS_{normalized} \times log(TMB) \]

However, some tumors have TMB level below 1 mutation/ Mb, to avoid minus number in quantifying “tumor antigenicity”, we add number 1 to all normalized TMB. So in this case, the TIGS formula is

\[ TIGS = APS_{normalized} \times log(TMB + 1) \]

When TMB < 1 mutation/Mb, the log(TMB) would less than 0, if APS is very big, we would get a big minus number in this case, which against common sense.

Now we construct a linear model by integrate TMB and APM.

lm(Pool_ORR ~ log(Pool_TMB + 1):Pool_APM, data = sm_data) -> fit3

summary(fit3)
#> 
#> Call:
#> lm(formula = Pool_ORR ~ log(Pool_TMB + 1):Pool_APM, data = sm_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -10.07  -4.26  -1.32   1.94  26.70 
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                   -2.66       3.57   -0.75     0.46    
#> log(Pool_TMB + 1):Pool_APM    21.39       3.64    5.88  5.4e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 7.7 on 23 degrees of freedom
#> Multiple R-squared:  0.601,  Adjusted R-squared:  0.583 
#> F-statistic: 34.6 on 1 and 23 DF,  p-value: 5.39e-06

We observe that 60.1% variance of ORR can be explained by TIGS, this result is indeed bettern than TMB.

sm_data$tigs <- sm_data$Pool_APM * log(sm_data$Pool_TMB + 1)

ggplot(sm_data, aes(x = tigs, y = Pool_ORR)) +
  geom_point(aes(size = Patients_ORR)) +
  geom_smooth(method = "lm", se = T) +
  geom_text_repel(aes(label = Cancer_Type), size = 3) +
  labs(
    x = "Tumor Immunogenicity Score ", y = "Objective Response Rate (%)",
    size = "Objective Response Rate\n(no. of patients evaluated)"
  ) +
  scale_size_continuous(breaks = c(50, 100, 500, 1000)) +
  theme_bw()

More exploration of TIGS formula

Would other combination of TMB and APS be better?

As mentioned above, we constructed the formula of TIGS according to our understanding of “tumor antigenicity”. Is there a better combination of these two factors?

There are two basic models we take into consideration.

Model 1: \(ORR = a \times (APS \times log(TMB)) + b\), i.e. \(TIGS = APS \times log(TMB)\) (This is the model we used above)
Model 2: \(ORR = e \times (a \times APS + b\times log(TMB) + c\times APS \times log(TMB)) + d\), i.e. \(TIGS = a \times APS + b\times log(TMB) + c\times APS \times log(TMB)\)

a, b, c, d and e here are coefficients. In R, we don’t need to write them in models, it will take care of them.

Now, let’s observe the results.

# Model 1
lm(Pool_ORR ~ log(Pool_TMB + 1):Pool_APM, data = sm_data) %>% summary()
#> 
#> Call:
#> lm(formula = Pool_ORR ~ log(Pool_TMB + 1):Pool_APM, data = sm_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -10.07  -4.26  -1.32   1.94  26.70 
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                   -2.66       3.57   -0.75     0.46    
#> log(Pool_TMB + 1):Pool_APM    21.39       3.64    5.88  5.4e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 7.7 on 23 degrees of freedom
#> Multiple R-squared:  0.601,  Adjusted R-squared:  0.583 
#> F-statistic: 34.6 on 1 and 23 DF,  p-value: 5.39e-06
# Model 2
lm(Pool_ORR ~ log(Pool_TMB + 1) * Pool_APM, data = sm_data) %>% summary()
#> 
#> Call:
#> lm(formula = Pool_ORR ~ log(Pool_TMB + 1) * Pool_APM, data = sm_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -8.97  -4.22  -2.46   2.26  26.69 
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                  -17.63      23.61   -0.75     0.46
#> log(Pool_TMB + 1)             12.17      15.51    0.78     0.44
#> Pool_APM                      24.04      45.06    0.53     0.60
#> log(Pool_TMB + 1):Pool_APM     1.02      29.21    0.03     0.97
#> 
#> Residual standard error: 7.8 on 21 degrees of freedom
#> Multiple R-squared:  0.621,  Adjusted R-squared:  0.567 
#> F-statistic: 11.5 on 3 and 21 DF,  p-value: 0.000114

Model 2 is a little better than model 1, however model 1 is much simpler (all terms of model 1 is included in model 2!). Following Occam’s razor, we choose model 1 when compare it with model 2.

Any promotion of formula \(TIGS = APS \times log(TMB + 1)\)?

We still have a question about TIGS formula. The TMB in formula is calculated as

\[ TMB = ln(\frac{whole \quad exome \quad mutation \quad number}{38} + 1 ) \]

We choose nonsynonymous mutations because they have functional effects. Only functional effects on protein will generate neo-antigen (peptide).

We use 38 here because 38 MB is the estimated length of exome, so we can normalize TMB to per MB. However, in math model, there may be a better value “a” than 38. We don’t know yet, so we need to explore this “a” value.

We simulate “a” value from 0.1 to 10000 with step size 0.1 in the following formula, the resulted TIGS is then used to fit a linear model for pan-cancer ICI ORR prediction. We calculate differnt R Square value under each “a” and see how it changes.

\[ TIGS = APS_{normalized} \times ln(\frac{whole \quad exome \quad mutation \quad number}{a} + 1) \]

Implementation:

df <- sm_data
df$APS <- df$Pool_APM
df$TMB <- df$Pool_TMB * 38 # transform back to whole exome nonsynonymous mutation number
df$ORR <- df$Pool_ORR

df <- dplyr::select(df, APS, TMB, ORR)


calcTrends <- function(df, a) {
  sapply(a, function(x, data) {
    summary(lm(ORR ~ APS:log((TMB / x + 1)), data = df))$r.squared
  }, data = df)
}

a <- seq(0.1, 10000, 0.1)
res <- calcTrends(df, a)

res_df <- data.frame(a = a, Rsquare = res)

Plot the simulation result.

opar <- par(no.readonly = TRUE)
par(mfrow = c(1, 2))
plot(res_df, las = 1, xlab = "a", ylab = "R Square", type = "l")
plot(res_df[1:1000, ], las = 1, xlab = "a", ylab = "R Square", type = "l")
abline(v = 38, lty = 2, col = "red")

par(opar)

We can find that R squares of linear model have a maximum value when “a” is around 40. Therefore, “a=38” is optimized to rescale the raw whole exome mutation counts and balance the contribution for APM and TMB in TIGS formula.

Performance comparison on immunotherapy clinical response prediction and survival benefit

Description of datasets please see preprocessing part or “Method Section” of manuscript.

TMB is a well-known biomarker and has been applied. Considering our TIGS is based on TMB, thus the most important point we wanna do in this section is comparing the prediction/clinical perfermance between TMB and TIGS.

However, some other biomarkers can predict immunotherapy response have been studied by other groups. Therefore, this section we will include following biomarkers in comparsion.

APS - antigen processing and presenting efficiency score
TMB - tumor mutation burden
TIGS - tumor immunogenicity score
IIS - immune infiltration score
IFNG - interferon gamma response biomarkers of 6 genes including IFNG, STAT1, IDO1, CXCL10, CXCL9, and HLA-DRA
CD8 - gene expression level of CD8A + CD8B
PDL1 - an immunohistochemistry (IHC) biomarker approved by FDA. This study we use PD-L1 gene expression as the IHC surrogate
TIDE - TIDE signature which integrates the expression signatures of T cell dysfunction and T cell exclusion to model tumor immune evasion.
ISG.RS and IFNG.GS signatures described in Benci et al, Cell 2019
PHBR scores from MHC-I Genotype Restricts the Oncogenic Mutational Landscape and Evolutionary Pressure against MHC Class II Binding Cancer Mutations

The predicted values of gene expression biomarkers (i.e. IFNG, CD8, PDL1) are the average values among all members defined by the original publications. For ISG.RS and IFNG.GS, we follow the reference to calculate normlized score.

We will also try to combine these gene expression based signatures with TMB to see their performance.

Preprocessing of immunotherapy datasets

The first step is to preprocess 3 immunotherapy datasets with gene expression, TMB and clinical outcome available. APS from Pancan analysis is used to normalize APS here.

# get min and max APM value based on TCGA and GEO pan-cancer data
load("results/df_all.RData")
min_APS <- min(df_all$APM)
max_APS <- max(df_all$APM)
rm(df_all)

# load gene list
load("results/merged_geneList.RData")
# load function
source("../code/functions.R")

Hugo et al 2016 dataset

## Data source: Genomic and Transcriptomic Features of Response to Anti-PD-1 Therapy in Metastatic Melanoma. Cell 2016 Mar 24;165(1):35-44. PMID: 26997480
suppressPackageStartupMessages(library(GEOquery))
library(tidyverse)
library(readxl)
geo_dir <- "../data/GEOdata"

gse <- "GSE78220"
GSE_78220 <- getGEO(gse, GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)
GSE_78220 <- GSE_78220$GSE78220_series_matrix.txt.gz

#### Download gene expression using following command
# download.file(url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE78220&format=file&file=GSE78220%5FPatientFPKM%2Exlsx", destfile = "../data/GSE78220_FPKM.xlsx")

GSE78220_FPKM <- readxl::read_xlsx("../data/GSE78220_FPKM.xlsx")

GSE78220_FPKM <- GSE78220_FPKM %>%
  as.data.frame() %>%
  mutate(mean_expr = rowMeans(.[, -1], na.rm = TRUE)) %>%
  arrange(Gene, desc(mean_expr)) %>%
  distinct(Gene, .keep_all = TRUE) %>%
  dplyr::select(-mean_expr) %>%
  as.tibble()

# pData(GSE_78220) %>% View()

# normalize gene expression as log(x + 1)
GSE78220_Norm <- GSE78220_FPKM
GSE78220_Norm[, -1] <- log2(GSE78220_Norm[, -1] + 1)

# apply gsva
applyGSVA(merged_geneList,
  group_col = "Cell_type",
  gene_col = "Symbol", ExprMatList = list(GSE78220_Norm), method = "gsva"
) -> gsva.hugo2016

gsva.hugo2016 <- calc_TisIIs(gsva.hugo2016[[1]])

# normalize APS
APS_hugo <- gsva.hugo2016$APM
APS_hugo <- (APS_hugo - min_APS) / (max_APS - min_APS)

names(APS_hugo) <- colnames(GSE78220_Norm)[-1]
APS_hugo
names(APS_hugo) <- sub(pattern = "(Pt.*)\\.baseline", "\\1", names(APS_hugo))


# keep biggest value
APS_hugo <- APS_hugo[-21]
names(APS_hugo)[c(14, 20)] <- c("Pt16", "Pt27")
APS_hugo

# keep corresponding expression data for downstream analysis
rna_hugo <- GSE78220_Norm[, -22]
colnames(rna_hugo)[-1] <- names(APS_hugo)

if (!file.exists("results/Hugo2016_RNA.RData")) {
  save(rna_hugo, file = "results/Hugo2016_RNA.RData")
}

GSE78220.APS <- tibble(PatientID = names(APS_hugo), APS = APS_hugo, IIS = gsva.hugo2016$IIS[-21])

# generate results for random selected APS genes
set.seed(123456)
apm_gens <- merged_geneList %>% dplyr::filter(Cell_type == "APM")
sapply(1:100, function(i) {
  tempGens <- sample(GSE78220_Norm$Gene, 18)
  apm_gens$Symbol <- tempGens
  applyGSVA(apm_gens,
    group_col = "Cell_type",
    gene_col = "Symbol", ExprMatList = list(GSE78220_Norm), method = "gsva"
  )
}) -> gsva.hugo2016.random

gsva.hugo2016.random <- as.data.frame(sapply(gsva.hugo2016.random, function(x) {
  (x$APM - min_APS) / (max_APS - min_APS)
}))
colnames(gsva.hugo2016.random) <- paste0("APS", 1:100)
gsva.hugo2016.random <- gsva.hugo2016.random[-21, ]
rownames(gsva.hugo2016.random) <- names(APS_hugo)

# read clincal and tmb data
GSE_78220.TMB <- read_csv("../data/Cell2016.csv", col_types = cols())
GSE_78220.TMB$nTMB <- GSE_78220.TMB$TotalNonSyn / 38

info_hugo <- full_join(GSE_78220.TMB, GSE78220.APS, by = "PatientID")
info_hugo <- info_hugo[, c(1:5, 14:16, 13)]

# calculate TIGS score
info_hugo$TIGS <- log(info_hugo$nTMB) * info_hugo$APS

save(info_hugo, file = "results/Hugo2016_Info.RData")

Van Allen et al 2015 dataset

#----------------------------------------------------------
# Data source :
# Science 2015: Genomic correlates of response to CTLA4 blockade in metastatic melanoma
#--------------------------
library(readxl)
suppressPackageStartupMessages(library(tidyverse))
science2015_TMB <- read_xlsx("../data/Science2015_TMB_List_AllPatients.xlsx")
science2015_cli1 <- read_xlsx("../data/Science2015_Clinical.xlsx", sheet = 1)
science2015_cli2 <- read_xlsx("../data/Science2015_Clinical.xlsx", sheet = 2)
science2015_rna <- read_tsv("../data/MEL-IPI-Share.rpkm.gct", col_types = cols())

#---- Preprocess
vc.nonSilent <- c(
  "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
  "Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
  "In_Frame_Ins", "Missense_Mutation"
)

science2015_TMB %>%
  group_by(patient) %>%
  summarise(
    TotalMutation = n(),
    TotalNonSyn = sum(Variant_Classification %in% vc.nonSilent),
    nTMB = TotalNonSyn / 38
  ) -> science2015_TMB_summary

#---- remove duplicate symbols of rna data
science2015_rna %>%
  mutate(symbol = sub("(.*)\\..*$", "\\1", Description)) %>%
  dplyr::select(-Name, -Description) %>%
  dplyr::select(symbol, everything()) %>%
  as.data.frame() %>%
  mutate(mean_expr = rowMeans(.[, -1], na.rm = TRUE)) %>%
  arrange(symbol, desc(mean_expr)) %>%
  distinct(symbol, .keep_all = TRUE) %>%
  dplyr::select(-mean_expr) %>%
  as.tibble() -> science2015_ge

length(intersect(science2015_ge$symbol, GSE78220_Norm$Gene))
# the gene number of science paper is bigger than hugo 2016
# maybe miRNA and some other RNA included
# keep them almost equal\
science2015_ge <- science2015_ge %>%
  filter(symbol %in% GSE78220_Norm$Gene)

science2015_ge[, -1] <- log2(science2015_ge[, -1] + 1)

applyGSVA(merged_geneList,
  group_col = "Cell_type",
  gene_col = "Symbol",
  ExprMatList = list(science2015_ge), method = "gsva"
) -> gsva.VanAllen2015

gsva.VanAllen2015 <- calc_TisIIs(gsva.VanAllen2015[[1]])

APS_VanAllen <- gsva.VanAllen2015$APM
names(APS_VanAllen) <- colnames(science2015_ge)[-1]
APS_VanAllen <- (APS_VanAllen - min_APS) / (max_APS - min_APS)

APS_VanAllen
names(APS_VanAllen) <- sub("^MEL.IPI_(.*)\\.Tumor.*$", "\\1", names(APS_VanAllen))
APS_VanAllen

rna_VanAllen <- science2015_ge
colnames(rna_VanAllen)[-1] <- names(APS_VanAllen)
## save gene experssion
if (!file.exists("results/VanAllen2015_RNA.RData")) {
  save(rna_VanAllen, file = "results/VanAllen2015_RNA.RData")
}

APS_VanAllen_df <- tibble(
  patient = names(APS_VanAllen), APS = APS_VanAllen,
  IIS = gsva.VanAllen2015$IIS
)


# generate results for random selected APS genes
set.seed(123456)
# apm_gens = merged_geneList %>% dplyr::filter(Cell_type == "APM")
sapply(1:100, function(i) {
  tempGens <- sample(science2015_ge$symbol, 18)
  apm_gens$Symbol <- tempGens
  applyGSVA(apm_gens,
    group_col = "Cell_type",
    gene_col = "Symbol", ExprMatList = list(science2015_ge), method = "gsva"
  )
}) -> gsva.VanAllen2015.random

gsva.VanAllen2015.random <- as.data.frame(sapply(gsva.VanAllen2015.random, function(x) {
  (x$APM - min_APS) / (max_APS - min_APS)
}))
colnames(gsva.VanAllen2015.random) <- paste0("APS", 1:100)
rownames(gsva.VanAllen2015.random) <- names(APS_VanAllen)


science2015_cli <- bind_rows(
  dplyr::select(
    science2015_cli1, patient, age_start, RECIST, overall_survival, progression_free, primary, group,
    histology, stage, gender, dead, progression, neos50
  ),
  dplyr::select(
    science2015_cli2, patient, age_start, RECIST, overall_survival, progression_free, primary,
    group, histology, stage, gender, dead, progression
  ) %>% filter(patient %in% c("Pat20", "Pat91"))
)
# combine three datasets
info_VanAllen <- full_join(full_join(science2015_cli, APS_VanAllen_df, by = "patient"),
  science2015_TMB_summary,
  by = "patient"
)

info_VanAllen$TIGS <- info_VanAllen$APS * log(info_VanAllen$nTMB + 1) # to avoid TIGS <0, TIGS = log(TMB +1) *APM for this dataset

save(info_VanAllen, file = "results/VanAllen2015_Info.RData")

Snyder et al 2017 dataset

#----------------------------------------------------------------
# Contribution of systemic and somatic factors to clinical response and resistance in urothelial cancer: an exploratory multi-omic analysis
#---------------------------------------------------------------
# Data Source: https://github.com/XSLiuLab/multi-omic-urothelial-anti-pdl1
# This is a fork repository, original link can also be found in this link

library(tidyverse)
uro_cli <- read_csv("../data/data_clinical.csv", col_types = cols())
uro_counts <- read_csv("../data/data_counts.csv", col_types = cols())
uro_variants <- read_csv("../data/data_variants.csv",
  col_types = "c??????"
)
uro_rna <- read_csv("../data/data_kallisto.csv", col_types = cols())
uro_rna_wide <- reshape2::dcast(uro_rna, gene_name ~ patient_id, value.var = "est_counts")

# try remove duplicated genes
uro_rna_wide %>%
  as.data.frame() %>%
  mutate(mean_expr = rowMeans(.[, -1], na.rm = TRUE)) %>%
  arrange(gene_name, desc(mean_expr)) %>%
  distinct(gene_name, .keep_all = TRUE) %>%
  dplyr::select(-mean_expr) %>%
  as.tibble() -> uro_rna_ge

length(intersect(uro_rna_ge$gene_name, GSE78220_FPKM$Gene))
uro_rna_ge <- dplyr::filter(uro_rna_ge, gene_name %in% GSE78220_FPKM$Gene)
uro_rna_ge_Norm <- uro_rna_ge
uro_rna_ge_Norm[, -1] <- log2(uro_rna_ge_Norm[, -1] + 1)

## apply GSVA
applyGSVA(merged_geneList,
  group_col = "Cell_type",
  gene_col = "Symbol", ExprMatList = list(uro_rna_ge_Norm), method = "gsva"
) -> gsva.Snyder

gsva.Snyder <- calc_TisIIs(gsva.Snyder[[1]])

APS_Snyder <- gsva.Snyder$APM
APS_Snyder <- (APS_Snyder - min_APS) / (max_APS - min_APS)

## save gene expression
rna_Snyder <- uro_rna_ge_Norm
if (!file.exists("results/Snyder2017_RNA.RData")) {
  save(rna_Snyder, file = "results/Snyder2017_RNA.RData")
}

APS_Snyder_df <- tibble(
  patient_id = colnames(uro_rna_ge_Norm)[-1],
  APS = APS_Snyder,
  IIS = gsva.Snyder$IIS
)

# generate results for random selected APS genes
set.seed(123456)
# apm_gens = merged_geneList %>% dplyr::filter(Cell_type == "APM")
sapply(1:100, function(i) {
  tempGens <- sample(uro_rna_ge_Norm$gene_name, 18)
  apm_gens$Symbol <- tempGens
  applyGSVA(apm_gens,
    group_col = "Cell_type",
    gene_col = "Symbol", ExprMatList = list(uro_rna_ge_Norm), method = "gsva"
  )
}) -> gsva.Snyder.random

gsva.Snyder.random <- as.data.frame(sapply(gsva.Snyder.random, function(x) {
  (x$APM - min_APS) / (max_APS - min_APS)
}))
colnames(gsva.Snyder.random) <- paste0("APS", 1:100)
rownames(gsva.Snyder.random) <- colnames(uro_rna_ge_Norm)[-1]

uro_cli_2 <- uro_cli %>%
  dplyr::select(patient_id, Age, Sex, is_benefit, is_benefit_os, os, `Alive Status`) %>%
  mutate(event = ifelse(`Alive Status` == "Y", 0, 1))

uro_tmb <- uro_variants %>%
  group_by(patient_id) %>%
  summarise(TMB = n(), Nonsyn = sum(!is.na(gene_name)), nTMB = TMB / 38) %>%
  mutate(patient_id = as.character(patient_id))

info_Snyder <- dplyr::full_join(uro_cli_2, dplyr::full_join(uro_tmb, APS_Snyder_df))
info_Snyder$TIGS <- info_Snyder$APS * log(info_Snyder$nTMB)

save(info_Snyder, file = "results/Snyder2017_Info.RData")



save(gsva.hugo2016.random, gsva.Snyder.random, gsva.VanAllen2015.random,
  file = "results/randomAPS.RData"
)

Prepare work for calculating other predictors

Load data firstly.

# library(tidyverse)
sigList <- read_csv("../data/ICBbiomarkerSignature.csv", comment = "#", col_types = cols())
load("results/Hugo2016_RNA.RData")
load("results/VanAllen2015_RNA.RData")
load("results/Snyder2017_RNA.RData")

ISG_IFNG <- read_csv("../data/ISG.RS_and_IFNG.GS.csv", comment = "#")
#> Parsed with column specification:
#> cols(
#>   Type = col_character(),
#>   GeneSymbol = col_character()
#> )

Add APS7

APS7.genes <- c(
  "B2M", "HLA-A", "HLA-B", "HLA-C",
  "TAP1", "TAP2", "TAPBP"
)

APS7 <- merged_geneList %>%
  dplyr::filter(Symbol %in% APS7.genes)

APS7.GSVA <- applyGSVA(APS7, group_col = "Cell_type", gene_col = "Symbol", ExprMatList = list(rna_hugo, rna_Snyder, rna_VanAllen), method = "gsva")

APS7.GSVA <- APS7.GSVA %>%
  setNames(c("Hugo2016", "Snyder2017", "VanAllen2015")) %>%
  purrr::map_df(function(x) {
    x %>%
      tibble::rownames_to_column() %>%
      dplyr::mutate(
        APM = scales::rescale(APM)
      )
  }, .id = "Study") %>%
  dplyr::rename(APS7 = APM)

Gene expression biomarkers for response to immune checkpoint blockade

This section we will describe how to calculate predicted value of gene signature biomarker using gene expression data.

Check if all gene signatures are in data.

all(sigList$Genes %in% rna_hugo$Gene)
#> [1] TRUE
all(sigList$Genes %in% rna_Snyder$gene_name)
#> [1] TRUE
all(sigList$Genes %in% rna_VanAllen$symbol)
#> [1] TRUE
all(ISG_IFNG$GeneSymbol %in% rna_hugo$Gene)
#> [1] FALSE
all(ISG_IFNG$GeneSymbol %in% rna_Snyder$gene_name)
#> [1] FALSE
all(ISG_IFNG$GeneSymbol %in% rna_VanAllen$symbol)
#> [1] FALSE

sigList <- ISG_IFNG %>%
  rename(
    Signature = Type,
    Genes = GeneSymbol
  ) %>% # make consistent
  dplyr::filter(Genes %in% purrr::reduce(list(rna_hugo$Gene, rna_Snyder$gene_name, rna_VanAllen$symbol), intersect)) %>% # 3 genes removed in this operation
  dplyr::bind_rows(sigList)

Next we retrieve all gene expression values and then calculate average.

rna_hugo_df <- rna_hugo %>%
  dplyr::filter(Gene %in% sigList$Genes)

rna_VanAllen_df <- rna_VanAllen %>%
  dplyr::filter(symbol %in% sigList$Genes) %>%
  rename(Gene = symbol)

rna_Snyder_df <- rna_Snyder %>%
  dplyr::filter(gene_name %in% sigList$Genes) %>%
  rename(Gene = gene_name)

Combine them with sigList, then transform these data.frame to long format, calculate mean gene expression and transform back to wide format.

rna_hugo_sigExpr <- rna_hugo_df %>%
  dplyr::right_join(sigList, by = c("Gene" = "Genes")) %>%
  tidyr::gather(key = "sample", value = "expression", -Gene, -Signature) %>%
  dplyr::group_by(sample, Signature) %>%
  dplyr::summarise(expression = mean(expression)) %>%
  tidyr::spread(key = Signature, value = expression) %>%
  dplyr::mutate(study = "Hugo2016") %>%
  dplyr::ungroup()

rna_VanAllen_sigExpr <- rna_VanAllen_df %>%
  dplyr::right_join(sigList, by = c("Gene" = "Genes")) %>%
  tidyr::gather(key = "sample", value = "expression", -Gene, -Signature) %>%
  dplyr::group_by(sample, Signature) %>%
  dplyr::summarise(expression = mean(expression)) %>%
  tidyr::spread(key = Signature, value = expression) %>%
  dplyr::mutate(study = "VanAllen2015") %>%
  dplyr::ungroup()

rna_Snyder_sigExpr <- rna_Snyder_df %>%
  dplyr::right_join(sigList, by = c("Gene" = "Genes")) %>%
  tidyr::gather(key = "sample", value = "expression", -Gene, -Signature) %>%
  dplyr::group_by(sample, Signature) %>%
  dplyr::summarise(expression = mean(expression)) %>%
  tidyr::spread(key = Signature, value = expression) %>%
  dplyr::mutate(study = "Snyder2017") %>%
  dplyr::ungroup()

Show results as tables.

DT::datatable(rna_hugo_sigExpr,
  options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE,
  caption = "gene signature expression in Hugo 2016 dataset"
)


DT::datatable(rna_VanAllen_sigExpr,
  options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE,
  caption = "gene signature expression in VanAllen 2015 dataset"
)


DT::datatable(rna_Snyder_sigExpr,
  options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE,
  caption = "gene signature expression in Snyder 2017 dataset"
)

Lastly, merge the three result datasets for downstream analysis.

GEBiomarker <- dplyr::bind_rows(rna_hugo_sigExpr, rna_VanAllen_sigExpr, rna_Snyder_sigExpr)

if (!file.exists("results/GEBiomarker.RData")) {
  save(GEBiomarker, file = "results/GEBiomarker.RData")
}

Input file generation for TIDE

TIDE is developed by Shirley Liu lab, software website is available at (http://tide.dfci.harvard.edu/). We can generate input file for TIDE and submit it to the website for TIDE score calculation.

Following suggestions provided at TIDE website, we firstly generate input files for three immunotherapy datasets .

rna_Snyder <- rna_Snyder %>%
  rename(Gene = gene_name)

rna_VanAllen <- rna_VanAllen %>%
  rename(Gene = symbol)

# normalized expression substract mean expression
generateInput <- function(df) {
  df %>%
    mutate(Ref = rowMeans(.[, -1])) %>%
    mutate_at(vars(-Gene, -Ref), `-`, quote(Ref)) %>% # f <- function(num, ref) num - !!ref
    select(-Ref) %>%
    column_to_rownames(var = "Gene") %>%
    as.data.frame()
}

TIDE_hugo <- generateInput(rna_hugo)
TIDE_VanAllen <- generateInput(rna_VanAllen)
TIDE_Snyder <- generateInput(rna_Snyder)


if (!file.exists("results/TIDE_input_hugo.tsv")) {
  write.table(TIDE_hugo, file = "results/TIDE_input_hugo.tsv", quote = FALSE, sep = "\t", row.names = TRUE, col.names = TRUE)
  write.table(TIDE_VanAllen, file = "results/TIDE_input_VanAllen.tsv", quote = FALSE, sep = "\t", row.names = TRUE, col.names = TRUE)
  write.table(TIDE_Snyder, file = "results/TIDE_input_Snyder.tsv", quote = FALSE, sep = "\t", row.names = TRUE, col.names = TRUE)
}

rm(list = ls())

Next we submit the three tsv file to TIDE website. Previous immunotherapy item is set as “No”. The result csv files are saved to report/results directory.

ROC comparison

This section we compare ROC (AUC value) of biomarkers for response prediction.

library(pROC)
#> Type 'citation("pROC")' for a citation.
#> 
#> Attaching package: 'pROC'
#> The following objects are masked from 'package:stats':
#> 
#>     cov, smooth, var
library(tidyverse)
# theme_set(theme_bw())
# library(cowplot)

#------------------------------------------
# Load data
load("results/Hugo2016_Info.RData")
load("results/VanAllen2015_Info.RData")
load("results/Snyder2017_Info.RData")
load("results/randomAPS.RData")

load("results/GEBiomarker.RData")
load("results/APS_alternative.GSVA.RData")

# Of note, output csv from TIDE website has extra "" in last line
#          please remove it before load data
TIDE_hugo <- read_csv("results/TIDE_output_hugo.csv", col_types = cols())
TIDE_Snyder <- read_csv("results/TIDE_output_Snyder.csv", col_types = cols())
TIDE_VanAllen <- read_csv("results/TIDE_output_VanAllen.csv", col_types = cols())

#------------------------------------------
# Hugo 2016
info_hugo <- info_hugo %>%
  dplyr::left_join(dplyr::filter(GEBiomarker, study == "Hugo2016") %>%
                     dplyr::select(-study), by = c("PatientID" = "sample")) %>%
  dplyr::left_join(TIDE_hugo %>% dplyr::select(Patient, TIDE), by = c("PatientID" = "Patient")) %>%  # Combine APS7
  dplyr::left_join(APS7.GSVA %>% 
                     dplyr::select(rowname, APS7), 
                   by = c("PatientID" = "rowname")) %>% 
  dplyr::rename(APS7_Sig = APS7) %>%  # differ from APS random
  dplyr::left_join(APS_MHC_II.GSVA %>% 
                     dplyr::select(rowname, APS_MHC_II), 
                   by = c("PatientID" = "rowname"))
  

# combine random data
gsva.hugo2016.random %>%
  rownames_to_column(var = "PatientID") %>%
  full_join(info_hugo, by = "PatientID") -> info_hugo

# Because less TIDE shows better outcome and
# Ref says ISG resistance signature (ISG.RS) is also associated with resistance to ICB
# We need reverse their score
# scale them to [0, 1]
# Then combine TMB with gene expression based markers
info_hugo = info_hugo %>% 
  dplyr::mutate(
    TIDE = -TIDE,
    ISG.RS = -ISG.RS
  ) %>% 
  dplyr::mutate_at(c("CD8", "IFNG", "IFNG.GS", "ISG.RS", 
                     "PDL1", "TIDE", "APS7_Sig", "APS_MHC_II"), scales::rescale) %>% 
  dplyr::mutate_at(c("CD8", "IFNG", "IFNG.GS", "ISG.RS", 
                     "PDL1", "TIDE", "APS7_Sig", "APS_MHC_II"),
                   list(comb_TMB= ~ . * log(nTMB+1)))

genROC = function(info, markers, response="Response", direction="<", APSrandom=1:100) {
  ROC_LIST = list()
  # Calc APS random
  APSrandom_list = paste0("APS", APSrandom)
  for (i in c(markers,APSrandom_list)) {
    message(i)
    ROC_LIST[[i]] = roc(info[[response]], info[[i]], direction="<")
  }
  ROC_LIST
}

suppressMessages(
  hugo2016_roc <- genROC(info_hugo, markers = setdiff(colnames(info_hugo)[106:126], 
                                                   "Treatment"))
)

#---------------------------
# VanAllen 2015

info_VanAllen <- info_VanAllen %>%
  dplyr::left_join(dplyr::filter(GEBiomarker, study == "VanAllen2015") %>%
                     dplyr::select(-study), by = c("patient" = "sample")) %>%
  dplyr::left_join(TIDE_VanAllen %>% dplyr::select(Patient, TIDE), by = c("patient" = "Patient")) %>% 
  dplyr::left_join(APS7.GSVA %>% 
                     dplyr::select(rowname, APS7), 
                   by = c("patient" = "rowname")) %>% 
  dplyr::rename(APS7_Sig = APS7) %>%  # differ from APS random
  dplyr::left_join(APS_MHC_II.GSVA %>% 
                     dplyr::select(rowname, APS_MHC_II), 
                   by = c("patient" = "rowname"))

info_VanAllen2 <- info_VanAllen %>%
  mutate(Response = case_when(
    group == "response" ~ "R",
    group == "nonresponse" ~ "NR",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(Response))


## -- using following command if only compare patients with TIGS availble
## -- the result are basically same
# info_VanAllen2 = info_VanAllen %>%
#     mutate(Response = case_when(
#         group == "response" ~ "R",
#         group == "nonresponse" ~"NR",
#         TRUE ~ NA_character_
#     )) %>% filter(!is.na(Response), !is.na(TIGS))

# combine random data
gsva.VanAllen2015.random %>%
  rownames_to_column(var = "patient") %>%
  full_join(info_VanAllen2, by = "patient") -> info_VanAllen2

info_VanAllen2 = info_VanAllen2 %>% 
  dplyr::mutate(
    TIDE = -TIDE,
    ISG.RS = -ISG.RS
  ) %>% 
  dplyr::mutate_at(c("CD8", "IFNG", "IFNG.GS", "ISG.RS", 
                     "PDL1", "TIDE", "APS7_Sig", "APS_MHC_II"), scales::rescale) %>% 
  dplyr::mutate_at(c("CD8", "IFNG", "IFNG.GS", "ISG.RS", 
                     "PDL1", "TIDE", "APS7_Sig", "APS_MHC_II"),
                   list(comb_TMB= ~ . * log(nTMB+1)))


suppressMessages(
  van2015_roc <- genROC(info_VanAllen2, 
                     markers = setdiff(colnames(info_VanAllen2)[114:136],
                                       c("TotalMutation", "TotalNonSyn", "Response")))
)

#---------------------------------------
# Snyder 2017

info_Snyder <- info_Snyder %>%
  dplyr::left_join(dplyr::filter(GEBiomarker, study == "Snyder2017") %>%
                     dplyr::select(-study), by = c("patient_id" = "sample")) %>%
  dplyr::left_join(TIDE_Snyder %>% dplyr::select(Patient, TIDE), by = c("patient_id" = "Patient")) %>% 
  dplyr::left_join(APS7.GSVA %>% 
                     dplyr::select(rowname, APS7), 
                   by = c("patient_id" = "rowname")) %>% 
  dplyr::rename(APS7_Sig = APS7) %>%  # differ from APS random
  dplyr::left_join(APS_MHC_II.GSVA %>% 
                     dplyr::select(rowname, APS_MHC_II), 
                   by = c("patient_id" = "rowname"))

# combine random data
gsva.Snyder.random %>%
  rownames_to_column(var = "patient_id") %>%
  full_join(info_Snyder, by = "patient_id") -> info_Snyder

info_Snyder = info_Snyder %>% 
  dplyr::mutate(
    TIDE = -TIDE,
    ISG.RS = -ISG.RS
  ) %>% 
  dplyr::mutate_at(c("CD8", "IFNG", "IFNG.GS", "ISG.RS", 
                     "PDL1", "TIDE", "APS7_Sig", "APS_MHC_II"), scales::rescale) %>% 
  dplyr::mutate_at(c("CD8", "IFNG", "IFNG.GS", "ISG.RS", 
                     "PDL1", "TIDE", "APS7_Sig", "APS_MHC_II"),
                   list(comb_TMB= ~ . * log(nTMB+1)))

suppressMessages(
  snyder2017_roc <- genROC(info_Snyder, 
                     markers = colnames(info_Snyder)[111:130], response = "is_benefit")
)


ggroc(list(TMB = van2015_roc$nTMB, TIGS = van2015_roc$TIGS, TIDE = van2015_roc$TIDE),
  legacy.axes = TRUE
) +
  labs(color = "Predictor") + 
  ggpubr::theme_classic2(base_size = 12)


ggroc(list(TMB = hugo2016_roc$nTMB, TIGS = hugo2016_roc$TIGS, TIDE = hugo2016_roc$TIDE),
  legacy.axes = TRUE
) +
  labs(color = "Predictor") + 
  ggpubr::theme_classic2(base_size = 12)


ggroc(list(TMB = snyder2017_roc$nTMB, TIGS = snyder2017_roc$TIGS, TIDE = snyder2017_roc$TIDE),
  legacy.axes = TRUE
) +
  labs(color = "Predictor") + 
  ggpubr::theme_classic2(base_size = 12)



save(info_VanAllen2, info_hugo, info_Snyder, file = "results/data_for_ICB_ROC.RData")
save(van2015_roc, hugo2016_roc, snyder2017_roc, file = "results/ROC_ObjList.RData")

From comparison of AUC value above, we can conclude that:

TIGS and TIDE are better than TMB
TIGS and TIDE have almost same performance in two melanoma datasets, however, performance of TIDE go down a little in urothelial cancer dataset, maybe resulting from the fact that TIDE is trained on data of melanoma and NSCLC cancer (this point is stated at TIDE website, cancer type except for melanoma and NSCLC may not work).
APS, TMB are also not bad predictors (they have almost same performance)
Other predictors are not good predictors, they perform differently in different datasets.

Next, we summary results above for final visualization.

# summary AUC
van2015_auc = lapply(van2015_roc, function(x) as.numeric(auc(x)))
hugo2016_auc = lapply(hugo2016_roc, function(x) as.numeric(auc(x)))
snyder2017_auc = lapply(snyder2017_roc, function(x) as.numeric(auc(x)))
auc_df = list(van2015_auc, hugo2016_auc, snyder2017_auc) %>% 
  dplyr::bind_rows() %>% 
  dplyr::mutate(Group = c("VanAllen2015", "Hugo2016", "Snyder2017"), 
                APSr = rowMeans(select(., paste0("APS", 1:100)))) %>% 
  dplyr::rename(TMB = nTMB)

# Show AUC table
DT::datatable(auc_df,
  options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE,
  caption = "AUC of biomarkers for response prediction in three ICB datasets"
)

Survival comparison

Above we already show APS, TMB, TIGS and TIDE are good predictors, this section we focus on them. We use cox model and separate patients into High and Low group in K-M survival analysis based on median biomarker predicted value.

library(survival)
library(survminer)

#----------------------
# Hugo 2016
hugo_os <- read_csv("../data/cell2016_OS.csv", col_types = cols())
info_hugo <- full_join(info_hugo, hugo_os, by = "PatientID")

# Cox model

# Cox result of TMB
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ nTMB, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~ 
#>     nTMB, data = info_hugo)
#> 
#>       coef exp(coef) se(coef)  z   p
#> nTMB -0.03      0.97     0.02 -1 0.2
#> 
#> Likelihood ratio test=3  on 1 df, p=0.08
#> n= 37, number of events= 19 
#>    (1 observation deleted due to missingness)

# Cox result of APS
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ APS, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~ 
#>     APS, data = info_hugo)
#> 
#>     coef exp(coef) se(coef)  z   p
#> APS -2.0       0.1      1.5 -1 0.2
#> 
#> Likelihood ratio test=2  on 1 df, p=0.2
#> n= 26, number of events= 12 
#>    (12 observations deleted due to missingness)

# Cox result of TIGS
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIGS, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~ 
#>     TIGS, data = info_hugo)
#> 
#>      coef exp(coef) se(coef)  z    p
#> TIGS -1.2       0.3      0.6 -2 0.03
#> 
#> Likelihood ratio test=6  on 1 df, p=0.02
#> n= 26, number of events= 12 
#>    (12 observations deleted due to missingness)

# Cox result of TIDE
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIDE, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~ 
#>     TIDE, data = info_hugo)
#> 
#>      coef exp(coef) se(coef)    z   p
#> TIDE -0.7       0.5      1.3 -0.5 0.6
#> 
#> Likelihood ratio test=0.3  on 1 df, p=0.6
#> n= 26, number of events= 12 
#>    (12 observations deleted due to missingness)


# K-M fit
info_hugo %>% # take care of NA values
  mutate(
    APS_Status = ifelse(APS > median(APS, na.rm = TRUE), "High",
      ifelse(is.na(APS), NA, "Low")
    ),
    # We reverse TIDE (do TIDE=-TIDE) in previous step
    # so here, we need to change the order
    TIDE_Status = ifelse(TIDE < median(TIDE, na.rm = TRUE), "High",
      ifelse(is.na(TIDE), NA, "Low")
    ),
    TMB_Status = ifelse(nTMB > median(nTMB), "High", "Low"),
    TIGS_Status = ifelse(TIGS > median(TIGS, na.rm = TRUE), "High",
      ifelse(is.na(TIGS), NA, "Low")
    )
  ) -> info_hugo2

# effect of TMB status on survival
fit1 <- survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TMB_Status, data = info_hugo2)
ggsurvplot(fit1,
  data = info_hugo2, pval = TRUE, fun = "pct",
  xlab = "Time (in days)"
)


# effect of TIGS status on survival
fit2 <- survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIGS_Status, data = info_hugo2)
ggsurvplot(fit2, data = info_hugo2, pval = TRUE, fun = "pct", xlab = "Time (in days)")



# effect of APS status on survival
fit3 <- survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ APS_Status, data = info_hugo2)
ggsurvplot(fit3,
  data = info_hugo2, pval = TRUE, fun = "pct",
  xlab = "Time (in days)"
)


# effect of TIDE status on survival
fit4 <- survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIDE_Status, data = info_hugo2)
ggsurvplot(fit4, data = info_hugo2, pval = TRUE, fun = "pct", xlab = "Time (in days)")


#----------------------
# Van Allen 2015

# Cox model

# Cox result of TMB
coxph(Surv(overall_survival, dead == 1) ~ nTMB, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ nTMB, data = info_VanAllen)
#> 
#>        coef exp(coef) se(coef)  z   p
#> nTMB -0.007     0.993    0.006 -1 0.2
#> 
#> Likelihood ratio test=2  on 1 df, p=0.2
#> n= 110, number of events= 83 
#>    (2 observations deleted due to missingness)

# Cox result of APS
coxph(Surv(overall_survival, dead == 1) ~ APS, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ APS, data = info_VanAllen)
#> 
#>     coef exp(coef) se(coef)  z    p
#> APS -1.7       0.2      0.8 -2 0.03
#> 
#> Likelihood ratio test=5  on 1 df, p=0.02
#> n= 42, number of events= 29 
#>    (70 observations deleted due to missingness)

# Cox result of TIGS
coxph(Surv(overall_survival, dead == 1) ~ TIGS, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ TIGS, data = info_VanAllen)
#> 
#>      coef exp(coef) se(coef)  z    p
#> TIGS -0.5       0.6      0.3 -2 0.06
#> 
#> Likelihood ratio test=5  on 1 df, p=0.03
#> n= 40, number of events= 27 
#>    (72 observations deleted due to missingness)

# Cox result of TIDE
coxph(Surv(overall_survival, dead == 1) ~ TIDE, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ TIDE, data = info_VanAllen)
#> 
#>      coef exp(coef) se(coef) z     p
#> TIDE  0.5       1.7      0.2 3 0.003
#> 
#> Likelihood ratio test=8  on 1 df, p=0.004
#> n= 42, number of events= 29 
#>    (70 observations deleted due to missingness)

info_VanAllen2 %>%
  #filter(group != "long-survival") %>%
  mutate(
    APS_Status = ifelse(APS > median(APS, na.rm = TRUE), "High",
      ifelse(is.na(APS), NA, "Low")
    ),
    # We reverse TIDE (do TIDE=-TIDE) in previous step
    # so here, we need to change the order
    TIDE_Status = ifelse(TIDE < median(TIDE, na.rm = TRUE), "High",
      ifelse(is.na(TIDE), NA, "Low")
    ),
    TMB_Status = ifelse(nTMB > median(nTMB, na.rm = TRUE), "High",
      ifelse(is.na(nTMB), NA, "Low")
    ),
    TIGS_Status = ifelse(TIGS > median(TIGS, na.rm = TRUE), "High",
      ifelse(is.na(TIGS), NA, "Low")
    )
  ) -> info_VanAllen3

# effect of TMB status on survival
fit1 <- survfit(Surv(overall_survival, dead == 1) ~ TMB_Status, data = info_VanAllen3)
ggsurvplot(fit1,
  data = info_VanAllen3, pval = TRUE, fun = "pct",
  xlab = "Time (in days)"
)


# effect of TIGS status on survival
fit2 <- survfit(Surv(overall_survival, dead == 1) ~ TIGS_Status, data = info_VanAllen3)
ggsurvplot(fit2, data = info_VanAllen3, pval = TRUE, fun = "pct", xlab = "Time (in days)")



# effect of APS status on survival
fit3 <- survfit(Surv(overall_survival, dead == 1) ~ APS_Status, data = info_VanAllen3)
ggsurvplot(fit3,
  data = info_VanAllen3, pval = TRUE, fun = "pct",
  xlab = "Time (in days)"
)


# effect of TIDE status on survival
fit4 <- survfit(Surv(overall_survival, dead == 1) ~ TIDE_Status, data = info_VanAllen3)
ggsurvplot(fit4, data = info_VanAllen3, pval = TRUE, fun = "pct", xlab = "Time (in days)")


#----------------------
# Snyder 2017

# Cox model

# Cox result of TMB
coxph(Surv(os, event) ~ nTMB, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ nTMB, data = info_Snyder)
#> 
#>       coef exp(coef) se(coef)  z   p
#> nTMB -0.02      0.98     0.02 -1 0.3
#> 
#> Likelihood ratio test=1  on 1 df, p=0.3
#> n= 22, number of events= 14 
#>    (6 observations deleted due to missingness)

# Cox result of APS
coxph(Surv(os, event) ~ APS, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ APS, data = info_Snyder)
#> 
#>     coef exp(coef) se(coef)    z   p
#> APS -1.0       0.4      1.1 -0.9 0.4
#> 
#> Likelihood ratio test=0.8  on 1 df, p=0.4
#> n= 25, number of events= 17 
#>    (3 observations deleted due to missingness)

# Cox result of TIGS
coxph(Surv(os, event) ~ TIGS, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ TIGS, data = info_Snyder)
#> 
#>      coef exp(coef) se(coef)  z   p
#> TIGS -0.5       0.6      0.3 -1 0.2
#> 
#> Likelihood ratio test=2  on 1 df, p=0.1
#> n= 22, number of events= 14 
#>    (6 observations deleted due to missingness)

# Cox result of TIDE
coxph(Surv(os, event) ~ TIDE, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ TIDE, data = info_Snyder)
#> 
#>      coef exp(coef) se(coef) z   p
#> TIDE    1         3        1 1 0.2
#> 
#> Likelihood ratio test=1  on 1 df, p=0.2
#> n= 25, number of events= 17 
#>    (3 observations deleted due to missingness)


info_Snyder %>%
  mutate(
    APS_Status = ifelse(APS > median(APS, na.rm = TRUE), "High",
      ifelse(is.na(APS), NA, "Low")
    ),
    # We reverse TIDE (do TIDE=-TIDE) in previous step
    # so here, we need to change the order
    TIDE_Status = ifelse(TIDE < median(TIDE, na.rm = TRUE), "High",
      ifelse(is.na(TIDE), NA, "Low")
    ),
    TMB_Status = ifelse(nTMB > median(nTMB, na.rm = TRUE), "High",
      ifelse(is.na(nTMB), NA, "Low")
    ),
    TIGS_Status = ifelse(TIGS > median(TIGS, na.rm = TRUE), "High",
      ifelse(is.na(TIGS), NA, "Low")
    )
  ) -> info_Snyder2

# effect of TMB status on survival
fit1 <- survfit(Surv(os, event) ~ TMB_Status, data = info_Snyder2)
ggsurvplot(fit1,
  data = info_Snyder2, pval = TRUE, fun = "pct",
  xlab = "Time (in days)"
)


# effect of TIGS status on survival
fit2 <- survfit(Surv(os, event) ~ TIGS_Status, data = info_Snyder2)
ggsurvplot(fit2, data = info_Snyder2, pval = TRUE, fun = "pct", xlab = "Time (in days)")



# effect of APS status on survival
fit3 <- survfit(Surv(os, event) ~ APS_Status, data = info_Snyder2)
ggsurvplot(fit3,
  data = info_Snyder2, pval = TRUE, fun = "pct",
  xlab = "Time (in days)"
)


# effect of TIDE status on survival
fit4 <- survfit(Surv(os, event) ~ TIDE_Status, data = info_Snyder2)
ggsurvplot(fit4, data = info_Snyder2, pval = TRUE, fun = "pct", xlab = "Time (in days)")

From comparison of cox results and K-M plots above, we can conclude that:

TIGS is very robust, in all three datasets, “High group” all show statistically significant better survival than “Low group”.
TIDE performs just so-so in Hugo 2016 dataset, very good in Van Allen 2015 dataset. But, it losts its power in Snyder 2017 dataset, patients with low TIDE value should have better survival but in this dataset patients with high TIDE value have better survival.
Patients with High TMB or APS in three datasets all show trends of better survival but not statistically significant.

Next, we summary results (cox HR value and K-M fit) above for final visualization.

# summary cox result
cox_TMB <- list()
cox_APS <- list()
cox_TIGS <- list()
cox_TIDE <- list()

summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ nTMB, data = info_hugo)) -> cox_TMB[[1]]
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ APS, data = info_hugo)) -> cox_APS[[1]]
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIGS, data = info_hugo)) -> cox_TIGS[[1]]
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIDE, data = info_hugo)) -> cox_TIDE[[1]]

summary(coxph(Surv(overall_survival, dead == 1) ~ nTMB, data = info_VanAllen)) -> cox_TMB[[2]]
summary(coxph(Surv(overall_survival, dead == 1) ~ APS, data = info_VanAllen)) -> cox_APS[[2]]
summary(coxph(Surv(overall_survival, dead == 1) ~ TIGS, data = info_VanAllen)) -> cox_TIGS[[2]]
summary(coxph(Surv(overall_survival, dead == 1) ~ TIDE, data = info_VanAllen)) -> cox_TIDE[[2]]

summary(coxph(Surv(os, event) ~ nTMB, data = info_Snyder)) -> cox_TMB[[3]]
summary(coxph(Surv(os, event) ~ APS, data = info_Snyder)) -> cox_APS[[3]]
summary(coxph(Surv(os, event) ~ TIGS, data = info_Snyder)) -> cox_TIGS[[3]]
summary(coxph(Surv(os, event) ~ TIDE, data = info_Snyder)) -> cox_TIDE[[3]]

cox_df <- tibble(
  Group = c(rep("Hugo2016", 4), rep("VanAllen2015", 4), rep("Snyder2017", 4)),
  Biomarker = rep(c("TMB", "APS", "TIGS", "TIDE"), 3),
  HR = c(
    as.numeric(cox_TMB[[1]]$conf.int[1]),
    as.numeric(cox_APS[[1]]$conf.int[1]),
    as.numeric(cox_TIGS[[1]]$conf.int[1]),
    as.numeric(cox_TIDE[[1]]$conf.int[1]),
    as.numeric(cox_TMB[[2]]$conf.int[1]),
    as.numeric(cox_APS[[2]]$conf.int[1]),
    as.numeric(cox_TIGS[[2]]$conf.int[1]),
    as.numeric(cox_TIDE[[2]]$conf.int[1]),
    as.numeric(cox_TMB[[3]]$conf.int[1]),
    as.numeric(cox_APS[[3]]$conf.int[1]),
    as.numeric(cox_TIGS[[3]]$conf.int[1]),
    as.numeric(cox_TIDE[[3]]$conf.int[1])
  ),
  Pvalue = c(
    as.numeric(cox_TMB[[1]]$logtest["pvalue"]),
    as.numeric(cox_APS[[1]]$logtest["pvalue"]),
    as.numeric(cox_TIGS[[1]]$logtest["pvalue"]),
    as.numeric(cox_TIDE[[1]]$logtest["pvalue"]),
    as.numeric(cox_TMB[[2]]$logtest["pvalue"]),
    as.numeric(cox_APS[[2]]$logtest["pvalue"]),
    as.numeric(cox_TIGS[[2]]$logtest["pvalue"]),
    as.numeric(cox_TIDE[[2]]$logtest["pvalue"]),
    as.numeric(cox_TMB[[3]]$logtest["pvalue"]),
    as.numeric(cox_APS[[3]]$logtest["pvalue"]),
    as.numeric(cox_TIGS[[3]]$logtest["pvalue"]),
    as.numeric(cox_TIDE[[3]]$logtest["pvalue"])
  )
)

# show summary data as table
DT::datatable(cox_df,
  options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE,
  caption = "HR of biomarkers for response prediction in three ICB datasets"
)



# summary K-M fit
km_TMB <- list()
km_APS <- list()
km_TIGS <- list()
km_TIDE <- list()


km_TMB[["Hugo"]] <- survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TMB_Status, data = info_hugo2)
km_TIGS[["Hugo"]] <- survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIGS_Status, data = info_hugo2)
km_APS[["Hugo"]] <- survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ APS_Status, data = info_hugo2)
km_TIDE[["Hugo"]] <- survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIDE_Status, data = info_hugo2)

km_TMB[["VanAllen"]] <- survfit(Surv(overall_survival, dead == 1) ~ TMB_Status, data = info_VanAllen3)
km_TIGS[["VanAllen"]] <- survfit(Surv(overall_survival, dead == 1) ~ TIGS_Status, data = info_VanAllen3)
km_APS[["VanAllen"]] <- survfit(Surv(overall_survival, dead == 1) ~ APS_Status, data = info_VanAllen3)
km_TIDE[["VanAllen"]] <- survfit(Surv(overall_survival, dead == 1) ~ TIDE_Status, data = info_VanAllen3)


km_TMB[["Snyder"]] <- survfit(Surv(os, event) ~ TMB_Status, data = info_Snyder2)
km_TIGS[["Snyder"]] <- survfit(Surv(os, event) ~ TIGS_Status, data = info_Snyder2)
km_APS[["Snyder"]] <- survfit(Surv(os, event) ~ APS_Status, data = info_Snyder2)
km_TIDE[["Snyder"]] <- survfit(Surv(os, event) ~ TIDE_Status, data = info_Snyder2)


km_fit <- list(TMB = km_TMB, APS = km_APS, TIGS = km_TIGS, TIDE = km_TIDE)

if (!file.exists("results/cox_km_result.RData")) {
  save(cox_df, km_fit, file = "results/cox_km_result.RData")
}

Visualization of comparison profile

This section we visualize comparsion profile, integrate analysis results of 3 immunotherapy datasets into a big figure.

# make sure these package are loaded
library(pROC)
# library(ggsci)
library(ggplot2)
suppressPackageStartupMessages(library(cowplot))

Integrate ROC curve.

theme_set(cowplot::theme_cowplot())

## -------- ROC

# Van Allen
p_roc_1 <- ggroc(list(TMB = van2015_roc$nTMB,
                      TIGS = van2015_roc$TIGS,
                      TIDE = van2015_roc$TIDE), legacy.axes = TRUE, aes = c("color", "linetype")) +
  scale_color_manual("", values = c("blue", "red", "black")) +
  scale_linetype_discrete("") + 
  theme(legend.position = c(0.8, 0.4)) + 
  geom_segment(aes(x=0, xend=1, y =0, yend=1), color = "grey", linetype = "dashed")
# Hugo
p_roc_2 <- ggroc(list(TMB = hugo2016_roc$nTMB,
                      TIGS = hugo2016_roc$TIGS,
                      TIDE = hugo2016_roc$TIDE), legacy.axes = TRUE, aes = c("color", "linetype")) +
  scale_color_manual("", values = c("blue", "red", "black")) +
  scale_linetype_discrete("") + 
  theme(legend.position = c(0.8, 0.4)) +
  geom_segment(aes(x=0, xend=1, y =0, yend=1), color = "grey", linetype = "dashed")
# Snyder
p_roc_3 <- ggroc(list(TMB = snyder2017_roc$nTMB,
                      TIGS = snyder2017_roc$TIGS,
                      TIDE = snyder2017_roc$TIDE), legacy.axes = TRUE, aes = c("color", "linetype")) +
  scale_color_manual("", values = c("blue", "red", "black")) +
  scale_linetype_discrete("") + 
  theme(legend.position = c(0.8, 0.4)) +
  geom_segment(aes(x=0, xend=1, y =0, yend=1), color = "grey", linetype = "dashed")

p_roc_4 <- ggroc(list(
  TIGS = van2015_roc$TIGS, APS = van2015_roc$APS, PDL1 = van2015_roc$PDL1,
  CD8 = van2015_roc$CD8, IFNG = van2015_roc$IFNG, IIS = van2015_roc$IIS, 
  ISG.RS = van2015_roc$ISG.RS, IFNG.GS = van2015_roc$IFNG.GS), aes = c("color", "linetype"),
  legacy.axes = TRUE) +
  scale_color_discrete("") +
  scale_linetype_discrete("") + 
  theme(legend.position = c(0.8, 0.4)) + 
  geom_segment(aes(x=0, xend=1, y =0, yend=1), color = "grey", linetype = "dashed")

p_roc_5 <- ggroc(list(
  TIGS = hugo2016_roc$TIGS, APS = hugo2016_roc$APS, PDL1 = hugo2016_roc$PDL1,
  CD8 = hugo2016_roc$CD8, IFNG = hugo2016_roc$IFNG, IIS = hugo2016_roc$IIS,
  ISG.RS = hugo2016_roc$ISG.RS, IFNG.GS = hugo2016_roc$IFNG.GS), aes = c("color", "linetype"),
  legacy.axes = TRUE) +
  scale_color_discrete("") +
  scale_linetype_discrete("") + 
  theme(legend.position = c(0.8, 0.4)) + 
  geom_segment(aes(x=0, xend=1, y =0, yend=1), color = "grey", linetype = "dashed")

p_roc_6 <- ggroc(list(
  TIGS = snyder2017_roc$TIGS, APS = snyder2017_roc$APS, PDL1 = snyder2017_roc$PDL1,
  CD8 = snyder2017_roc$CD8, IFNG = snyder2017_roc$IFNG, IIS = snyder2017_roc$IIS,
  ISG.RS = snyder2017_roc$ISG.RS, IFNG.GS = snyder2017_roc$IFNG.GS), aes = c("color", "linetype"),
  legacy.axes = TRUE) +
  scale_color_discrete("") +
  scale_linetype_discrete("") + 
  theme(legend.position = c(0.8, 0.4)) +
  geom_segment(aes(x=0, xend=1, y =0, yend=1), color = "grey", linetype = "dashed")


p_extra <- plot_grid(p_roc_4, p_roc_5, p_roc_6, nrow = 1)
save_plot(filename = "extra_roc_comparison.pdf", plot = p_extra, nrow = 1, 
          ncol = 3, base_aspect_ratio = 1.5)

show_cols = c("APS", "APSr", "IIS", "TMB", "TIGS", "TIDE", "CD8", "PDL1", "IFNG", "IFNG.GS", "ISG.RS")
show_extcols = c("TIDE_comb_TMB", "APS7_Sig", "APS7_Sig_comb_TMB")

## -------- AUC values
auc_df_long <- auc_df %>%
  dplyr::select(c(colnames(auc_df)[1:20], "APSr", "Group")) %>% 
  tidyr::gather(key = "biomarker", value = "auc", -Group)

# p_auc_1 =
auc_df_long$auc <- round(auc_df_long$auc, digits = 3)

# AUC for PHBR comes from result of code/PHBR_comparison_*.R

auc_df_long %>%
  filter(Group == "VanAllen2015", biomarker %in% show_cols) %>%
  ggplot(aes(x = biomarker, y = auc)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = auc), hjust = 1) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  ylim(0, 0.85) + ylab("AUC") + theme(axis.title.y = element_blank()) +
  coord_flip() -> p_auc_1

auc_df_long %>%
  filter(Group == "Hugo2016", biomarker %in% show_cols) %>%
  ggplot(aes(x = biomarker, y = auc)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = auc), hjust = 1) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  ylim(0, 0.8) + ylab("AUC") + theme(axis.title.y = element_blank()) +
  coord_flip() -> p_auc_2

auc_df_long %>%
  filter(Group == "Snyder2017", biomarker %in% show_cols) %>%
  ggplot(aes(x = biomarker, y = auc)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = auc), hjust = 1) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  ylim(0, 0.8) + ylab("AUC") + theme(axis.title.y = element_blank()) +
  coord_flip() -> p_auc_3

auc_df_long %>%
  filter(Group == "VanAllen2015", biomarker %in% show_extcols) %>%
  add_row(Group="VanAllen2015", biomarker=c("PHBR_I", "PHBR_I_comb_TMB"), auc=c(0.57, 0.71)) %>%  
  ggplot(aes(x = biomarker, y = auc)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = auc), hjust = 1) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  ylim(0, 0.85) + ylab("AUC") + theme(axis.title.y = element_blank()) +
  coord_flip() -> p_auc_ext1

auc_df_long %>%
  filter(Group == "Hugo2016", biomarker %in% show_extcols) %>%
  ggplot(aes(x = biomarker, y = auc)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = auc), hjust = 1) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  ylim(0, 0.8) + ylab("AUC") + theme(axis.title.y = element_blank()) +
  coord_flip() -> p_auc_ext2

auc_df_long %>%
  filter(Group == "Snyder2017", biomarker %in% show_extcols) %>%
  add_row(Group="Snyder2017", biomarker=c("PHBR_I", "PHBR_I_comb_TMB"), auc=c(0.68, 0.52)) %>%  
  ggplot(aes(x = biomarker, y = auc)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = auc), hjust = 1) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  ylim(0, 0.8) + ylab("AUC") + theme(axis.title.y = element_blank()) +
  coord_flip() -> p_auc_ext3

auc_ext = cowplot::plot_grid(p_auc_ext1 + ggtitle("Van Allen et al 2015"), 
                             p_auc_ext2 + ggtitle("Hugo et al 2016"), 
                             p_auc_ext3 + ggtitle("Snyder et al 2017"), nrow = 1) 
ggsave("extra_auc.pdf", auc_ext, width = 12, height = 2.5)

#--- KM-plots
p_surv_APS <- list()
p_surv_TMB <- list()
p_surv_TIGS <- list()
p_surv_TIDE <- list()

p_surv_APS[[1]] <- ggsurvplot(km_APS$VanAllen,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("APS High", "APS Low")
)
p_surv_APS[[2]] <- ggsurvplot(km_APS$Hugo,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("APS High", "APS Low")
)
p_surv_APS[[3]] <- ggsurvplot(km_APS$Snyder,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("APS High", "APS Low")
)

p_surv_TMB[[1]] <- ggsurvplot(km_TMB$VanAllen,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TMB High", "TMB Low")
)
p_surv_TMB[[2]] <- ggsurvplot(km_TMB$Hugo,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TMB High", "TMB Low")
)
p_surv_TMB[[3]] <- ggsurvplot(km_TMB$Snyder,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TMB High", "TMB Low")
)

p_surv_TIGS[[1]] <- ggsurvplot(km_TIGS$VanAllen,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TIGS High", "TIGS Low")
)
p_surv_TIGS[[2]] <- ggsurvplot(km_TIGS$Hugo,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.4), legend.title = element_blank(), legend.labs = c("TIGS High", "TIGS Low")
)
p_surv_TIGS[[3]] <- ggsurvplot(km_TIGS$Snyder,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TIGS High", "TIGS Low")
)

p_surv_TIDE[[1]] <- ggsurvplot(km_TIDE$VanAllen,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TIDE High", "TIDE Low")
)
p_surv_TIDE[[2]] <- ggsurvplot(km_TIDE$Hugo,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TIDE High", "TIDE Low")
)
p_surv_TIDE[[3]] <- ggsurvplot(km_TIDE$Snyder,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TIDE High", "TIDE Low")
)


p_aps <- plot_grid(plotlist = lapply(p_surv_APS, function(x) x$plot), ncol = 3)
save_plot(filename = "aps_kmplot.pdf", plot = p_aps, ncol = 3, base_aspect_ratio = 1.3)
# ggpar(p , font.legend = c(12, "bold", "black"))

plot_list = list(
  p_roc_1, p_roc_2, p_roc_3,
  p_auc_1, p_auc_2, p_auc_3,
  p_surv_TMB[[1]]$plot, p_surv_TMB[[2]]$plot, p_surv_TMB[[3]]$plot,
  p_surv_TIGS[[1]]$plot, p_surv_TIGS[[2]]$plot, p_surv_TIGS[[3]]$plot,
  p_surv_TIDE[[1]]$plot, p_surv_TIDE[[2]]$plot, p_surv_TIDE[[3]]$plot
)

# plot_list <- list(
#   p_roc_1, p_auc_1, p_surv_TMB[[1]]$plot, p_surv_TIGS[[1]]$plot, p_surv_TIDE[[1]]$plot,
#   p_roc_2, p_auc_2, p_surv_TMB[[2]]$plot, p_surv_TIGS[[2]]$plot, p_surv_TIDE[[2]]$plot,
#   p_roc_3, p_auc_3, p_surv_TMB[[3]]$plot, p_surv_TIGS[[3]]$plot, p_surv_TIDE[[3]]$plot
# )

# plot_grid(p_roc_1, p_roc_2, p_roc_3,
#           p_auc_1, p_auc_2, p_auc_3,
#           p_surv_TMB[[1]]$plot, p_surv_TMB[[2]]$plot, p_surv_TMB[[3]]$plot,
#           p_surv_TIGS[[1]]$plot, p_surv_TIGS[[2]]$plot, p_surv_TIGS[[3]]$plot,
#           p_surv_TIDE[[1]]$plot, p_surv_TIDE[[2]]$plot, p_surv_TIDE[[3]]$plot, nrow = 3, ncol=5,
#           align = "vh")
plot_res <- plot_grid(
  plotlist = plot_list, nrow = 5, ncol = 3
)
save_plot("comp_profile.png", plot_res, nrow = 5, ncol = 3)

knitr::include_graphics("comp_profile.png")

Save as pdf for publication.

save_plot("comp_profile.pdf", plot_res, nrow = 5, ncol = 3)

Session info

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS:   /home/public/R/R-base/lib64/R/lib/libRblas.so
#> LAPACK: /home/public/R/R-base/lib64/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] grid      stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] cowplot_1.0.0       survminer_0.4.6     pROC_1.15.3        
#>  [4] ggrepel_0.8.1       forestplot_1.9      checkmate_1.9.4    
#>  [7] survival_3.1-7      ggpubr_0.2.3        magrittr_1.5       
#> [10] scales_1.0.0        corrplot_0.84       pheatmap_1.0.12    
#> [13] forcats_0.4.0       stringr_1.4.0       purrr_0.3.3        
#> [16] tidyr_1.0.0         tibble_2.1.3        ggplot2_3.2.1      
#> [19] tidyverse_1.2.1     readr_1.3.1         dplyr_0.8.3        
#> [22] UCSCXenaTools_1.2.7 rmdformats_0.3.5    knitr_1.26         
#> [25] pacman_0.5.1       
#> 
#> loaded via a namespace (and not attached):
#>  [1] utf8_1.1.4             questionr_0.7.0        R.utils_2.9.0         
#>  [4] tidyselect_0.2.5       lme4_1.1-21            robust_0.4-18.1       
#>  [7] htmlwidgets_1.5.1      munsell_0.5.0          codetools_0.2-16      
#> [10] DT_0.10                future_1.15.0          miniUI_0.1.1.1        
#> [13] withr_2.1.2            Brobdingnag_1.2-6      metaBMA_0.6.2         
#> [16] colorspace_1.4-1       highr_0.8              rstudioapi_0.10       
#> [19] stats4_3.6.0           DescTools_0.99.30      robustbase_0.93-5     
#> [22] rcompanion_2.3.7       ggsignif_0.6.0         listenv_0.7.0         
#> [25] labeling_0.3           emmeans_1.4.2          rstan_2.19.2          
#> [28] KMsurv_0.1-5           mnormt_1.5-5           MCMCpack_1.4-4        
#> [31] bridgesampling_0.7-2   coda_0.19-3            vctrs_0.2.0           
#> [34] generics_0.0.2         TH.data_1.0-10         metafor_2.1-0         
#> [37] xfun_0.11              R6_2.4.0               BayesFactor_0.9.12-4.2
#> [40] reshape_0.8.8          logspline_2.1.15       assertthat_0.2.1      
#> [43] promises_1.1.0         multcomp_1.4-10        ggExtra_0.9           
#> [46] gtable_0.3.0           multcompView_0.1-7     globals_0.12.4        
#> [49] processx_3.4.1         mcmc_0.9-6             sandwich_2.5-1        
#> [52] rlang_0.4.1            MatrixModels_0.4-1     EMT_1.1               
#> [55] zeallot_0.1.0          splines_3.6.0          TMB_1.7.15            
#> [58] lazyeval_0.2.2         broom_0.5.2            inline_0.3.15         
#> [61] abind_1.4-5            yaml_2.2.0             reshape2_1.4.3        
#> [64] modelr_0.1.5           crosstalk_1.0.0        backports_1.1.5       
#> [67] httpuv_1.5.2           tools_3.6.0            bookdown_0.15         
#> [70] psych_1.8.12           ellipsis_0.3.0         RColorBrewer_1.1-2    
#> [73] WRS2_1.0-0             ez_4.4-0               Rcpp_1.0.3            
#>  [ reached getOption("max.print") -- omitted 100 entries ]

Supplementary analyses

Random selection of APM genes or IIS genes

TCGA contains about 10,000 patients with RNASeq data, it is hard to simulate GSVA serveral times with whole data (too much computation), thus we using a sampling strategy as following:

randomly select RNAseq data from 500 patients.
calculate GSVA score
- calculate once from right genes.
- randomly select genes (10 times) and run GSVA.
calculate spearman correlation coefficient for normal APS and normal IIS, random APS and normal IIS, normal APS and random IIS, respectively. The latter two have ten values, we use their mean.
repeat the process 100 times.

The GSVA score calculation is implemented by script code/random_simulation.R (this will take much time to finish). Next we calculate correlation using following commands.

load("results/randomRes/normal_results.RData")
load("results/randomRes/random_res.RData")

source("../code/functions.R")
cor_res <- data.frame(stringsAsFactors = FALSE)
for (i in 1:100) {
  normal <- normal_res[[i]][[1]]
  normal <- calc_TisIIs(normal)

  r1 <- as.numeric(cor.test(normal$APM, normal$IIS, method = "spearman")$estimate)

  r2 <- c() # normal APM, random IIS
  r3 <- c() # random APM, normal IIS
  for (j in 1:10) {
    random <- random_res[[i]][[j]][[1]]
    tryCatch({
      random <- calc_TisIIs(random)
      r_random1 <- as.numeric(cor.test(normal$APM, random$IIS, method = "spearman")$estimate)
      r2 <- c(r2, r_random1)
      r_random2 <- as.numeric(cor.test(random$APM, normal$IIS, method = "spearman")$estimate)
      r3 <- c(r3, r_random2)
    },
    error = function(e) {
      NA
    }
    )
  }
  r2 <- mean(r2, na.rm = TRUE)
  r3 <- mean(r3, na.rm = TRUE)
  cor_res <- rbind(cor_res, c(r1, r2, r3))
}

colnames(cor_res) <- c("normal", "random_IIS", "random_APS")

write_tsv(cor_res, "results/randomGSVA.tsv")

Now, we plot spearman correlation efficient of 3 types.

library(tidyverse)
library(cowplot)
cor_res <- read_tsv("results/randomGSVA.tsv", col_types = cols())

cor_res_long <- gather(cor_res, key = "Type", value = "SpearmanCoeff")
cor_res_long$Type <- factor(cor_res_long$Type)

p <- ggplot(cor_res_long, aes(x = SpearmanCoeff, color = Type)) +
  stat_density(geom = "line") +
  xlab("Spearman correlation coefficient") + ylab("Density") +
  guides(color = guide_legend(title = NULL)) +
  scale_color_discrete(labels = c("r(APS, IIS)", "r(APSr, IIS)", "r(APS, IISr"))
p

# save
save_plot("random_simulation.pdf", plot = p, base_aspect_ratio = 1.6)

We can clearly see that if we change gene list for APS or IIS calculation, the strong positive correlation between APS and IIS will no longer exist.

Explore correlation between ORR and TIGS from TCGA data

A reviewer commented on our analysis which showed in section “TIGS definition and performance”:

it appears that the authors lumped together data from different data sets (e.g. reference 26 for tumor mutational burden and TCGA data for expression)

This analysis we adopted same idea from Tumor Mutational Burden and Response Rate to PD-1 inhibition. To further check reliability of this analysis, here we use gene expression and TMB data from TCGA and ORR data from clinical literatures.

Firstly we load TCGA data and calculate TIGS score.

library(dplyr)
load("results/TCGA_ALL.RData")

tcga_all <- tcga_all %>%
  mutate(
    nAPM = (APM - min(APM, na.rm = TRUE)) / (max(APM, na.rm = TRUE) - min(APM, na.rm = TRUE)),
    nTMB = TMB_NonsynVariants / 38,
    TIGS = log(nTMB + 1) * nAPM
  )

Next we calculate median of APS, TMB, TIGS and their corresponding sample counts for each tumor type.

tcga_tigs <- tcga_all %>%
  dplyr::filter(!is.na(Project)) %>%
  group_by(Project) %>%
  summarise(
    Patients_TIGS = sum(!is.na(TIGS)), TIGS = median(TIGS, na.rm = TRUE),
    Patients_APS = sum(!is.na(nAPM)), APS = median(nAPM, na.rm = TRUE),
    Patients_TMB = sum(!is.na(nTMB)), TMB = median(nTMB, na.rm = TRUE)
  )
DT::datatable(tcga_tigs)

We then save this data and merge them with ORR data according to corresponding tumor types.

readr::write_csv(tcga_tigs, path = "results/TCGA_TIGS_byProject2.csv")

After merging data, we load data into R.

tcga_orr <- readr:::read_csv("results/TCGA_TIGS_byProject2.csv")
#> Parsed with column specification:
#> cols(
#>   Project = col_character(),
#>   Patients_TIGS = col_double(),
#>   TIGS = col_double(),
#>   Patients_APS = col_double(),
#>   APS = col_double(),
#>   Patients_TMB = col_double(),
#>   TMB = col_double(),
#>   Patients_ORR = col_double(),
#>   ORR = col_double(),
#>   Long_Name = col_character()
#> )

Now we plot figures.

library(ggplot2)
require(ggrepel)
require(scales)

APS

ggplot(filter(tcga_orr, !is.na(ORR)), aes(x = APS, y = ORR)) +
  geom_point(aes(color = Patients_APS, size = Patients_ORR)) +
  geom_smooth(method = "lm", se = T) +
  geom_text_repel(aes(label = Project), size = 3) +
  labs(
    x = "Median Normalized APM Score ", y = "Objective Response Rate (%)",
    size = "Objective Response Rate\n(no. of patients evaluated)",
    color = "APM Score\n(no. of tumor analyzed)"
  ) +
  scale_size_continuous(breaks = c(50, 100, 500, 1000)) +
  scale_color_gradientn(
    colours = RColorBrewer::brewer.pal(5, name = "OrRd")[-1],
    breaks = c(50, 200, 500, 1000)
  ) +
  theme_bw() +
  guides(
    color = guide_colorbar(order = 1)
  )

Significant correlation is observed.

lm(ORR ~ APS, filter(tcga_orr, !is.na(ORR))) %>% summary()
#> 
#> Call:
#> lm(formula = ORR ~ APS, data = filter(tcga_orr, !is.na(ORR)))
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -12.37  -3.92  -1.01   2.54  18.73 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)    -2.27       5.54   -0.41   0.6870   
#> APS            30.01      10.03    2.99   0.0086 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 7.1 on 16 degrees of freedom
#> Multiple R-squared:  0.359,  Adjusted R-squared:  0.319 
#> F-statistic: 8.96 on 1 and 16 DF,  p-value: 0.00861

R and R square can be given as:

# R
filter(tcga_orr, !is.na(ORR)) %>%
  summarise(r = cor(ORR, APS)) %>%
  pull(r)
#> [1] 0.6
# R square
lm(ORR ~ APS, filter(tcga_orr, !is.na(ORR))) %>%
  summary() %>%
  .$r.squared
#> [1] 0.36

TMB

ggplot(filter(tcga_orr, !is.na(ORR)), aes(x = TMB, y = ORR)) +
  geom_point(aes(color = Patients_TMB, size = Patients_ORR)) +
  geom_smooth(method = "lm", se = T) +
  geom_text_repel(aes(label = Project), size = 3) +
  labs(
    x = "Median No. of Coding Somatic Mutation per MB", y = "Objective Response Rate (%)",
    size = "Objective Response Rate\n(no. of patients evaluated)",
    color = "Tumor Mutational Burden\n(no. of tumor analyzed)"
  ) +
  scale_x_continuous(
    trans = log_trans(),
    breaks = c(2, 10, 20, 30, 40, 50),
    labels = c(2, 10, 20, 30, 40, 50)
  ) +
  scale_size_continuous(breaks = c(50, 100, 500, 1000)) +
  scale_color_gradientn(
    colours = RColorBrewer::brewer.pal(5, name = "OrRd")[-1],
    breaks = c(100, 1000, 5000, 10000)
  ) +
  theme_bw() +
  guides(
    color = guide_colorbar(order = 1)
  )

Significant correlation is observed.

lm(ORR ~ TMB, filter(tcga_orr, !is.na(ORR))) %>% summary()
#> 
#> Call:
#> lm(formula = ORR ~ TMB, data = filter(tcga_orr, !is.na(ORR)))
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -8.69  -3.14  -1.44   4.19  13.51 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    5.794      1.952    2.97   0.0091 ** 
#> TMB            3.147      0.602    5.23  8.3e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 5.4 on 16 degrees of freedom
#> Multiple R-squared:  0.631,  Adjusted R-squared:  0.608 
#> F-statistic: 27.3 on 1 and 16 DF,  p-value: 8.28e-05

R and R square can be given as:

# R
filter(tcga_orr, !is.na(ORR)) %>%
  summarise(r = cor(ORR, TMB)) %>%
  pull(r)
#> [1] 0.79
# R square
lm(ORR ~ TMB, filter(tcga_orr, !is.na(ORR))) %>%
  summary() %>%
  .$r.squared
#> [1] 0.63

TIGS

ggplot(filter(tcga_orr, !is.na(ORR)), aes(x = TIGS, y = ORR)) +
  geom_point(aes(size = Patients_ORR)) +
  geom_smooth(method = "lm", se = T) +
  geom_text_repel(aes(label = Project), size = 3) +
  labs(
    x = "Tumor Immunogenicity Score ", y = "Objective Response Rate (%)",
    size = "Objective Response Rate\n(no. of patients evaluated)"
  ) +
  scale_size_continuous(breaks = c(50, 100, 500, 1000)) +
  theme_bw()

Significant correlation is observed.

lm(ORR ~ TIGS, filter(tcga_orr, !is.na(ORR))) %>% summary()
#> 
#> Call:
#> lm(formula = ORR ~ TIGS, data = filter(tcga_orr, !is.na(ORR)))
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -8.639 -1.122 -0.288  1.826  9.154 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)     1.22       2.04    0.60     0.56    
#> TIGS           22.89       3.26    7.02  2.9e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.4 on 16 degrees of freedom
#> Multiple R-squared:  0.755,  Adjusted R-squared:  0.74 
#> F-statistic: 49.3 on 1 and 16 DF,  p-value: 2.89e-06

R and R square can be given as:

# R
filter(tcga_orr, !is.na(ORR)) %>%
  summarise(r = cor(ORR, TIGS)) %>%
  pull(r)
#> [1] 0.87
# R square
lm(ORR ~ TIGS, filter(tcga_orr, !is.na(ORR))) %>%
  summary() %>%
  .$r.squared
#> [1] 0.75

These data are consistent with our previous results and even better, which may be caused by fewer included tumor types.

Predict ORR for other tumor types

Using constructed linear formula, we can predict ORR of tumor types which (median) TIGS available.

sm_data <- read_csv("../data/summary_data_new_20190411.csv", col_types = cols())
sm_data$TIGS <- sm_data$Pool_APM * log(sm_data$Pool_TMB + 1)

fit <- lm(Pool_ORR ~ TIGS, filter(sm_data, !is.na(Pool_ORR)))
fit
#> 
#> Call:
#> lm(formula = Pool_ORR ~ TIGS, data = filter(sm_data, !is.na(Pool_ORR)))
#> 
#> Coefficients:
#> (Intercept)         TIGS  
#>       -2.66        21.39

For example , we can predict ORR for uterine corpus endometrial carcinoma (UCEC) on the basis of median TIGS of 0.7.

nd <- data.frame(TIGS = c(0.7))
predict(fit, newdata = nd, interval = "confidence")
#>   fit lwr upr
#> 1  12 8.8  16