Antigen presentation and tumor immunogenicity in cancer immunotherapy response prediction

Please note this work under Apache License v2 license, copyright belongs to Shixiang Wang and Xue-Song Liu. And the study has been applied for a national patent in China.

This document is compiled from an Rmarkdown file which contains all code or description necessary to reproduce the analysis for the accompanying project. Each section below describes a different component of the analysis and all numbers and figures are generated directly from the underlying data on compilation.

Dependencies

The preprocessing step is dependent on R software and some R packages, if you have not used R yet or R is not installed, please see https://cran.r-project.org/.

R packages:

  • UCSCXenaTools - download data from UCSC Xena
  • GEOquery - download data from NCBI GEO database
  • tidyverse - operate data, plot
  • data.table - operate data
  • survival - built in R, used to do survival analysis
  • metafor, metawho - meta-analysis
  • forestmodel - generate forestplot for meta-analysis model
  • forestplot - plot forestplot
  • survminer - plot survival fit
  • pROC - ROC analysis and visualization
  • TCGAmutations - download TCGA mutation data
  • DT - show data table as a table in html
  • GSVA - GSVA algorithm implementation
  • ggstatsplot - plot scatter with linear fit
  • corrplot - plot correlation
  • knitr, rmdformats - used to compile this file
  • readxl - read xlsx data
  • Some other dependent packages

These R packages are easily searched by internet and installed from either CRAN or Bioconductor, they have no strict version requirements to reproduce the following analyses.

Data download and preprocessing

This part will clearly describe how to process raw data but not provide an easy script for re-running all preprocess steps just by a click due to the complexity of data preprocessing.

TCGA Pan-cancer data download

TCGA Pan-cancer data (Version 2017-10-13), including datasets of clinical informaiton, gene expression, are downloaded from UCSC Xena via R package UCSCXenaTools.

UCSCXenaTools is developed by Shixiang and it is an R package to robustly access data from UCSC Xena data hubs. More please see paper: Wang et al., (2019). The UCSCXenaTools R package: a toolkit for accessing genomics data from UCSC Xena platform, from cancer multi-omics to single-cell RNA-seq. Journal of Open Source Software, 4(40), 1627, https://doi.org/10.21105/joss.01627

If you have not installed this package yet, please run following command in R.

Load the package with:

Obtain datasets information available at TCGA data hubs of UCSC Xena:

Obtain clinical datasets of TCGA:

Obtain gene expression datasets of TCGA:

Create data queries and download them:

The RNASeq data we downloaded are pancan normalized.

For comparing data within independent cohort (like TCGA-LUAD), we recommend to use the “gene expression RNAseq” dataset. For questions regarding the gene expression of this particular cohort in relation to other types tumors, you can use the pancan normalized version of the “gene expression RNAseq” data. For comparing with data outside TCGA, we recommend using the percentile version if the non-TCGA data is normalized by percentile ranking. For more information, please see our Data FAQ: here.

These datasets are downloaded to local machine, we need to load them into R. However, whether filenames of datasets or contents in datasets all look messy, next we need to clean them before real analysis.

TCGA pan-cancer data clean

Clean filenames

First, clean filenames.

Check.

Obtain TCGA project abbreviations from https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations and read as a data.frame in R.

Compare difference.

Remove TCGA.FPPP.sampleMap__FPPP_clinicalMatrix.gz which has no RNAseq data. Remove other 3 datasets which merge more than one TCGA study, only keep individual study from TCGA.

Clean clinical datasets

Now, we read TCGA clinical datasets and clean them.

library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
#> ✔ tibble  2.1.3     ✔ stringr 1.4.0
#> ✔ tidyr   1.0.0     ✔ forcats 0.4.0
#> ── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
# keep valuable columns of clinical datasets
select_cols <- c(
  "sampleID", "OS", "OS.time", "OS.unit", "RFS", "RFS.time", "RFS.unit",
  "age_at_initial_pathologic_diagnosis", "gender", "tobacco_smoking_history",
  "tobacco_smoking_history_indicator", "sample_type", "pathologic_M",
  "pathologic_N", "pathologic_T", "pathologic_stage"
)

# read clinical files
Cli_DIR <- paste0(TCGA_DIR, "/Clinical")

#----------------------------------
# Load and preprocess clinical data
#----------------------------------
Clinical_List <- XenaPrepare(paste0(Cli_DIR, "/", Clinical_filelist2))
names(Clinical_List)
#>  [1] "TCGA.ACC.sampleMap__ACC_clinicalMatrix.gz"  
#>  [2] "TCGA.BLCA.sampleMap__BLCA_clinicalMatrix.gz"
#>  [3] "TCGA.BRCA.sampleMap__BRCA_clinicalMatrix.gz"
#>  [4] "TCGA.CESC.sampleMap__CESC_clinicalMatrix.gz"
#>  [5] "TCGA.CHOL.sampleMap__CHOL_clinicalMatrix.gz"
#>  [6] "TCGA.COAD.sampleMap__COAD_clinicalMatrix.gz"
#>  [7] "TCGA.DLBC.sampleMap__DLBC_clinicalMatrix.gz"
#>  [8] "TCGA.ESCA.sampleMap__ESCA_clinicalMatrix.gz"
#>  [9] "TCGA.GBM.sampleMap__GBM_clinicalMatrix.gz"  
#> [10] "TCGA.HNSC.sampleMap__HNSC_clinicalMatrix.gz"
#> [11] "TCGA.KICH.sampleMap__KICH_clinicalMatrix.gz"
#> [12] "TCGA.KIRC.sampleMap__KIRC_clinicalMatrix.gz"
#> [13] "TCGA.KIRP.sampleMap__KIRP_clinicalMatrix.gz"
#> [14] "TCGA.LAML.sampleMap__LAML_clinicalMatrix.gz"
#> [15] "TCGA.LGG.sampleMap__LGG_clinicalMatrix.gz"  
#> [16] "TCGA.LIHC.sampleMap__LIHC_clinicalMatrix.gz"
#> [17] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
#> [18] "TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz"
#> [19] "TCGA.MESO.sampleMap__MESO_clinicalMatrix.gz"
#> [20] "TCGA.OV.sampleMap__OV_clinicalMatrix.gz"    
#> [21] "TCGA.PAAD.sampleMap__PAAD_clinicalMatrix.gz"
#> [22] "TCGA.PCPG.sampleMap__PCPG_clinicalMatrix.gz"
#> [23] "TCGA.PRAD.sampleMap__PRAD_clinicalMatrix.gz"
#> [24] "TCGA.READ.sampleMap__READ_clinicalMatrix.gz"
#> [25] "TCGA.SARC.sampleMap__SARC_clinicalMatrix.gz"
#> [26] "TCGA.SKCM.sampleMap__SKCM_clinicalMatrix.gz"
#> [27] "TCGA.STAD.sampleMap__STAD_clinicalMatrix.gz"
#> [28] "TCGA.TGCT.sampleMap__TGCT_clinicalMatrix.gz"
#> [29] "TCGA.THCA.sampleMap__THCA_clinicalMatrix.gz"
#> [30] "TCGA.THYM.sampleMap__THYM_clinicalMatrix.gz"
#> [31] "TCGA.UCEC.sampleMap__UCEC_clinicalMatrix.gz"
#> [32] "TCGA.UCS.sampleMap__UCS_clinicalMatrix.gz"  
#> [33] "TCGA.UVM.sampleMap__UVM_clinicalMatrix.gz"
sub("TCGA\\.(.*)\\.sampleMap.*", "\\1", names(Clinical_List)) -> project_code

TCGA_Clinical <- tibble()
# for loop
for (i in 1:length(project_code)) {
  clinical <- names(Clinical_List)[i]
  project <- project_code[i]
  df <- Clinical_List[[clinical]]
  col_exist <- select_cols %in% colnames(df)
  res <- tibble()
  if (!all(col_exist)) {
    res <- df[, select_cols[col_exist]]
    res[, select_cols[!col_exist]] <- NA
  } else {
    res <- df[, select_cols]
  }
  res$Project <- project
  res %>% select(Project, select_cols) -> res
  TCGA_Clinical <- bind_rows(TCGA_Clinical, res)
}

rm(res, df, i, clinical, project, col_exist) # remove temp variables

View the clinical data, check variables.

All unit are same in days, remove two column with redundant information.

Continue to tidy clinical datasets: rename variables, filter 14 samples with unusual sample types, rename variable value and . After cleanning, save the result object TCGA_Clinical.tidy for following analysis.

Selection of APM genes and marker genes of immune cell type

Marker genes for immune cell types were obtained from Senbabaoglu, Y. et al, selection of genes involved in processing and presentation of antigen (APM) was also inspired by this paper and finally 18 genes were selected based on literature review of APM and some facts that mutations or deletions affecting genes encoding APM (HLA, beta2-microglobulin, TAP1/2 etc.) may lead to reduced presentation of neoantigens by a cancer cell.

View gene list.

Remove all variables.

Calculation of APS, TIS and IIS

The GSVA package allows one to perform a change in coordinate systems of molecular measurements, transforming the data from a gene by sample matrix to a gene-set by sample matrix, thereby allowing the evaluation of pathway enrichment for each sample. This new matrix of GSVA enrichment scores facilitates applying standard analytical methods like functional enrichment, survival analysis, clustering, CNV-pathway analysis or cross-tissue pathway analysis, in a pathway-centric manner.

APM score (APS) and score of each immune cell type are directly calculated from GSVA method. Aggregate TIS (T cell infiltration score) and IIS (Immune cell infiltration score) are calculated using following method, their effectiveness have been validated by many studies.

The TIS was defined as the mean of the standardized values for the following T cell subsets: CD8 T, T helper, T, T central and effector memory, Th1, Th2, Th17, and Treg cells. The immune infiltration score (IIS) for a sample was defined as the mean of standardized values for macrophages, DC subsets (total, plasmacytoid, immature, activated), B cells, cytotoxic cells, eosinophils, mast cells, neutrophils, NK cell subsets (total, CD56 bright, CD56 dim), and all T cell subsets (CD8 T, T helper, T central and effector memory, Th1, Th2, Th17, and Treg cells). (Senbabaoglu, Y. et al)

Load function used to apply GSVA method.

Apply GSVA to TCGA data. This step should run on a machine with big memory.

Calculate TIS and IIS.

View this table (head 100 rows).

save GSVA scores of APM and immune cell types, TIS and IIS.

TCGA mutation download and clean

We downloaded TCGA somatic mutations (data source: TCGA MC3) via TCGAmutations package.

TCGAmutations is an R data package containing somatic mutations from TCGA cohorts. This is particularly useful for those working with mutation data from TCGA studies - where most of the time is spent on searching various databases, downloading, compiling and tidying up the data before even the actual analysis is started. This package tries to mitigate the issue by providing pre-compiled, curated somatic mutations from 33 TCGA cohorts along with relevant clinical information for all sequenced samples.

Load mutations of TCGA studies and merge them into one.

GEO datasets download and processing

We also tried our best to collect gene expression data to extend APM score (APS), IIS etc. calculation for more cancer types by searching NCBI GEO databases. We only focus on looking for cancer types which showed in list of our immunotherapy clinical studies and not showed in TCGA studies. Finally, we found 5 GEO datasets which can extract gene expression data of Merkel Cell Carcinoma, Small Cell Lung Cancer and Cutaneous Squamous Carcinoma. Download of these datasets need R package GEOquery is installed. Following code showed how we downloaded and carefully processed them.

#----------------------------------------------------------------
# Purpose:
# Add APM data from GEO for some tumor types
# Include Merkel Cell Carcinoma and Small Cell Lung Cancer
#         and Cutaneous Squamous Carcinoma
#----------------------------------------------------------------
library(GEOquery)
library(tidyverse)

if (!dir.exists("../data/GEOdata")) {
  dir.create("../data/GEOdata")
}

geo_dir <- "../data/GEOdata"

GSE_39612 <- getGEO("GSE39612", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)
GSE_22396 <- getGEO("GSE22396", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)
GSE_36150 <- getGEO("GSE36150", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)
GSE_50451 <- getGEO("GSE50451", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)
GSE_99316 <- getGEO("GSE99316", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)

exprs(GSE_22396$GSE22396_series_matrix.txt.gz) %>% nrow()
pData(GSE_22396$GSE22396_series_matrix.txt.gz)
fData(GSE_22396$GSE22396_series_matrix.txt.gz) %>% nrow()

#---------------------------------------------------------
# process GSE_39612
gset1 <- GSE_39612$GSE39612_series_matrix.txt.gz
View(pData(gset1))

gset1_mcc <- gset1[, grep("MCC", gset1$title)]
gset1_scc <- gset1[, grep("SCC", gset1$title)]

rm(gset1)
# process GSE_22396
gset2 <- GSE_22396$GSE22396_series_matrix.txt.gz
View(pData(gset2))

# process GSE_36150
gset3 <- GSE_36150$GSE36150_series_matrix.txt.gz
View(pData(gset3))

# process GSE_50451
gset4_1 <- GSE_50451$`GSE50451-GPL570_series_matrix.txt.gz`
gset4_2 <- GSE_50451$`GSE50451-GPL571_series_matrix.txt.gz`

View(pData(gset4_1))
View(pData(gset4_2))
# gset4_1 all cell_lines, not use it
#
gset4_mcc <- gset4_2[, grep("MCC tumor", gset4_2$source_name_ch1)]
gset4_sclc <- gset4_2[, grep("SCLC tumor", gset4_2$source_name_ch1)]

rm(gset4_1, gset4_2)
# process GSE_99316
gset5_1 <- GSE_99316$`GSE99316-GPL10999_series_matrix.txt.gz`
gset5_2 <- GSE_99316$`GSE99316-GPL570_series_matrix.txt.gz`
gset5_3 <- GSE_99316$`GSE99316-GPL96_series_matrix.txt.gz`
gset5_4 <- GSE_99316$`GSE99316-GPL97_series_matrix.txt.gz`

View(pData(gset5_1))
View(pData(gset5_2))
View(pData(gset5_3))
View(pData(gset5_4))

# only gset5_2 has data we need
gset5_sclc <- gset5_2[, grep("SCLC", gset5_2$title)]

rm(gset5_1, gset5_2, gset5_3, gset5_4)

#-------------------------------------------------
# Now summary data
merkel_data <- list(gset1_mcc, gset2, gset3, gset4_mcc)
scc_data <- list(gset1_scc)
sclc_data <- list(gset4_sclc, gset5_sclc)

#--------------------
# apply GSVA method
load("results/merged_geneList.RData")

# transform exprset to tibble
genTibbleList <- function(gsetList) {
  stopifnot(is.list(gsetList), require("Biobase"), require("tidyverse"), class(gsetList[[1]]) == "ExpressionSet")

  res <- list()
  i <- 1
  for (gset in gsetList) {
    eset <- exprs(gset)
    # find gene symbol column
    fdata <- fData(gset)
    symbol_col <- grep("^gene.?symbol", colnames(fdata), value = TRUE, ignore.case = TRUE)

    if (length(symbol_col) == 0) {
      message("Find nothing about gene symbol in fData, try search it...")
      symbol_col2 <- grep("^gene_assignment", colnames(fdata), value = TRUE, ignore.case = TRUE)
      message("find ", symbol_col2)

      message("processing...")
      strlist <- strsplit(fdata[, symbol_col2], split = " // ")
      rowname <- sapply(strlist, function(x) trimws(x[2]))
      rownames(eset) <- rowname

      # stop("Something wrong with your fData of input List, please check it")
    }
    if (length(symbol_col) > 1) {
      warning("Multiple columns of fData match gene symbol, only use the first one")
      symbol_col <- symbol_col[1]
      rownames(eset) <- fdata[, symbol_col]
    } else if (length(symbol_col) == 1){
      rownames(eset) <- fdata[, symbol_col]
    }

    # remove duplicate rows, keep the one with biggest mean value
    eset %>%
      as.data.frame() %>%
      rownames_to_column() %>%
      mutate(
        mean_expr = rowMeans(.[, -1], na.rm = TRUE),
        rowname = sub("^(\\w+)\\..*", "\\1", rowname)
      ) %>%
      arrange(rowname, desc(mean_expr)) %>%
      distinct(rowname, .keep_all = TRUE) %>%
      select(-mean_expr) %>%
      as.tibble() -> res[[i]]

    i <- i + 1
  }
  return(res)
}

# apply GSVA method
applyGSVA <- function(group_df, group_col, gene_col, ExprMatList,
                      method = c("ssgsea", "gsva", "zscore", "plage"),
                      kcdf = c("Gaussian", "Poisson")) {
  stopifnot(inherits(group_df, "tbl_df") &
    inherits(group_col, "character") &
    inherits(gene_col, "character") &
    inherits(ExprMatList, "list"))
  if (!require(GSVA)) {
    stop("GSVA package need to be installed!")
  }

  method <- match.arg(method)
  kcdf <- match.arg(kcdf)

  require(dplyr)

  i <- 1
  resList <- list()
  groups <- names(table(group_df[, group_col]))
  gset_list <- lapply(groups, function(x) {
    group_df[group_df[, group_col] == x, gene_col] %>%
      unlist() %>%
      as.character()
  })

  names(gset_list) <- groups

  for (expr_mat in ExprMatList) {
    if (!inherits(expr_mat, "tbl_df")) {
      stop("element of ExprMatList should be tibble!")
    }
    expr_mat <- as.data.frame(expr_mat)
    rownames(expr_mat) <- expr_mat[, 1]
    expr_mat <- expr_mat[, -1] %>% as.matrix()

    res <- gsva(expr = expr_mat, gset.idx.list = gset_list, method = method, kcdf = kcdf)
    res <- as.data.frame(t(res))
    colnames(res)[1] <- "tsb"
    resList[[i]] <- res
    names(resList)[i] <- names(ExprMatList)[i]
    i <- i + 1
  }
  return(resList)
}


## scc
tibble.scc <- genTibbleList(scc_data)
gsva.scc <- applyGSVA(merged_geneList,
  group_col = "Cell_type",
  gene_col = "Symbol", ExprMatList = tibble.scc, method = "gsva"
)
## sclc
tibble.sclc <- genTibbleList(sclc_data)
gsva.sclc <- applyGSVA(merged_geneList,
  group_col = "Cell_type",
  gene_col = "Symbol", ExprMatList = tibble.sclc, method = "gsva"
)

## merkel1
tibble.merkel <- genTibbleList(merkel_data)

gsva.merkel <- applyGSVA(merged_geneList,
  group_col = "Cell_type",
  gene_col = "Symbol", ExprMatList = tibble.merkel, method = "gsva"
)

save(gsva.scc, gsva.sclc, gsva.merkel, file = "results/Add_gsva_scc_sclc_merkel.RData")

Immunotherapy clinical studies and genomics datasets

Immunotherapy clinical studies

Collecting response rate of immunotherapy clinical studies was inspired by the paper Tumor Mutational Burden and Response Rate to PD-1 inhibition which published on journal NEJM, this paper showed data from about 50 stuides. However, detail values of response rate the authors collected were not published. We followed their search strategy and tried our best to find all clinical studies which recorded response rate. Totally, we reviewed abstract of over 100 clinical studies, collected response rate values, then carefully filtered them based on standard we set, selected the most respresenting data (from about 60 studies) for downstream analysis. More detail of this procedure please see Method section of our manuscript. The data we collected for analysi are double checked and open to readers as a supplementary table.

Immunotherapy genomics datasets

To evaluate the predictive power of TIGS in ICI clinical response prediction, we searched PubMed for ICI clinical studies with available individual patient’s TMB and gene transcriptome information. In total three datasets are identified after this search. Van Allen et al 2015 dataset was downloaded from supplementary files of reference. This dataset studied CTLA-4 blockade in metastatic melanoma, and “clinical benefit” was defined using a composite end point of complete response or partial response to CTLA-4 blockade by RECIST criteria or stable disease by RECIST criteria with overall survival greater than 1 year, “no clinical benefit” was defined as progressive disease by RECIST criteria or stable disease with overall survival less than 1 year. Hugo et al 2016 dataset was downloaded from supplementary files of reference. This dataset studied Anti-PD-1therapy in metastatic melanoma, responding tumors were derived from patients who have complete or partial responses or stable disease in response to anti-PD-1 therapy, non-responding tumors were derived from patients who had progressive disease. Snyder et al dataset was downloaded from https://github.com/hammerlab/multi-omic-urothelial-anti-pdl1. This dataset studied PD-L1 blockade in urothelial cancer, and durable clinical benefit was defined as progression-free survival >6 months. RNA-Seq data was used to calculate the APS for each patient. Only patients with both APS and TMB value were used to calculate the TIGS. Median of TMB or TIGS was used as the threshold to separate TMB High/Low group or TIGS High/Low group in Kaplan-Meier overall survival curve analysis.

Detail of how to process these 3 datasets will be described at analysis part.

TCGA Pan-cancer analyses

This part will clearly describe how to analyze TCGA Pan-cancer data. Raw data used for TCGA pancan analyses have been preprocessed, detail please read preprocessing part of this analysis report.

Clean and combine data

Although data have been preprocessed in preprocessing part, they are still necessary to do some cleaning before really analyzing them according to our purpose.

Only keep tumor samples, and filter sample which sample type is “0” or “X”. Number of samples with these two sample type are very few.

Totally, 9095 tumor samples have APS value.

Strong association between APS and immune cell infiltration level

We know that genes used for APS calculation does not overlap with genes of immune cell type, but we don’t know if there are association between APS and them. Besides, we also don’t know if there are association between APS and two aggregate scores: TIS and IIS.

UPDATE: Also, we also want to compare them with APS7 and PHBR scores from MHC-I Genotype Restricts the Oncogenic Mutational Landscape and Evolutionary Pressure against MHC Class II Binding Cancer Mutations. Therefore, we integrate them all before analyzing.

Combine them.

Association between GSVA scores and other presentation scores

Here we explore association across APS, APS7, IIS, PHBR_I, PHBR_II.

First we check pan-cancer level association.

Show heatmap.

Second we check cancer type specific level association.

Here we get 32 correlation matrixes.

Output the result pheatmaps to PDF is a good option to observe results.

Association between subsets of immune cells and APS

Next we answer the question: which cells (CIBERSORT scores) or subsets of the IIS scores are most associated with the APS scores.

The first step is to calculate correlation using spearman method.

Show plot.

This shows that there is strong positive/consistent relationship between APM and immune/T cell infiltration level across TCGA cancer types. These spearman correlation coefficient are showed as a table.

Thanks to TCGA, we can obtain CIBERSORT results for TCGA patients from paper The Immune Landscape of Cancer.

Combine APM (score) and CIBERSORT information, calculate correlation and plot heatmap.

Indeed, APM score correlate with T cell infiltration.

However, we see PHBR_I, PHBR_II does not have relationship with either APS or IIS, does this score correlate with immune cell infiltration?

Let’s check it.

Do the same analysis for PHBR_II score.

IIS is a good representation of immune cell infiltration

Immune/T cell infiltration is a good indicator for ICB response. Here we have seen that APS have strong association with both TIS and IIS. In indivial level, i.e. 9095 samples, we also observe that TIS and IIS are basically same because their correlation coeficient is about 0.91!

Therefore, we should choose one of them for downstream analysis. Finally, we choose IIS not only because it is more comprehensive, but also it is well validated.

  • In vitro validation with multiplex immunofluorescence, in silico validation using simulated mixing proportions and comparison between CIBERSORT and IIS have been previously described (Senbabaoglu, Y. et al).

TIMER is another method that can accurately resolve relative fractions of diverse cell types based on gene expression profiles from complex tissues. To further validate the calculated IIS, we perform TIMER analysis and find that the result of TIMER is highly correlated with the calculated IIS.

The color bar in figure shows correlation coefficient values. The “X” marks relationship does not pass significant test (i.e. p>=0.05).

Even Senbabaoglu et al have checked the relationship between IIS and CIBERSORT in clear cell renal cell carcinoma. Here we expand it to all TCGA cancer type with CIBERSORT data and IIS avaiable.

Compare with timer, the relationship between IIS and immune cell subsets is more variable. I think this can be explained by several reasons:

  • CIBERSORT and TIMER target different aspects of tumor immune infiltrates. CIBERSORT infers the relative fractions of immune subsets in the total leukocyte population, while TIMER predicts the abundance of immune cells in the overall tumor microenvironment. Currently both methods are limited by the assumption that transcriptomes of tumor-infiltrating immune cells do not significantly differ from those collected from peripheral blood of healthy donors. (This is from Revisit linear regression-based deconvolution methods for tumor gene expression data)
  • Many cibersort values are equal to 0 or close to 0, this reduce the power to calculate correlation
  • CIBERSORT predicts 22 immune cell subsets while TIMER obtains 6 cell types, it is not unexpected to observe a more variable result using CIBERSORT.

Exploration of APS, TMB, TIGS at pan-cancer level

Load TCGA TMB data and merge all necessary datasets.

Tumor mutation burden (TMB) is defined as the number of non-synonymous alterations per megabase (Mb) of genome examined. As reported previously, here we use 38 Mb as the estimate of the exome size. For studies reporting mutation number from whole exome sequencing, the normalized TMB = (whole exome non-synonymous mutation)/(38 Mb).

Original APM scores (APS) from GSVA are in the range of -1 to 1. To calculate tumor immunogenicity score (TIGS), original APM score from GSVA implementation is rescaled by minimal and maximal APM score from TCGA Pan-cancer analysis.

We calculate tumor immunogenicity score (TIGS) as following:

\[ TIGS = APS_{normalized} \times log(TMB + 1) \]

Natural logarithm is applied here. Of note, some tumors have TMB level below 1 mutation/ Mb, to avoid minus number in quantifying “tumor antigenicity”, we add number 1 to all normalized TMB.

How we generate this TIGS formula will be described at an individual part. Here we focus on association of APS, TMB and TIGS and their effects.

Distribution of APS, TMB and IIS across TCGA studies

Calculate median values to sort distribution.

Distribution of APM score (APS) across TCGA studies.

Distribution of TMB across TCGA studies.

Distribution of TIGS across TCGA studies.

Correlation between TMB and IIS

Survival analysis

Check pan-cancer level influence of APS, TMB and TIGS using unvariable cox model.

We access survival effects of APS, TMB and TIGS across TCGA studies using unvariable cox model.

Firstly, we implement cox model and then obtain key result values from fit, i.e. p value and corresponding 95% confident interval.

library(survival)
# calculate APM cox model by project
model_APM <- df_os %>%
  filter(!is.na(nAPM)) %>%
  group_by(Project) %>%
  dplyr::do(coxfit = coxph(Surv(time = Time, event = Event) ~ nAPM, data = .)) %>%
  summarise(
    Project = Project,
    Coef = summary(coxfit)$conf.int[1],
    Lower = summary(coxfit)$conf.int[3],
    Upper = summary(coxfit)$conf.int[4],
    Pvalue = summary(coxfit)$logtest[3]
  )

# use nTMB or log(nTMB)
model_TMB <- df_os %>%
  filter(!is.na(nTMB)) %>%
  group_by(Project) %>%
  dplyr::do(coxfit = coxph(Surv(time = Time, event = Event) ~ log(nTMB), data = .)) %>%
  summarise(
    Project = Project,
    Coef = summary(coxfit)$conf.int[1],
    Lower = summary(coxfit)$conf.int[3],
    Upper = summary(coxfit)$conf.int[4],
    Pvalue = summary(coxfit)$logtest[3]
  )

model_TIGS <- df_os %>%
  filter(!is.na(TIGS)) %>%
  group_by(Project) %>%
  dplyr::do(coxfit = coxph(Surv(time = Time, event = Event) ~ TIGS, data = .)) %>%
  summarise(
    Project = Project,
    Coef = summary(coxfit)$conf.int[1],
    Lower = summary(coxfit)$conf.int[3],
    Upper = summary(coxfit)$conf.int[4],
    Pvalue = summary(coxfit)$logtest[3]
  )

N_APM <- df_os %>%
  filter(!is.na(nAPM)) %>%
  group_by(Project) %>%
  summarise(N = n())

N_TMB <- df_os %>%
  filter(!is.na(nTMB)) %>%
  group_by(Project) %>%
  summarise(N = n())

N_TIGS <- df_os %>%
  filter(!is.na(TIGS)) %>%
  group_by(Project) %>%
  summarise(N = n())


cox_APM <- full_join(model_APM, N_APM)
#> Joining, by = "Project"
cox_TMB <- full_join(model_TMB, N_TMB)
#> Joining, by = "Project"
cox_TIGS <- full_join(model_TIGS, N_TIGS)
#> Joining, by = "Project"


save(cox_APM, cox_TMB, cox_TIGS, df_summary, file = "results/unicox.RData")

UPDATE: Adjust p values and order them by median APM/TMB/TIGS.

Secondly, we generate forest plot.

APS, TIGS and B2M mutation

Obtain B2M mutation status.

Focus on tructing mutation https://www.nejm.org/doi/10.1056/NEJMoa1604958?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%3dwww.ncbi.nlm.nih.gov

Check gene expression in these mutant samples.

Therefore, APS/TIGS cannot capture whether B2M lose function or not. However, we do observe that B2M mutation upgrade APS/TIGS, this may be explained by compensatory expression of APS gene signature.

Gene sets enriched in patients with high APM score

To identify the specific gene sets/pathway associated with high APS, we firstly run differential gene expression analysis for each TCGA cancer type based on APS status. Patients with APS of first quartile was defined as “APS-High”, patients with APS of the forth quartile was defined as “APS-Low”. Genes with p value <0.01 and FDR <0.05 were ranked by logFC from top to bottom and then inputted into GSEA function of R package clusterProfiler with custom gene sets download from Molecular Signature Database v6.2. Normalized enrichment score (NES) was used to rank the differentially enriched gene sets. In results from hallmark gene sets, several gene signatures (especially interferon alpha/gamma response) were found to be enriched in most TCGA cancer types with high APS, suggesting high APS are strongly associated with interferon alpha/gamma signaling pathway.

The code below is runned on linux server (need much time), thus if reader wanna reproduce it, you may need to download gene sets from Molecular Signature Database v6.2 and modify some file path.

rm(list = ls())
#----------------------------------------
# APM DEG and Pathway Enrichment Analysis
#----------------------------------------
library(tidyverse)

load("results/TCGA_RNASeq_PanCancer.RData")
load("results/gsva_tcga_pancan.RData")
load("results/TCGA_tidy_Clinical.RData")

df.gsva <- full_join(TCGA_Clinical.tidy, gsva.pac, by = c("Tumor_Sample_Barcode" = "tsb"))

tcga_info <- df.gsva
rm(df.gsea, df.gsva)
gc()

tcga_info <- tcga_info %>%
  select(Project:OS.time, Age, Gender, sample_type, Tumor_stage, APM) %>%
  filter(!is.na(APM))

projects <- unique(tcga_info[, "Project"])
samples <- colnames(RNASeq_pancan)[-1]

#-----------------------------------
# Build a workflow to calculate DEGs
#-----------------------------------
findDEGs <- function(info_df = NULL, expr_df = NULL, col_sample = "Tumor_Sample_Barcode", col_group = "APM",
                     col_subset = "Project", method = "limma", threshold = 0.25,
                     save = FALSE, filename = NULL) {
  stopifnot(!is.null(info_df), !is.null(expr_df))
  stopifnot(threshold >= 0 & threshold <= 1)

  if (!require(data.table)) {
    install.packages("data.table", dependencies = TRUE)
  }

  info_df <- setDT(info_df[, c(col_sample, col_group, col_subset)])
  colnames(info_df) <- c("sample", "groupV", "subset")
  info_df <- info_df[sample %in% colnames(expr_df)]
  info_df <- info_df[!is.na(subset)]
  all_sets <- unique(info_df[, subset])


  if ("DEGroup" %in% colnames(info_df)) {
    stop("DEGroup column exists, please rename and re-run.")
  }

  # set threshold
  th1 <- threshold
  th2 <- 1 - th1

  info_df[, DEGroup := ifelse(groupV < quantile(groupV, th1), "Low",
    ifelse(groupV > quantile(groupV, th2), "High", NA)
  ), by = subset]
  info_df <- info_df[!is.na(DEGroup)]
  sets <- unique(info_df[, subset])
  may_del <- setdiff(all_sets, sets)
  if (length(may_del) != 0) {
    message("Following groups has been filtered out because of threshold setting, you better check")
    print(may_del)
  }

  info_list <- lapply(sets, function(x) info_df[subset == x])
  names(info_list) <- sets

  col1 <- colnames(expr_df)[1]


  options(digits = 4)

  doDEG <- function(input, method = NULL) {
    #-- prepare
    exprSet <- expr_df[, c(col1, input$sample)]
    exprSet <- as.data.frame(exprSet)
    exprSet <- na.omit(exprSet)
    input <- as.data.frame(input)
    rownames(exprSet) <- exprSet[, 1]
    exprSet <- exprSet[, -1]

    sample_tb <- table(input$DEGroup)

    #--- make sure have some samples
    if (!length(sample_tb) < 2 & all(sample_tb >= 5)) {
      group_list <- input$DEGroup

      if ("limma" %in% method) {
        suppressMessages(library(limma))
        design <- model.matrix(~ 0 + factor(group_list))
        colnames(design) <- c("High", "Low")
        cont.matrix <- makeContrasts("High-Low", levels = design)

        fit <- lmFit(exprSet, design)
        fit2 <- contrasts.fit(fit, cont.matrix)
        fit2 <- eBayes(fit2)
        topTable(fit2, number = Inf, adjust.method = "BH")
      }
    }
  }

  res <- lapply(info_list, doDEG, method = method)
  names(res) <- sets
  return(res)
}

# threshold = 0.25 is used in manuscript
# DEG_pancan2 = findDEGs(info_df = tcga_info, expr_df = RNASeq_pancan)
# Here we use threshold = 0.5 to do this analysis to respond
# reviewer's comment
DEG_pancan2 <- findDEGs(info_df = tcga_info, expr_df = RNASeq_pancan, threshold = 0.5)
save(DEG_pancan2, file = "results/DEG_pancan.RData")

#----------------------------------------------------
# Pathway enrichment analysis using clusterProfiler
#--------------------------------------------------
load(file = "results/DEG_pancan.RData")
library(clusterProfiler)
library(tidyverse)
library(openxlsx)

#---------------

## - setting
pvalue <- 0.01
adj.pvalue <- 0.05

## - process

#------- Reading GSEA genesets files
hallmark <- read.gmt("/public/data/VM_backup/biodata/MsigDB/h.all.v6.2.symbols.gmt")
c1 <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c1.all.v6.2.symbols.gmt")
c2_kegg <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c2.cp.kegg.v6.2.symbols.gmt")
c2_reactome <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c2.cp.reactome.v6.2.symbols.gmt")
c3 <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c3.all.v6.2.symbols.gmt")
c4 <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c4.all.v6.2.symbols.gmt")
c5_mf <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c5.mf.v6.2.symbols.gmt")
c5_bp <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c5.bp.v6.2.symbols.gmt")
c6 <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c6.all.v6.2.symbols.gmt")
c7 <- read.gmt("/public/data/VM_backup/biodata/MsigDB/c7.all.v6.2.symbols.gmt")
#-=---------

goGSEA <- function(DEG, prefix = NULL, pvalue = 0.01, adj.pvalue = 0.05, destdir = "~/projects/tumor-immunogenicity-score/report/results/GSEA_results") {
  library(clusterProfiler)
  library(openxlsx)
  library(tidyverse)

  filterDEG <- subset(DEG, subset = P.Value < pvalue & adj.P.Val < adj.pvalue)
  filterDEG$SYMBOL <- rownames(filterDEG)
  filterDEG <- filterDEG %>%
    arrange(desc(logFC), adj.P.Val)
  geneList <- filterDEG$logFC
  names(geneList) <- filterDEG$SYMBOL

  res <- list()
  if (base::exists("hallmark")) res$hallmark <- GSEA(geneList, TERM2GENE = hallmark, verbose = FALSE)
  # if (base::exists("c1")) res$c1 = GSEA(geneList, TERM2GENE=c1, verbose=FALSE)
  if (base::exists("c2_kegg")) res$c2_kegg <- GSEA(geneList, TERM2GENE = c2_kegg, verbose = FALSE)
  if (base::exists("c2_reactome")) res$c2_reactome <- GSEA(geneList, TERM2GENE = c2_reactome, verbose = FALSE)
  # if (base::exists("c3")) res$c3 = GSEA(geneList, TERM2GENE=c3, verbose=FALSE)
  # if (base::exists("c4")) res$c4 = GSEA(geneList, TERM2GENE=c4, verbose=FALSE)
  if (base::exists("c5_mf")) res$c5_mf <- GSEA(geneList, TERM2GENE = c5_mf, verbose = FALSE)
  if (base::exists("c5_bp")) res$c5_bp <- GSEA(geneList, TERM2GENE = c5_bp, verbose = FALSE)
  if (base::exists("c6")) res$c6 <- GSEA(geneList, TERM2GENE = c6, verbose = FALSE)
  if (base::exists("c7")) res$c7 <- GSEA(geneList, TERM2GENE = c7, verbose = FALSE)

  getResultList <- lapply(res, function(x) x@result)
  if (!dir.exists(destdir)) dir.create(destdir)
  outpath <- file.path(destdir, prefix)
  write.xlsx(x = getResultList, file = paste0(outpath, ".xlsx"))
  return(res)
}


# remove STAD, CHOL, KICH and DLBC
DEG_pancan2 <- DEG_pancan2[setdiff(names(DEG_pancan2), c("STAD", "CHOL", "KICH", "DLBC"))]
GSEA_list <- Map(goGSEA, DEG_pancan2, names(DEG_pancan2))
save(GSEA_list, file = "results/GSEA_results_rm_notEnriched.RData")

#------- GSEA example plot
# load("results/GSEA_results.RData")
load("results/GSEA_results_rm_notEnriched.RData")
library(clusterProfiler)

gseaplot(GSEA_list$SKCM$hallmark, geneSetID = "HALLMARK_INTERFERON_GAMMA_RESPONSE")

# gseaplot(egmt2, geneSetID = "INTEGRAL_TO_PLASMA_MEMBRANE")
# gseaplot(res$hallmark, geneSetID = "HALLMARK_INTERFERON_GAMMA_RESPONSE")

#------ get main result cloumn
gsea_results <- lapply(GSEA_list, function(x) {
  # x@result %>% select()
  lapply(x, function(gsea) {
    gsea@result %>% dplyr::select(ID, setSize, enrichmentScore, NES, pvalue, qvalues)
  })
})

save(gsea_results, file = "results/GSEA_results_notEnriched_small.RData")
rm(GSEA_list)
gc()

#----- plot
summariseGSEA <- function(pathway = NULL) {
  purrr::reduce(Map(function(x, y, pathway) {
    if (nrow(x[[pathway]]) == 0) {
      res <- NULL
    } else {
      res <- x[[pathway]]
      res$Project <- y
    }
    res
  }, gsea_results, names(gsea_results), pathway), rbind)
}

gsea_summary <- list()
gsea_summary$hallmark <- summariseGSEA(pathway = "hallmark")
# gsea_summary$c2_kegg = summariseGSEA(pathway = "c2_kegg")
gsea_summary$c2_reactome <- summariseGSEA(pathway = "c2_reactome")
gsea_summary$c5_mf <- summariseGSEA(pathway = "c5_mf")
gsea_summary$c5_bp <- summariseGSEA(pathway = "c5_bp")
gsea_summary$c6 <- summariseGSEA(pathway = "c6")
gsea_summary$c7 <- summariseGSEA(pathway = "c7")

save(gsea_summary, file = "results/GSEA_summary_notEnriched.RData")

## plot really
load("../data/df_combine_gsva_clinical.RData")

df.gsva %>%
  filter(!is.na(APM)) %>%
  group_by(Project) %>%
  summarize(MedianAPM = median(APM), N = n()) %>%
  arrange(MedianAPM) -> sortAPM
sortAPM %>% filter(Project != "DLBC") -> sortAPM

library(scales)
gsea_summary$hallmark %>%
  ggplot(mapping = aes(x = Project, y = ID)) +
  geom_tile(aes(fill = NES)) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red") +
  scale_x_discrete(limits = sortAPM$Project) + theme_classic() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

# df_wide = reshape2::dcast(gsea_summary$hallmark, ID ~ Project, value.var="NES", fill = 0)

library(pheatmap)

plotPathway <- function(df, save = FALSE, path = NULL, silent = FALSE) {
  df_wide <- reshape2::dcast(df, ID ~ Project, value.var = "NES", fill = 0)

  breaksList <- seq(-max(df_wide[, -1], na.rm = TRUE), max(df_wide[, -1], na.rm = TRUE), by = 0.01)
  pheatmap(df_wide %>% column_to_rownames(var = "ID"),
    color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
    breaks = breaksList,
    # annotation_col = annotation_col,
    fontsize_row = 8, fontsize_col = 8, silent = silent,
    cellheight = 10, cellwidth = 10, filename = ifelse(save == FALSE, NA, path)
  )
}

gg <- plotPathway(gsea_summary$hallmark, silent = F, path = "results/GSEA_results_plot/Hallmark_Enriched.pdf", save = TRUE)
# gg = plotPathway(gsea_summary$c2_kegg, silent = F, path = "results/GSEA_results_plot/KEGG_Enriched.pdf", save = TRUE)
gg <- plotPathway(gsea_summary$c2_reactome, silent = F, path = "results/GSEA_results_plot/Reactome_Enriched.pdf", save = TRUE)
gg <- plotPathway(gsea_summary$c5_mf, silent = F, path = "results/GSEA_results_plot/GO_MolecuFunction_Enriched.pdf", save = TRUE)
gg <- plotPathway(gsea_summary$c5_bp, silent = F, path = "results/GSEA_results_plot/GO_BiologicalProcess_Enriched.pdf", save = TRUE)
gg <- plotPathway(gsea_summary$c6, silent = F, path = "results/GSEA_results_plot/C6_oncogenicsignatures_Enriched.pdf", save = TRUE)
gg <- plotPathway(gsea_summary$c7, silent = F, path = "results/GSEA_results_plot/C7_immunologicsignatures_Enriched.pdf", save = TRUE)

# gg


# _-------- rank pheatmap by NES value
plotPathway2 <- function(df, save = FALSE, path = NULL, silent = FALSE) {
  df_wide <- reshape2::dcast(df, ID ~ Project, value.var = "NES", fill = 0)
  df_wide <- df_wide[order(rowMeans(df_wide[, -1]), decreasing = TRUE), ]
  rownames(df_wide) <- NULL

  breaksList <- seq(-max(df_wide[, -1], na.rm = TRUE), max(df_wide[, -1], na.rm = TRUE), by = 0.01)

  pheatmap(df_wide %>% column_to_rownames(var = "ID"),
    color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
    breaks = breaksList, cluster_rows = FALSE,
    # annotation_col = annotation_col,
    fontsize_row = 8, fontsize_col = 8, silent = silent,
    cellheight = 10, cellwidth = 10, filename = ifelse(save == FALSE, NA, path)
  )
}

gg <- plotPathway2(gsea_summary$c2_reactome, silent = F, path = "results/GSEA_results_plot/Reactome_Enriched2.pdf", save = TRUE)

Session Info

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS:   /home/public/R/R-base/lib64/R/lib/libRblas.so
#> LAPACK: /home/public/R/R-base/lib64/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] grid      stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] forestplot_1.9      checkmate_1.9.4     survival_3.1-7     
#>  [4] ggpubr_0.2.3        magrittr_1.5        scales_1.0.0       
#>  [7] corrplot_0.84       pheatmap_1.0.12     forcats_0.4.0      
#> [10] stringr_1.4.0       purrr_0.3.3         tidyr_1.0.0        
#> [13] tibble_2.1.3        ggplot2_3.2.1       tidyverse_1.2.1    
#> [16] readr_1.3.1         dplyr_0.8.3         UCSCXenaTools_1.2.7
#> [19] rmdformats_0.3.5    knitr_1.26          pacman_0.5.1       
#> 
#> loaded via a namespace (and not attached):
#>  [1] utf8_1.1.4             questionr_0.7.0        R.utils_2.9.0         
#>  [4] tidyselect_0.2.5       lme4_1.1-21            robust_0.4-18.1       
#>  [7] htmlwidgets_1.5.1      munsell_0.5.0          codetools_0.2-16      
#> [10] DT_0.10                future_1.15.0          miniUI_0.1.1.1        
#> [13] withr_2.1.2            Brobdingnag_1.2-6      metaBMA_0.6.2         
#> [16] colorspace_1.4-1       highr_0.8              rstudioapi_0.10       
#> [19] stats4_3.6.0           DescTools_0.99.30      robustbase_0.93-5     
#> [22] rcompanion_2.3.7       ggsignif_0.6.0         listenv_0.7.0         
#> [25] labeling_0.3           emmeans_1.4.2          rstan_2.19.2          
#> [28] mnormt_1.5-5           MCMCpack_1.4-4         bridgesampling_0.7-2  
#> [31] coda_0.19-3            vctrs_0.2.0            generics_0.0.2        
#> [34] TH.data_1.0-10         metafor_2.1-0          xfun_0.11             
#> [37] R6_2.4.0               BayesFactor_0.9.12-4.2 reshape_0.8.8         
#> [40] logspline_2.1.15       assertthat_0.2.1       promises_1.1.0        
#> [43] multcomp_1.4-10        ggExtra_0.9            gtable_0.3.0          
#> [46] multcompView_0.1-7     globals_0.12.4         processx_3.4.1        
#> [49] mcmc_0.9-6             sandwich_2.5-1         rlang_0.4.1           
#> [52] MatrixModels_0.4-1     EMT_1.1                zeallot_0.1.0         
#> [55] splines_3.6.0          TMB_1.7.15             lazyeval_0.2.2        
#> [58] broom_0.5.2            inline_0.3.15          abind_1.4-5           
#> [61] yaml_2.2.0             reshape2_1.4.3         modelr_0.1.5          
#> [64] crosstalk_1.0.0        backports_1.1.5        httpuv_1.5.2          
#> [67] tools_3.6.0            bookdown_0.15          psych_1.8.12          
#> [70] ellipsis_0.3.0         RColorBrewer_1.1-2     WRS2_1.0-0            
#> [73] ez_4.4-0               Rcpp_1.0.3             plyr_1.8.4            
#>  [ reached getOption("max.print") -- omitted 98 entries ]

Immunotherapy datasets analyses

This part will clearly describe how to analyze APS, TMB and TIGS in immunotherapy datasets, including cancer type level and individual level. Raw data used for these analyses have been preprocessed, detail please read preprocessing part or TCGA pan-cancer analyses of this analysis report.

TIGS and pan-cancer objective response rate to PD-1 inhibition

Previous study has shown that TMB could predict pan-cancer ICI objective response rates (ORR). Here we will evaluate and compare the predictive power of APS, TIGS with TMB in pan-cancer ICI objective response rates prediction. The ORR for anti–PD-1 or anti–PD-L1 therapy will be plotted against the corresponding median APS, TIGS, TMB across multiple cancer types.

Through an extensive literature search, we identified 26 tumor types or subtypes for which data regarding the ORR are available. For each tumor type, we pooled the response data from the largest published studies that evaluated the ORR. We included only studies of anti–PD-1 or anti–PD-L1 monotherapy that enrolled at least 10 patients who were not selected for PD-L1 tumor expression (Identified individual studies and references are available in the manuscript Supplementary Table 3). The median tumor mutational burden for each tumor type was obtained from a validated comprehensive genomic profiling assay performed and provided by Foundation Medicine. The APS information for 23 tumor types were calculated based on TCGA datasets, and the APS for merkel cell carcinoma, cutaneous squamous cell carcinoma and small-cell lung cancer were calculated based on GEO microarray datasets.

Combination of datasets

In code implementation, we firstly combine APM score, TIGS score from TCGA studies and GEO datasets.

Using the combined dataset df_all, we can normalize APM score to [0, 1] region, then we calculate median APM score for each tumor type.

Lastly, we merge this data to 25 tumor types or subtypes for which data regarding the ORR, TMB are available by hand using Excel. This process has been double checked. With the merged data, we can explore association of ORR and APM score, ORR and TMB etc. in immunotherapy datasets.

Significant correlation between APS, TMB and the ORR

We plot ORR against APS, TMB respectively, and fit their relationship with linear model. We observe that significant correlation between APS, TMB and the ORR.

Load pan-cancer ORR data and show it as a table.

DLBCL has high APM and TIGS, yet the response rate in Figure 4 of over 50% is incorrect. This 50%+ response rate is for Hodgkin lymphoma, not DLBCL.

We did mix the DLBC and Hodgkin lymphoma, thus we remove it and update our analyses.

Focus on cancer type with APS, fit data with linear model.

Of note, here we use log(TMB) will not change the relationshipe between ORR and TMB.

We observe that 17.5% variance of ORR can be explained by APS, and 50.5% variance of ORR can be explained by TMB. The result of TMB is similar to NEJM paper Tumor Mutational Burden and Response Rate to PD-1 inhibition. This result show that TMB and APS are biomarker that predicting ORR of immunotherapy, TMB is better than APS.

Next we plot the linear model.

TIGS definition and performance

From the aspect of biological process, TMB release is independent from processing and presentation of tumor antigens. Now that we have found TMB and APM are biomarkers which can predict object reponse rate of cancer type in immunotherapy, these two factors are independent, and are both required for tumor immunogenicity determination, why not integrate them as a new biomarker which should be better than TMB?

Therefore, theoretically, tumor immunogenicity can be represented as

\[ [“Tumor \quad antigenicity”] \times [“Antigen \quad processing \quad and \quad presenting \quad status”]. \]

We use log(TMB) here because log(TMB) has been proved to have a linear relationship with ORR. So the formula can be written as

\[ TIGS = APS_{normalized} \times log(TMB) \]

However, some tumors have TMB level below 1 mutation/ Mb, to avoid minus number in quantifying “tumor antigenicity”, we add number 1 to all normalized TMB. So in this case, the TIGS formula is

\[ TIGS = APS_{normalized} \times log(TMB + 1) \]

When TMB < 1 mutation/Mb, the log(TMB) would less than 0, if APS is very big, we would get a big minus number in this case, which against common sense.

Now we construct a linear model by integrate TMB and APM.

We observe that 60.1% variance of ORR can be explained by TIGS, this result is indeed bettern than TMB.

More exploration of TIGS formula

Would other combination of TMB and APS be better?

As mentioned above, we constructed the formula of TIGS according to our understanding of “tumor antigenicity”. Is there a better combination of these two factors?

There are two basic models we take into consideration.

  • Model 1: \(ORR = a \times (APS \times log(TMB)) + b\), i.e. \(TIGS = APS \times log(TMB)\) (This is the model we used above)
  • Model 2: \(ORR = e \times (a \times APS + b\times log(TMB) + c\times APS \times log(TMB)) + d\), i.e. \(TIGS = a \times APS + b\times log(TMB) + c\times APS \times log(TMB)\)

a, b, c, d and e here are coefficients. In R, we don’t need to write them in models, it will take care of them.

Now, let’s observe the results.

Model 2 is a little better than model 1, however model 1 is much simpler (all terms of model 1 is included in model 2!). Following Occam’s razor, we choose model 1 when compare it with model 2.

Any promotion of formula \(TIGS = APS \times log(TMB + 1)\)?

We still have a question about TIGS formula. The TMB in formula is calculated as

\[ TMB = ln(\frac{whole \quad exome \quad mutation \quad number}{38} + 1 ) \]

We choose nonsynonymous mutations because they have functional effects. Only functional effects on protein will generate neo-antigen (peptide).

We use 38 here because 38 MB is the estimated length of exome, so we can normalize TMB to per MB. However, in math model, there may be a better value “a” than 38. We don’t know yet, so we need to explore this “a” value.

We simulate “a” value from 0.1 to 10000 with step size 0.1 in the following formula, the resulted TIGS is then used to fit a linear model for pan-cancer ICI ORR prediction. We calculate differnt R Square value under each “a” and see how it changes.

\[ TIGS = APS_{normalized} \times ln(\frac{whole \quad exome \quad mutation \quad number}{a} + 1) \]

Implementation:

Plot the simulation result.

We can find that R squares of linear model have a maximum value when “a” is around 40. Therefore, “a=38” is optimized to rescale the raw whole exome mutation counts and balance the contribution for APM and TMB in TIGS formula.

Performance comparison on immunotherapy clinical response prediction and survival benefit

Description of datasets please see preprocessing part or “Method Section” of manuscript.

TMB is a well-known biomarker and has been applied. Considering our TIGS is based on TMB, thus the most important point we wanna do in this section is comparing the prediction/clinical perfermance between TMB and TIGS.

However, some other biomarkers can predict immunotherapy response have been studied by other groups. Therefore, this section we will include following biomarkers in comparsion.

  • APS - antigen processing and presenting efficiency score
  • TMB - tumor mutation burden
  • TIGS - tumor immunogenicity score
  • IIS - immune infiltration score
  • IFNG - interferon gamma response biomarkers of 6 genes including IFNG, STAT1, IDO1, CXCL10, CXCL9, and HLA-DRA
  • CD8 - gene expression level of CD8A + CD8B
  • PDL1 - an immunohistochemistry (IHC) biomarker approved by FDA. This study we use PD-L1 gene expression as the IHC surrogate
  • TIDE - TIDE signature which integrates the expression signatures of T cell dysfunction and T cell exclusion to model tumor immune evasion.
  • ISG.RS and IFNG.GS signatures described in Benci et al, Cell 2019
  • PHBR scores from MHC-I Genotype Restricts the Oncogenic Mutational Landscape and Evolutionary Pressure against MHC Class II Binding Cancer Mutations

The predicted values of gene expression biomarkers (i.e. IFNG, CD8, PDL1) are the average values among all members defined by the original publications. For ISG.RS and IFNG.GS, we follow the reference to calculate normlized score.

We will also try to combine these gene expression based signatures with TMB to see their performance.

Preprocessing of immunotherapy datasets

The first step is to preprocess 3 immunotherapy datasets with gene expression, TMB and clinical outcome available. APS from Pancan analysis is used to normalize APS here.

Hugo et al 2016 dataset

## Data source: Genomic and Transcriptomic Features of Response to Anti-PD-1 Therapy in Metastatic Melanoma. Cell 2016 Mar 24;165(1):35-44. PMID: 26997480
suppressPackageStartupMessages(library(GEOquery))
library(tidyverse)
library(readxl)
geo_dir <- "../data/GEOdata"

gse <- "GSE78220"
GSE_78220 <- getGEO(gse, GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)
GSE_78220 <- GSE_78220$GSE78220_series_matrix.txt.gz

#### Download gene expression using following command
# download.file(url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE78220&format=file&file=GSE78220%5FPatientFPKM%2Exlsx", destfile = "../data/GSE78220_FPKM.xlsx")

GSE78220_FPKM <- readxl::read_xlsx("../data/GSE78220_FPKM.xlsx")

GSE78220_FPKM <- GSE78220_FPKM %>%
  as.data.frame() %>%
  mutate(mean_expr = rowMeans(.[, -1], na.rm = TRUE)) %>%
  arrange(Gene, desc(mean_expr)) %>%
  distinct(Gene, .keep_all = TRUE) %>%
  dplyr::select(-mean_expr) %>%
  as.tibble()

# pData(GSE_78220) %>% View()

# normalize gene expression as log(x + 1)
GSE78220_Norm <- GSE78220_FPKM
GSE78220_Norm[, -1] <- log2(GSE78220_Norm[, -1] + 1)

# apply gsva
applyGSVA(merged_geneList,
  group_col = "Cell_type",
  gene_col = "Symbol", ExprMatList = list(GSE78220_Norm), method = "gsva"
) -> gsva.hugo2016

gsva.hugo2016 <- calc_TisIIs(gsva.hugo2016[[1]])

# normalize APS
APS_hugo <- gsva.hugo2016$APM
APS_hugo <- (APS_hugo - min_APS) / (max_APS - min_APS)

names(APS_hugo) <- colnames(GSE78220_Norm)[-1]
APS_hugo
names(APS_hugo) <- sub(pattern = "(Pt.*)\\.baseline", "\\1", names(APS_hugo))


# keep biggest value
APS_hugo <- APS_hugo[-21]
names(APS_hugo)[c(14, 20)] <- c("Pt16", "Pt27")
APS_hugo

# keep corresponding expression data for downstream analysis
rna_hugo <- GSE78220_Norm[, -22]
colnames(rna_hugo)[-1] <- names(APS_hugo)

if (!file.exists("results/Hugo2016_RNA.RData")) {
  save(rna_hugo, file = "results/Hugo2016_RNA.RData")
}

GSE78220.APS <- tibble(PatientID = names(APS_hugo), APS = APS_hugo, IIS = gsva.hugo2016$IIS[-21])

# generate results for random selected APS genes
set.seed(123456)
apm_gens <- merged_geneList %>% dplyr::filter(Cell_type == "APM")
sapply(1:100, function(i) {
  tempGens <- sample(GSE78220_Norm$Gene, 18)
  apm_gens$Symbol <- tempGens
  applyGSVA(apm_gens,
    group_col = "Cell_type",
    gene_col = "Symbol", ExprMatList = list(GSE78220_Norm), method = "gsva"
  )
}) -> gsva.hugo2016.random

gsva.hugo2016.random <- as.data.frame(sapply(gsva.hugo2016.random, function(x) {
  (x$APM - min_APS) / (max_APS - min_APS)
}))
colnames(gsva.hugo2016.random) <- paste0("APS", 1:100)
gsva.hugo2016.random <- gsva.hugo2016.random[-21, ]
rownames(gsva.hugo2016.random) <- names(APS_hugo)

# read clincal and tmb data
GSE_78220.TMB <- read_csv("../data/Cell2016.csv", col_types = cols())
GSE_78220.TMB$nTMB <- GSE_78220.TMB$TotalNonSyn / 38

info_hugo <- full_join(GSE_78220.TMB, GSE78220.APS, by = "PatientID")
info_hugo <- info_hugo[, c(1:5, 14:16, 13)]

# calculate TIGS score
info_hugo$TIGS <- log(info_hugo$nTMB) * info_hugo$APS

save(info_hugo, file = "results/Hugo2016_Info.RData")

Van Allen et al 2015 dataset

#----------------------------------------------------------
# Data source :
# Science 2015: Genomic correlates of response to CTLA4 blockade in metastatic melanoma
#--------------------------
library(readxl)
suppressPackageStartupMessages(library(tidyverse))
science2015_TMB <- read_xlsx("../data/Science2015_TMB_List_AllPatients.xlsx")
science2015_cli1 <- read_xlsx("../data/Science2015_Clinical.xlsx", sheet = 1)
science2015_cli2 <- read_xlsx("../data/Science2015_Clinical.xlsx", sheet = 2)
science2015_rna <- read_tsv("../data/MEL-IPI-Share.rpkm.gct", col_types = cols())

#---- Preprocess
vc.nonSilent <- c(
  "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
  "Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
  "In_Frame_Ins", "Missense_Mutation"
)

science2015_TMB %>%
  group_by(patient) %>%
  summarise(
    TotalMutation = n(),
    TotalNonSyn = sum(Variant_Classification %in% vc.nonSilent),
    nTMB = TotalNonSyn / 38
  ) -> science2015_TMB_summary

#---- remove duplicate symbols of rna data
science2015_rna %>%
  mutate(symbol = sub("(.*)\\..*$", "\\1", Description)) %>%
  dplyr::select(-Name, -Description) %>%
  dplyr::select(symbol, everything()) %>%
  as.data.frame() %>%
  mutate(mean_expr = rowMeans(.[, -1], na.rm = TRUE)) %>%
  arrange(symbol, desc(mean_expr)) %>%
  distinct(symbol, .keep_all = TRUE) %>%
  dplyr::select(-mean_expr) %>%
  as.tibble() -> science2015_ge

length(intersect(science2015_ge$symbol, GSE78220_Norm$Gene))
# the gene number of science paper is bigger than hugo 2016
# maybe miRNA and some other RNA included
# keep them almost equal\
science2015_ge <- science2015_ge %>%
  filter(symbol %in% GSE78220_Norm$Gene)

science2015_ge[, -1] <- log2(science2015_ge[, -1] + 1)

applyGSVA(merged_geneList,
  group_col = "Cell_type",
  gene_col = "Symbol",
  ExprMatList = list(science2015_ge), method = "gsva"
) -> gsva.VanAllen2015

gsva.VanAllen2015 <- calc_TisIIs(gsva.VanAllen2015[[1]])

APS_VanAllen <- gsva.VanAllen2015$APM
names(APS_VanAllen) <- colnames(science2015_ge)[-1]
APS_VanAllen <- (APS_VanAllen - min_APS) / (max_APS - min_APS)

APS_VanAllen
names(APS_VanAllen) <- sub("^MEL.IPI_(.*)\\.Tumor.*$", "\\1", names(APS_VanAllen))
APS_VanAllen

rna_VanAllen <- science2015_ge
colnames(rna_VanAllen)[-1] <- names(APS_VanAllen)
## save gene experssion
if (!file.exists("results/VanAllen2015_RNA.RData")) {
  save(rna_VanAllen, file = "results/VanAllen2015_RNA.RData")
}

APS_VanAllen_df <- tibble(
  patient = names(APS_VanAllen), APS = APS_VanAllen,
  IIS = gsva.VanAllen2015$IIS
)


# generate results for random selected APS genes
set.seed(123456)
# apm_gens = merged_geneList %>% dplyr::filter(Cell_type == "APM")
sapply(1:100, function(i) {
  tempGens <- sample(science2015_ge$symbol, 18)
  apm_gens$Symbol <- tempGens
  applyGSVA(apm_gens,
    group_col = "Cell_type",
    gene_col = "Symbol", ExprMatList = list(science2015_ge), method = "gsva"
  )
}) -> gsva.VanAllen2015.random

gsva.VanAllen2015.random <- as.data.frame(sapply(gsva.VanAllen2015.random, function(x) {
  (x$APM - min_APS) / (max_APS - min_APS)
}))
colnames(gsva.VanAllen2015.random) <- paste0("APS", 1:100)
rownames(gsva.VanAllen2015.random) <- names(APS_VanAllen)


science2015_cli <- bind_rows(
  dplyr::select(
    science2015_cli1, patient, age_start, RECIST, overall_survival, progression_free, primary, group,
    histology, stage, gender, dead, progression, neos50
  ),
  dplyr::select(
    science2015_cli2, patient, age_start, RECIST, overall_survival, progression_free, primary,
    group, histology, stage, gender, dead, progression
  ) %>% filter(patient %in% c("Pat20", "Pat91"))
)
# combine three datasets
info_VanAllen <- full_join(full_join(science2015_cli, APS_VanAllen_df, by = "patient"),
  science2015_TMB_summary,
  by = "patient"
)

info_VanAllen$TIGS <- info_VanAllen$APS * log(info_VanAllen$nTMB + 1) # to avoid TIGS <0, TIGS = log(TMB +1) *APM for this dataset

save(info_VanAllen, file = "results/VanAllen2015_Info.RData")

Snyder et al 2017 dataset

#----------------------------------------------------------------
# Contribution of systemic and somatic factors to clinical response and resistance in urothelial cancer: an exploratory multi-omic analysis
#---------------------------------------------------------------
# Data Source: https://github.com/XSLiuLab/multi-omic-urothelial-anti-pdl1
# This is a fork repository, original link can also be found in this link

library(tidyverse)
uro_cli <- read_csv("../data/data_clinical.csv", col_types = cols())
uro_counts <- read_csv("../data/data_counts.csv", col_types = cols())
uro_variants <- read_csv("../data/data_variants.csv",
  col_types = "c??????"
)
uro_rna <- read_csv("../data/data_kallisto.csv", col_types = cols())
uro_rna_wide <- reshape2::dcast(uro_rna, gene_name ~ patient_id, value.var = "est_counts")

# try remove duplicated genes
uro_rna_wide %>%
  as.data.frame() %>%
  mutate(mean_expr = rowMeans(.[, -1], na.rm = TRUE)) %>%
  arrange(gene_name, desc(mean_expr)) %>%
  distinct(gene_name, .keep_all = TRUE) %>%
  dplyr::select(-mean_expr) %>%
  as.tibble() -> uro_rna_ge

length(intersect(uro_rna_ge$gene_name, GSE78220_FPKM$Gene))
uro_rna_ge <- dplyr::filter(uro_rna_ge, gene_name %in% GSE78220_FPKM$Gene)
uro_rna_ge_Norm <- uro_rna_ge
uro_rna_ge_Norm[, -1] <- log2(uro_rna_ge_Norm[, -1] + 1)

## apply GSVA
applyGSVA(merged_geneList,
  group_col = "Cell_type",
  gene_col = "Symbol", ExprMatList = list(uro_rna_ge_Norm), method = "gsva"
) -> gsva.Snyder

gsva.Snyder <- calc_TisIIs(gsva.Snyder[[1]])

APS_Snyder <- gsva.Snyder$APM
APS_Snyder <- (APS_Snyder - min_APS) / (max_APS - min_APS)

## save gene expression
rna_Snyder <- uro_rna_ge_Norm
if (!file.exists("results/Snyder2017_RNA.RData")) {
  save(rna_Snyder, file = "results/Snyder2017_RNA.RData")
}

APS_Snyder_df <- tibble(
  patient_id = colnames(uro_rna_ge_Norm)[-1],
  APS = APS_Snyder,
  IIS = gsva.Snyder$IIS
)

# generate results for random selected APS genes
set.seed(123456)
# apm_gens = merged_geneList %>% dplyr::filter(Cell_type == "APM")
sapply(1:100, function(i) {
  tempGens <- sample(uro_rna_ge_Norm$gene_name, 18)
  apm_gens$Symbol <- tempGens
  applyGSVA(apm_gens,
    group_col = "Cell_type",
    gene_col = "Symbol", ExprMatList = list(uro_rna_ge_Norm), method = "gsva"
  )
}) -> gsva.Snyder.random

gsva.Snyder.random <- as.data.frame(sapply(gsva.Snyder.random, function(x) {
  (x$APM - min_APS) / (max_APS - min_APS)
}))
colnames(gsva.Snyder.random) <- paste0("APS", 1:100)
rownames(gsva.Snyder.random) <- colnames(uro_rna_ge_Norm)[-1]

uro_cli_2 <- uro_cli %>%
  dplyr::select(patient_id, Age, Sex, is_benefit, is_benefit_os, os, `Alive Status`) %>%
  mutate(event = ifelse(`Alive Status` == "Y", 0, 1))

uro_tmb <- uro_variants %>%
  group_by(patient_id) %>%
  summarise(TMB = n(), Nonsyn = sum(!is.na(gene_name)), nTMB = TMB / 38) %>%
  mutate(patient_id = as.character(patient_id))

info_Snyder <- dplyr::full_join(uro_cli_2, dplyr::full_join(uro_tmb, APS_Snyder_df))
info_Snyder$TIGS <- info_Snyder$APS * log(info_Snyder$nTMB)

save(info_Snyder, file = "results/Snyder2017_Info.RData")



save(gsva.hugo2016.random, gsva.Snyder.random, gsva.VanAllen2015.random,
  file = "results/randomAPS.RData"
)

Gene expression biomarkers for response to immune checkpoint blockade

This section we will describe how to calculate predicted value of gene signature biomarker using gene expression data.

Check if all gene signatures are in data.

Next we retrieve all gene expression values and then calculate average.

Combine them with sigList, then transform these data.frame to long format, calculate mean gene expression and transform back to wide format.

Show results as tables.

Lastly, merge the three result datasets for downstream analysis.

ROC comparison

This section we compare ROC (AUC value) of biomarkers for response prediction.

library(pROC)
#> Type 'citation("pROC")' for a citation.
#> 
#> Attaching package: 'pROC'
#> The following objects are masked from 'package:stats':
#> 
#>     cov, smooth, var
library(tidyverse)
# theme_set(theme_bw())
# library(cowplot)

#------------------------------------------
# Load data
load("results/Hugo2016_Info.RData")
load("results/VanAllen2015_Info.RData")
load("results/Snyder2017_Info.RData")
load("results/randomAPS.RData")

load("results/GEBiomarker.RData")
load("results/APS_alternative.GSVA.RData")

# Of note, output csv from TIDE website has extra "" in last line
#          please remove it before load data
TIDE_hugo <- read_csv("results/TIDE_output_hugo.csv", col_types = cols())
TIDE_Snyder <- read_csv("results/TIDE_output_Snyder.csv", col_types = cols())
TIDE_VanAllen <- read_csv("results/TIDE_output_VanAllen.csv", col_types = cols())

#------------------------------------------
# Hugo 2016
info_hugo <- info_hugo %>%
  dplyr::left_join(dplyr::filter(GEBiomarker, study == "Hugo2016") %>%
                     dplyr::select(-study), by = c("PatientID" = "sample")) %>%
  dplyr::left_join(TIDE_hugo %>% dplyr::select(Patient, TIDE), by = c("PatientID" = "Patient")) %>%  # Combine APS7
  dplyr::left_join(APS7.GSVA %>% 
                     dplyr::select(rowname, APS7), 
                   by = c("PatientID" = "rowname")) %>% 
  dplyr::rename(APS7_Sig = APS7) %>%  # differ from APS random
  dplyr::left_join(APS_MHC_II.GSVA %>% 
                     dplyr::select(rowname, APS_MHC_II), 
                   by = c("PatientID" = "rowname"))
  

# combine random data
gsva.hugo2016.random %>%
  rownames_to_column(var = "PatientID") %>%
  full_join(info_hugo, by = "PatientID") -> info_hugo

# Because less TIDE shows better outcome and
# Ref says ISG resistance signature (ISG.RS) is also associated with resistance to ICB
# We need reverse their score
# scale them to [0, 1]
# Then combine TMB with gene expression based markers
info_hugo = info_hugo %>% 
  dplyr::mutate(
    TIDE = -TIDE,
    ISG.RS = -ISG.RS
  ) %>% 
  dplyr::mutate_at(c("CD8", "IFNG", "IFNG.GS", "ISG.RS", 
                     "PDL1", "TIDE", "APS7_Sig", "APS_MHC_II"), scales::rescale) %>% 
  dplyr::mutate_at(c("CD8", "IFNG", "IFNG.GS", "ISG.RS", 
                     "PDL1", "TIDE", "APS7_Sig", "APS_MHC_II"),
                   list(comb_TMB= ~ . * log(nTMB+1)))

genROC = function(info, markers, response="Response", direction="<", APSrandom=1:100) {
  ROC_LIST = list()
  # Calc APS random
  APSrandom_list = paste0("APS", APSrandom)
  for (i in c(markers,APSrandom_list)) {
    message(i)
    ROC_LIST[[i]] = roc(info[[response]], info[[i]], direction="<")
  }
  ROC_LIST
}

suppressMessages(
  hugo2016_roc <- genROC(info_hugo, markers = setdiff(colnames(info_hugo)[106:126], 
                                                   "Treatment"))
)

#---------------------------
# VanAllen 2015

info_VanAllen <- info_VanAllen %>%
  dplyr::left_join(dplyr::filter(GEBiomarker, study == "VanAllen2015") %>%
                     dplyr::select(-study), by = c("patient" = "sample")) %>%
  dplyr::left_join(TIDE_VanAllen %>% dplyr::select(Patient, TIDE), by = c("patient" = "Patient")) %>% 
  dplyr::left_join(APS7.GSVA %>% 
                     dplyr::select(rowname, APS7), 
                   by = c("patient" = "rowname")) %>% 
  dplyr::rename(APS7_Sig = APS7) %>%  # differ from APS random
  dplyr::left_join(APS_MHC_II.GSVA %>% 
                     dplyr::select(rowname, APS_MHC_II), 
                   by = c("patient" = "rowname"))

info_VanAllen2 <- info_VanAllen %>%
  mutate(Response = case_when(
    group == "response" ~ "R",
    group == "nonresponse" ~ "NR",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(Response))


## -- using following command if only compare patients with TIGS availble
## -- the result are basically same
# info_VanAllen2 = info_VanAllen %>%
#     mutate(Response = case_when(
#         group == "response" ~ "R",
#         group == "nonresponse" ~"NR",
#         TRUE ~ NA_character_
#     )) %>% filter(!is.na(Response), !is.na(TIGS))

# combine random data
gsva.VanAllen2015.random %>%
  rownames_to_column(var = "patient") %>%
  full_join(info_VanAllen2, by = "patient") -> info_VanAllen2

info_VanAllen2 = info_VanAllen2 %>% 
  dplyr::mutate(
    TIDE = -TIDE,
    ISG.RS = -ISG.RS
  ) %>% 
  dplyr::mutate_at(c("CD8", "IFNG", "IFNG.GS", "ISG.RS", 
                     "PDL1", "TIDE", "APS7_Sig", "APS_MHC_II"), scales::rescale) %>% 
  dplyr::mutate_at(c("CD8", "IFNG", "IFNG.GS", "ISG.RS", 
                     "PDL1", "TIDE", "APS7_Sig", "APS_MHC_II"),
                   list(comb_TMB= ~ . * log(nTMB+1)))


suppressMessages(
  van2015_roc <- genROC(info_VanAllen2, 
                     markers = setdiff(colnames(info_VanAllen2)[114:136],
                                       c("TotalMutation", "TotalNonSyn", "Response")))
)

#---------------------------------------
# Snyder 2017

info_Snyder <- info_Snyder %>%
  dplyr::left_join(dplyr::filter(GEBiomarker, study == "Snyder2017") %>%
                     dplyr::select(-study), by = c("patient_id" = "sample")) %>%
  dplyr::left_join(TIDE_Snyder %>% dplyr::select(Patient, TIDE), by = c("patient_id" = "Patient")) %>% 
  dplyr::left_join(APS7.GSVA %>% 
                     dplyr::select(rowname, APS7), 
                   by = c("patient_id" = "rowname")) %>% 
  dplyr::rename(APS7_Sig = APS7) %>%  # differ from APS random
  dplyr::left_join(APS_MHC_II.GSVA %>% 
                     dplyr::select(rowname, APS_MHC_II), 
                   by = c("patient_id" = "rowname"))

# combine random data
gsva.Snyder.random %>%
  rownames_to_column(var = "patient_id") %>%
  full_join(info_Snyder, by = "patient_id") -> info_Snyder

info_Snyder = info_Snyder %>% 
  dplyr::mutate(
    TIDE = -TIDE,
    ISG.RS = -ISG.RS
  ) %>% 
  dplyr::mutate_at(c("CD8", "IFNG", "IFNG.GS", "ISG.RS", 
                     "PDL1", "TIDE", "APS7_Sig", "APS_MHC_II"), scales::rescale) %>% 
  dplyr::mutate_at(c("CD8", "IFNG", "IFNG.GS", "ISG.RS", 
                     "PDL1", "TIDE", "APS7_Sig", "APS_MHC_II"),
                   list(comb_TMB= ~ . * log(nTMB+1)))

suppressMessages(
  snyder2017_roc <- genROC(info_Snyder, 
                     markers = colnames(info_Snyder)[111:130], response = "is_benefit")
)


ggroc(list(TMB = van2015_roc$nTMB, TIGS = van2015_roc$TIGS, TIDE = van2015_roc$TIDE),
  legacy.axes = TRUE
) +
  labs(color = "Predictor") + 
  ggpubr::theme_classic2(base_size = 12)

From comparison of AUC value above, we can conclude that:

  • TIGS and TIDE are better than TMB
  • TIGS and TIDE have almost same performance in two melanoma datasets, however, performance of TIDE go down a little in urothelial cancer dataset, maybe resulting from the fact that TIDE is trained on data of melanoma and NSCLC cancer (this point is stated at TIDE website, cancer type except for melanoma and NSCLC may not work).
  • APS, TMB are also not bad predictors (they have almost same performance)
  • Other predictors are not good predictors, they perform differently in different datasets.

Next, we summary results above for final visualization.

Survival comparison

Above we already show APS, TMB, TIGS and TIDE are good predictors, this section we focus on them. We use cox model and separate patients into High and Low group in K-M survival analysis based on median biomarker predicted value.

library(survival)
library(survminer)

#----------------------
# Hugo 2016
hugo_os <- read_csv("../data/cell2016_OS.csv", col_types = cols())
info_hugo <- full_join(info_hugo, hugo_os, by = "PatientID")

# Cox model

# Cox result of TMB
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ nTMB, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~ 
#>     nTMB, data = info_hugo)
#> 
#>       coef exp(coef) se(coef)  z   p
#> nTMB -0.03      0.97     0.02 -1 0.2
#> 
#> Likelihood ratio test=3  on 1 df, p=0.08
#> n= 37, number of events= 19 
#>    (1 observation deleted due to missingness)

# Cox result of APS
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ APS, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~ 
#>     APS, data = info_hugo)
#> 
#>     coef exp(coef) se(coef)  z   p
#> APS -2.0       0.1      1.5 -1 0.2
#> 
#> Likelihood ratio test=2  on 1 df, p=0.2
#> n= 26, number of events= 12 
#>    (12 observations deleted due to missingness)

# Cox result of TIGS
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIGS, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~ 
#>     TIGS, data = info_hugo)
#> 
#>      coef exp(coef) se(coef)  z    p
#> TIGS -1.2       0.3      0.6 -2 0.03
#> 
#> Likelihood ratio test=6  on 1 df, p=0.02
#> n= 26, number of events= 12 
#>    (12 observations deleted due to missingness)

# Cox result of TIDE
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIDE, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~ 
#>     TIDE, data = info_hugo)
#> 
#>      coef exp(coef) se(coef)    z   p
#> TIDE -0.7       0.5      1.3 -0.5 0.6
#> 
#> Likelihood ratio test=0.3  on 1 df, p=0.6
#> n= 26, number of events= 12 
#>    (12 observations deleted due to missingness)


# K-M fit
info_hugo %>% # take care of NA values
  mutate(
    APS_Status = ifelse(APS > median(APS, na.rm = TRUE), "High",
      ifelse(is.na(APS), NA, "Low")
    ),
    # We reverse TIDE (do TIDE=-TIDE) in previous step
    # so here, we need to change the order
    TIDE_Status = ifelse(TIDE < median(TIDE, na.rm = TRUE), "High",
      ifelse(is.na(TIDE), NA, "Low")
    ),
    TMB_Status = ifelse(nTMB > median(nTMB), "High", "Low"),
    TIGS_Status = ifelse(TIGS > median(TIGS, na.rm = TRUE), "High",
      ifelse(is.na(TIGS), NA, "Low")
    )
  ) -> info_hugo2

# effect of TMB status on survival
fit1 <- survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TMB_Status, data = info_hugo2)
ggsurvplot(fit1,
  data = info_hugo2, pval = TRUE, fun = "pct",
  xlab = "Time (in days)"
)


#----------------------
# Van Allen 2015

# Cox model

# Cox result of TMB
coxph(Surv(overall_survival, dead == 1) ~ nTMB, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ nTMB, data = info_VanAllen)
#> 
#>        coef exp(coef) se(coef)  z   p
#> nTMB -0.007     0.993    0.006 -1 0.2
#> 
#> Likelihood ratio test=2  on 1 df, p=0.2
#> n= 110, number of events= 83 
#>    (2 observations deleted due to missingness)

# Cox result of APS
coxph(Surv(overall_survival, dead == 1) ~ APS, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ APS, data = info_VanAllen)
#> 
#>     coef exp(coef) se(coef)  z    p
#> APS -1.7       0.2      0.8 -2 0.03
#> 
#> Likelihood ratio test=5  on 1 df, p=0.02
#> n= 42, number of events= 29 
#>    (70 observations deleted due to missingness)

# Cox result of TIGS
coxph(Surv(overall_survival, dead == 1) ~ TIGS, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ TIGS, data = info_VanAllen)
#> 
#>      coef exp(coef) se(coef)  z    p
#> TIGS -0.5       0.6      0.3 -2 0.06
#> 
#> Likelihood ratio test=5  on 1 df, p=0.03
#> n= 40, number of events= 27 
#>    (72 observations deleted due to missingness)

# Cox result of TIDE
coxph(Surv(overall_survival, dead == 1) ~ TIDE, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ TIDE, data = info_VanAllen)
#> 
#>      coef exp(coef) se(coef) z     p
#> TIDE  0.5       1.7      0.2 3 0.003
#> 
#> Likelihood ratio test=8  on 1 df, p=0.004
#> n= 42, number of events= 29 
#>    (70 observations deleted due to missingness)

info_VanAllen2 %>%
  #filter(group != "long-survival") %>%
  mutate(
    APS_Status = ifelse(APS > median(APS, na.rm = TRUE), "High",
      ifelse(is.na(APS), NA, "Low")
    ),
    # We reverse TIDE (do TIDE=-TIDE) in previous step
    # so here, we need to change the order
    TIDE_Status = ifelse(TIDE < median(TIDE, na.rm = TRUE), "High",
      ifelse(is.na(TIDE), NA, "Low")
    ),
    TMB_Status = ifelse(nTMB > median(nTMB, na.rm = TRUE), "High",
      ifelse(is.na(nTMB), NA, "Low")
    ),
    TIGS_Status = ifelse(TIGS > median(TIGS, na.rm = TRUE), "High",
      ifelse(is.na(TIGS), NA, "Low")
    )
  ) -> info_VanAllen3

# effect of TMB status on survival
fit1 <- survfit(Surv(overall_survival, dead == 1) ~ TMB_Status, data = info_VanAllen3)
ggsurvplot(fit1,
  data = info_VanAllen3, pval = TRUE, fun = "pct",
  xlab = "Time (in days)"
)


#----------------------
# Snyder 2017

# Cox model

# Cox result of TMB
coxph(Surv(os, event) ~ nTMB, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ nTMB, data = info_Snyder)
#> 
#>       coef exp(coef) se(coef)  z   p
#> nTMB -0.02      0.98     0.02 -1 0.3
#> 
#> Likelihood ratio test=1  on 1 df, p=0.3
#> n= 22, number of events= 14 
#>    (6 observations deleted due to missingness)

# Cox result of APS
coxph(Surv(os, event) ~ APS, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ APS, data = info_Snyder)
#> 
#>     coef exp(coef) se(coef)    z   p
#> APS -1.0       0.4      1.1 -0.9 0.4
#> 
#> Likelihood ratio test=0.8  on 1 df, p=0.4
#> n= 25, number of events= 17 
#>    (3 observations deleted due to missingness)

# Cox result of TIGS
coxph(Surv(os, event) ~ TIGS, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ TIGS, data = info_Snyder)
#> 
#>      coef exp(coef) se(coef)  z   p
#> TIGS -0.5       0.6      0.3 -1 0.2
#> 
#> Likelihood ratio test=2  on 1 df, p=0.1
#> n= 22, number of events= 14 
#>    (6 observations deleted due to missingness)

# Cox result of TIDE
coxph(Surv(os, event) ~ TIDE, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ TIDE, data = info_Snyder)
#> 
#>      coef exp(coef) se(coef) z   p
#> TIDE    1         3        1 1 0.2
#> 
#> Likelihood ratio test=1  on 1 df, p=0.2
#> n= 25, number of events= 17 
#>    (3 observations deleted due to missingness)


info_Snyder %>%
  mutate(
    APS_Status = ifelse(APS > median(APS, na.rm = TRUE), "High",
      ifelse(is.na(APS), NA, "Low")
    ),
    # We reverse TIDE (do TIDE=-TIDE) in previous step
    # so here, we need to change the order
    TIDE_Status = ifelse(TIDE < median(TIDE, na.rm = TRUE), "High",
      ifelse(is.na(TIDE), NA, "Low")
    ),
    TMB_Status = ifelse(nTMB > median(nTMB, na.rm = TRUE), "High",
      ifelse(is.na(nTMB), NA, "Low")
    ),
    TIGS_Status = ifelse(TIGS > median(TIGS, na.rm = TRUE), "High",
      ifelse(is.na(TIGS), NA, "Low")
    )
  ) -> info_Snyder2

# effect of TMB status on survival
fit1 <- survfit(Surv(os, event) ~ TMB_Status, data = info_Snyder2)
ggsurvplot(fit1,
  data = info_Snyder2, pval = TRUE, fun = "pct",
  xlab = "Time (in days)"
)

From comparison of cox results and K-M plots above, we can conclude that:

  • TIGS is very robust, in all three datasets, “High group” all show statistically significant better survival than “Low group”.
  • TIDE performs just so-so in Hugo 2016 dataset, very good in Van Allen 2015 dataset. But, it losts its power in Snyder 2017 dataset, patients with low TIDE value should have better survival but in this dataset patients with high TIDE value have better survival.
  • Patients with High TMB or APS in three datasets all show trends of better survival but not statistically significant.

Next, we summary results (cox HR value and K-M fit) above for final visualization.

# summary cox result
cox_TMB <- list()
cox_APS <- list()
cox_TIGS <- list()
cox_TIDE <- list()

summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ nTMB, data = info_hugo)) -> cox_TMB[[1]]
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ APS, data = info_hugo)) -> cox_APS[[1]]
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIGS, data = info_hugo)) -> cox_TIGS[[1]]
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIDE, data = info_hugo)) -> cox_TIDE[[1]]

summary(coxph(Surv(overall_survival, dead == 1) ~ nTMB, data = info_VanAllen)) -> cox_TMB[[2]]
summary(coxph(Surv(overall_survival, dead == 1) ~ APS, data = info_VanAllen)) -> cox_APS[[2]]
summary(coxph(Surv(overall_survival, dead == 1) ~ TIGS, data = info_VanAllen)) -> cox_TIGS[[2]]
summary(coxph(Surv(overall_survival, dead == 1) ~ TIDE, data = info_VanAllen)) -> cox_TIDE[[2]]

summary(coxph(Surv(os, event) ~ nTMB, data = info_Snyder)) -> cox_TMB[[3]]
summary(coxph(Surv(os, event) ~ APS, data = info_Snyder)) -> cox_APS[[3]]
summary(coxph(Surv(os, event) ~ TIGS, data = info_Snyder)) -> cox_TIGS[[3]]
summary(coxph(Surv(os, event) ~ TIDE, data = info_Snyder)) -> cox_TIDE[[3]]

cox_df <- tibble(
  Group = c(rep("Hugo2016", 4), rep("VanAllen2015", 4), rep("Snyder2017", 4)),
  Biomarker = rep(c("TMB", "APS", "TIGS", "TIDE"), 3),
  HR = c(
    as.numeric(cox_TMB[[1]]$conf.int[1]),
    as.numeric(cox_APS[[1]]$conf.int[1]),
    as.numeric(cox_TIGS[[1]]$conf.int[1]),
    as.numeric(cox_TIDE[[1]]$conf.int[1]),
    as.numeric(cox_TMB[[2]]$conf.int[1]),
    as.numeric(cox_APS[[2]]$conf.int[1]),
    as.numeric(cox_TIGS[[2]]$conf.int[1]),
    as.numeric(cox_TIDE[[2]]$conf.int[1]),
    as.numeric(cox_TMB[[3]]$conf.int[1]),
    as.numeric(cox_APS[[3]]$conf.int[1]),
    as.numeric(cox_TIGS[[3]]$conf.int[1]),
    as.numeric(cox_TIDE[[3]]$conf.int[1])
  ),
  Pvalue = c(
    as.numeric(cox_TMB[[1]]$logtest["pvalue"]),
    as.numeric(cox_APS[[1]]$logtest["pvalue"]),
    as.numeric(cox_TIGS[[1]]$logtest["pvalue"]),
    as.numeric(cox_TIDE[[1]]$logtest["pvalue"]),
    as.numeric(cox_TMB[[2]]$logtest["pvalue"]),
    as.numeric(cox_APS[[2]]$logtest["pvalue"]),
    as.numeric(cox_TIGS[[2]]$logtest["pvalue"]),
    as.numeric(cox_TIDE[[2]]$logtest["pvalue"]),
    as.numeric(cox_TMB[[3]]$logtest["pvalue"]),
    as.numeric(cox_APS[[3]]$logtest["pvalue"]),
    as.numeric(cox_TIGS[[3]]$logtest["pvalue"]),
    as.numeric(cox_TIDE[[3]]$logtest["pvalue"])
  )
)

# show summary data as table
DT::datatable(cox_df,
  options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE,
  caption = "HR of biomarkers for response prediction in three ICB datasets"
)

Visualization of comparison profile

This section we visualize comparsion profile, integrate analysis results of 3 immunotherapy datasets into a big figure.

Integrate ROC curve.

theme_set(cowplot::theme_cowplot())

## -------- ROC

# Van Allen
p_roc_1 <- ggroc(list(TMB = van2015_roc$nTMB,
                      TIGS = van2015_roc$TIGS,
                      TIDE = van2015_roc$TIDE), legacy.axes = TRUE, aes = c("color", "linetype")) +
  scale_color_manual("", values = c("blue", "red", "black")) +
  scale_linetype_discrete("") + 
  theme(legend.position = c(0.8, 0.4)) + 
  geom_segment(aes(x=0, xend=1, y =0, yend=1), color = "grey", linetype = "dashed")
# Hugo
p_roc_2 <- ggroc(list(TMB = hugo2016_roc$nTMB,
                      TIGS = hugo2016_roc$TIGS,
                      TIDE = hugo2016_roc$TIDE), legacy.axes = TRUE, aes = c("color", "linetype")) +
  scale_color_manual("", values = c("blue", "red", "black")) +
  scale_linetype_discrete("") + 
  theme(legend.position = c(0.8, 0.4)) +
  geom_segment(aes(x=0, xend=1, y =0, yend=1), color = "grey", linetype = "dashed")
# Snyder
p_roc_3 <- ggroc(list(TMB = snyder2017_roc$nTMB,
                      TIGS = snyder2017_roc$TIGS,
                      TIDE = snyder2017_roc$TIDE), legacy.axes = TRUE, aes = c("color", "linetype")) +
  scale_color_manual("", values = c("blue", "red", "black")) +
  scale_linetype_discrete("") + 
  theme(legend.position = c(0.8, 0.4)) +
  geom_segment(aes(x=0, xend=1, y =0, yend=1), color = "grey", linetype = "dashed")

p_roc_4 <- ggroc(list(
  TIGS = van2015_roc$TIGS, APS = van2015_roc$APS, PDL1 = van2015_roc$PDL1,
  CD8 = van2015_roc$CD8, IFNG = van2015_roc$IFNG, IIS = van2015_roc$IIS, 
  ISG.RS = van2015_roc$ISG.RS, IFNG.GS = van2015_roc$IFNG.GS), aes = c("color", "linetype"),
  legacy.axes = TRUE) +
  scale_color_discrete("") +
  scale_linetype_discrete("") + 
  theme(legend.position = c(0.8, 0.4)) + 
  geom_segment(aes(x=0, xend=1, y =0, yend=1), color = "grey", linetype = "dashed")

p_roc_5 <- ggroc(list(
  TIGS = hugo2016_roc$TIGS, APS = hugo2016_roc$APS, PDL1 = hugo2016_roc$PDL1,
  CD8 = hugo2016_roc$CD8, IFNG = hugo2016_roc$IFNG, IIS = hugo2016_roc$IIS,
  ISG.RS = hugo2016_roc$ISG.RS, IFNG.GS = hugo2016_roc$IFNG.GS), aes = c("color", "linetype"),
  legacy.axes = TRUE) +
  scale_color_discrete("") +
  scale_linetype_discrete("") + 
  theme(legend.position = c(0.8, 0.4)) + 
  geom_segment(aes(x=0, xend=1, y =0, yend=1), color = "grey", linetype = "dashed")

p_roc_6 <- ggroc(list(
  TIGS = snyder2017_roc$TIGS, APS = snyder2017_roc$APS, PDL1 = snyder2017_roc$PDL1,
  CD8 = snyder2017_roc$CD8, IFNG = snyder2017_roc$IFNG, IIS = snyder2017_roc$IIS,
  ISG.RS = snyder2017_roc$ISG.RS, IFNG.GS = snyder2017_roc$IFNG.GS), aes = c("color", "linetype"),
  legacy.axes = TRUE) +
  scale_color_discrete("") +
  scale_linetype_discrete("") + 
  theme(legend.position = c(0.8, 0.4)) +
  geom_segment(aes(x=0, xend=1, y =0, yend=1), color = "grey", linetype = "dashed")


p_extra <- plot_grid(p_roc_4, p_roc_5, p_roc_6, nrow = 1)
save_plot(filename = "extra_roc_comparison.pdf", plot = p_extra, nrow = 1, 
          ncol = 3, base_aspect_ratio = 1.5)
show_cols = c("APS", "APSr", "IIS", "TMB", "TIGS", "TIDE", "CD8", "PDL1", "IFNG", "IFNG.GS", "ISG.RS")
show_extcols = c("TIDE_comb_TMB", "APS7_Sig", "APS7_Sig_comb_TMB")

## -------- AUC values
auc_df_long <- auc_df %>%
  dplyr::select(c(colnames(auc_df)[1:20], "APSr", "Group")) %>% 
  tidyr::gather(key = "biomarker", value = "auc", -Group)

# p_auc_1 =
auc_df_long$auc <- round(auc_df_long$auc, digits = 3)

# AUC for PHBR comes from result of code/PHBR_comparison_*.R

auc_df_long %>%
  filter(Group == "VanAllen2015", biomarker %in% show_cols) %>%
  ggplot(aes(x = biomarker, y = auc)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = auc), hjust = 1) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  ylim(0, 0.85) + ylab("AUC") + theme(axis.title.y = element_blank()) +
  coord_flip() -> p_auc_1

auc_df_long %>%
  filter(Group == "Hugo2016", biomarker %in% show_cols) %>%
  ggplot(aes(x = biomarker, y = auc)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = auc), hjust = 1) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  ylim(0, 0.8) + ylab("AUC") + theme(axis.title.y = element_blank()) +
  coord_flip() -> p_auc_2

auc_df_long %>%
  filter(Group == "Snyder2017", biomarker %in% show_cols) %>%
  ggplot(aes(x = biomarker, y = auc)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = auc), hjust = 1) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  ylim(0, 0.8) + ylab("AUC") + theme(axis.title.y = element_blank()) +
  coord_flip() -> p_auc_3

auc_df_long %>%
  filter(Group == "VanAllen2015", biomarker %in% show_extcols) %>%
  add_row(Group="VanAllen2015", biomarker=c("PHBR_I", "PHBR_I_comb_TMB"), auc=c(0.57, 0.71)) %>%  
  ggplot(aes(x = biomarker, y = auc)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = auc), hjust = 1) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  ylim(0, 0.85) + ylab("AUC") + theme(axis.title.y = element_blank()) +
  coord_flip() -> p_auc_ext1

auc_df_long %>%
  filter(Group == "Hugo2016", biomarker %in% show_extcols) %>%
  ggplot(aes(x = biomarker, y = auc)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = auc), hjust = 1) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  ylim(0, 0.8) + ylab("AUC") + theme(axis.title.y = element_blank()) +
  coord_flip() -> p_auc_ext2

auc_df_long %>%
  filter(Group == "Snyder2017", biomarker %in% show_extcols) %>%
  add_row(Group="Snyder2017", biomarker=c("PHBR_I", "PHBR_I_comb_TMB"), auc=c(0.68, 0.52)) %>%  
  ggplot(aes(x = biomarker, y = auc)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = auc), hjust = 1) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  ylim(0, 0.8) + ylab("AUC") + theme(axis.title.y = element_blank()) +
  coord_flip() -> p_auc_ext3
#--- KM-plots
p_surv_APS <- list()
p_surv_TMB <- list()
p_surv_TIGS <- list()
p_surv_TIDE <- list()

p_surv_APS[[1]] <- ggsurvplot(km_APS$VanAllen,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("APS High", "APS Low")
)
p_surv_APS[[2]] <- ggsurvplot(km_APS$Hugo,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("APS High", "APS Low")
)
p_surv_APS[[3]] <- ggsurvplot(km_APS$Snyder,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("APS High", "APS Low")
)

p_surv_TMB[[1]] <- ggsurvplot(km_TMB$VanAllen,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TMB High", "TMB Low")
)
p_surv_TMB[[2]] <- ggsurvplot(km_TMB$Hugo,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TMB High", "TMB Low")
)
p_surv_TMB[[3]] <- ggsurvplot(km_TMB$Snyder,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TMB High", "TMB Low")
)

p_surv_TIGS[[1]] <- ggsurvplot(km_TIGS$VanAllen,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TIGS High", "TIGS Low")
)
p_surv_TIGS[[2]] <- ggsurvplot(km_TIGS$Hugo,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.4), legend.title = element_blank(), legend.labs = c("TIGS High", "TIGS Low")
)
p_surv_TIGS[[3]] <- ggsurvplot(km_TIGS$Snyder,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TIGS High", "TIGS Low")
)

p_surv_TIDE[[1]] <- ggsurvplot(km_TIDE$VanAllen,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TIDE High", "TIDE Low")
)
p_surv_TIDE[[2]] <- ggsurvplot(km_TIDE$Hugo,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TIDE High", "TIDE Low")
)
p_surv_TIDE[[3]] <- ggsurvplot(km_TIDE$Snyder,
  pval = TRUE, fun = "pct",
  xlab = "Time (in days)", palette = c("red", "blue"),
  legend = c(0.8, 0.9), legend.title = element_blank(), legend.labs = c("TIDE High", "TIDE Low")
)


p_aps <- plot_grid(plotlist = lapply(p_surv_APS, function(x) x$plot), ncol = 3)
save_plot(filename = "aps_kmplot.pdf", plot = p_aps, ncol = 3, base_aspect_ratio = 1.3)
# ggpar(p , font.legend = c(12, "bold", "black"))

Save as pdf for publication.

Session info

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS:   /home/public/R/R-base/lib64/R/lib/libRblas.so
#> LAPACK: /home/public/R/R-base/lib64/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] grid      stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] cowplot_1.0.0       survminer_0.4.6     pROC_1.15.3        
#>  [4] ggrepel_0.8.1       forestplot_1.9      checkmate_1.9.4    
#>  [7] survival_3.1-7      ggpubr_0.2.3        magrittr_1.5       
#> [10] scales_1.0.0        corrplot_0.84       pheatmap_1.0.12    
#> [13] forcats_0.4.0       stringr_1.4.0       purrr_0.3.3        
#> [16] tidyr_1.0.0         tibble_2.1.3        ggplot2_3.2.1      
#> [19] tidyverse_1.2.1     readr_1.3.1         dplyr_0.8.3        
#> [22] UCSCXenaTools_1.2.7 rmdformats_0.3.5    knitr_1.26         
#> [25] pacman_0.5.1       
#> 
#> loaded via a namespace (and not attached):
#>  [1] utf8_1.1.4             questionr_0.7.0        R.utils_2.9.0         
#>  [4] tidyselect_0.2.5       lme4_1.1-21            robust_0.4-18.1       
#>  [7] htmlwidgets_1.5.1      munsell_0.5.0          codetools_0.2-16      
#> [10] DT_0.10                future_1.15.0          miniUI_0.1.1.1        
#> [13] withr_2.1.2            Brobdingnag_1.2-6      metaBMA_0.6.2         
#> [16] colorspace_1.4-1       highr_0.8              rstudioapi_0.10       
#> [19] stats4_3.6.0           DescTools_0.99.30      robustbase_0.93-5     
#> [22] rcompanion_2.3.7       ggsignif_0.6.0         listenv_0.7.0         
#> [25] labeling_0.3           emmeans_1.4.2          rstan_2.19.2          
#> [28] KMsurv_0.1-5           mnormt_1.5-5           MCMCpack_1.4-4        
#> [31] bridgesampling_0.7-2   coda_0.19-3            vctrs_0.2.0           
#> [34] generics_0.0.2         TH.data_1.0-10         metafor_2.1-0         
#> [37] xfun_0.11              R6_2.4.0               BayesFactor_0.9.12-4.2
#> [40] reshape_0.8.8          logspline_2.1.15       assertthat_0.2.1      
#> [43] promises_1.1.0         multcomp_1.4-10        ggExtra_0.9           
#> [46] gtable_0.3.0           multcompView_0.1-7     globals_0.12.4        
#> [49] processx_3.4.1         mcmc_0.9-6             sandwich_2.5-1        
#> [52] rlang_0.4.1            MatrixModels_0.4-1     EMT_1.1               
#> [55] zeallot_0.1.0          splines_3.6.0          TMB_1.7.15            
#> [58] lazyeval_0.2.2         broom_0.5.2            inline_0.3.15         
#> [61] abind_1.4-5            yaml_2.2.0             reshape2_1.4.3        
#> [64] modelr_0.1.5           crosstalk_1.0.0        backports_1.1.5       
#> [67] httpuv_1.5.2           tools_3.6.0            bookdown_0.15         
#> [70] psych_1.8.12           ellipsis_0.3.0         RColorBrewer_1.1-2    
#> [73] WRS2_1.0-0             ez_4.4-0               Rcpp_1.0.3            
#>  [ reached getOption("max.print") -- omitted 100 entries ]

Supplementary analyses

Random selection of APM genes or IIS genes

TCGA contains about 10,000 patients with RNASeq data, it is hard to simulate GSVA serveral times with whole data (too much computation), thus we using a sampling strategy as following:

  • randomly select RNAseq data from 500 patients.
  • calculate GSVA score
    • calculate once from right genes.
    • randomly select genes (10 times) and run GSVA.
  • calculate spearman correlation coefficient for normal APS and normal IIS, random APS and normal IIS, normal APS and random IIS, respectively. The latter two have ten values, we use their mean.
  • repeat the process 100 times.

The GSVA score calculation is implemented by script code/random_simulation.R (this will take much time to finish). Next we calculate correlation using following commands.

Now, we plot spearman correlation efficient of 3 types.

We can clearly see that if we change gene list for APS or IIS calculation, the strong positive correlation between APS and IIS will no longer exist.

Explore correlation between ORR and TIGS from TCGA data

A reviewer commented on our analysis which showed in section “TIGS definition and performance”:

it appears that the authors lumped together data from different data sets (e.g. reference 26 for tumor mutational burden and TCGA data for expression)

This analysis we adopted same idea from Tumor Mutational Burden and Response Rate to PD-1 inhibition. To further check reliability of this analysis, here we use gene expression and TMB data from TCGA and ORR data from clinical literatures.

Firstly we load TCGA data and calculate TIGS score.

Next we calculate median of APS, TMB, TIGS and their corresponding sample counts for each tumor type.

We then save this data and merge them with ORR data according to corresponding tumor types.

After merging data, we load data into R.

Now we plot figures.

TMB

Significant correlation is observed.

R and R square can be given as:

Shixiang Wang

Xue-Song Liu (Corresponding author)

2019-11-16