Antigen presentation and tumor immunogenicity in cancer immunotherapy response prediction

Please note this work under Apache License v2 license, copyright belongs to Shixiang Wang and Xue-Song Liu. And the study has been applied for a national patent in China.

This document is compiled from an Rmarkdown file which contains all code or description necessary to reproduce the analysis for the accompanying project. Each section below describes a different component of the analysis and all numbers and figures are generated directly from the underlying data on compilation.


The preprocessing step is dependent on R software and some R packages, if you have not used R yet or R is not installed, please see

R packages:

  • UCSCXenaTools - download data from UCSC Xena
  • GEOquery - download data from NCBI GEO database
  • tidyverse - operate data, plot
  • data.table - operate data
  • survival - built in R, used to do survival analysis
  • metafor, metawho - meta-analysis
  • forestmodel - generate forestplot for meta-analysis model
  • forestplot - plot forestplot
  • survminer - plot survival fit
  • pROC - ROC analysis and visualization
  • TCGAmutations - download TCGA mutation data
  • DT - show data table as a table in html
  • GSVA - GSVA algorithm implementation
  • ggstatsplot - plot scatter with linear fit
  • corrplot - plot correlation
  • knitr, rmdformats - used to compile this file
  • readxl - read xlsx data
  • Some other dependent packages

These R packages are easily searched by internet and installed from either CRAN or Bioconductor, they have no strict version requirements to reproduce the following analyses.

Data download and preprocessing

This part will clearly describe how to process raw data but not provide an easy script for re-running all preprocess steps just by a click due to the complexity of data preprocessing.

TCGA Pan-cancer data download

TCGA Pan-cancer data (Version 2017-10-13), including datasets of clinical informaiton, gene expression, are downloaded from UCSC Xena via R package UCSCXenaTools.

UCSCXenaTools is developed by Shixiang and it is an R package to robustly access data from UCSC Xena data hubs. More please see paper: Wang et al., (2019). The UCSCXenaTools R package: a toolkit for accessing genomics data from UCSC Xena platform, from cancer multi-omics to single-cell RNA-seq. Journal of Open Source Software, 4(40), 1627,

If you have not installed this package yet, please run following command in R.

Load the package with:

Obtain datasets information available at TCGA data hubs of UCSC Xena:

Obtain clinical datasets of TCGA:

Obtain gene expression datasets of TCGA:

Create data queries and download them:

The RNASeq data we downloaded are pancan normalized.

For comparing data within independent cohort (like TCGA-LUAD), we recommend to use the “gene expression RNAseq” dataset. For questions regarding the gene expression of this particular cohort in relation to other types tumors, you can use the pancan normalized version of the “gene expression RNAseq” data. For comparing with data outside TCGA, we recommend using the percentile version if the non-TCGA data is normalized by percentile ranking. For more information, please see our Data FAQ: here.

These datasets are downloaded to local machine, we need to load them into R. However, whether filenames of datasets or contents in datasets all look messy, next we need to clean them before real analysis.

TCGA pan-cancer data clean

Clean filenames

First, clean filenames.


Obtain TCGA project abbreviations from and read as a data.frame in R.

Compare difference.

Remove TCGA.FPPP.sampleMap__FPPP_clinicalMatrix.gz which has no RNAseq data. Remove other 3 datasets which merge more than one TCGA study, only keep individual study from TCGA.

Clean clinical datasets

Now, we read TCGA clinical datasets and clean them.

#> ── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
#> ✔ tibble  2.1.3     ✔ stringr 1.4.0
#> ✔ tidyr   1.0.0     ✔ forcats 0.4.0
#> ── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
# keep valuable columns of clinical datasets
select_cols <- c(
  "sampleID", "OS", "OS.time", "OS.unit", "RFS", "RFS.time", "RFS.unit",
  "age_at_initial_pathologic_diagnosis", "gender", "tobacco_smoking_history",
  "tobacco_smoking_history_indicator", "sample_type", "pathologic_M",
  "pathologic_N", "pathologic_T", "pathologic_stage"

# read clinical files
Cli_DIR <- paste0(TCGA_DIR, "/Clinical")

# Load and preprocess clinical data
Clinical_List <- XenaPrepare(paste0(Cli_DIR, "/", Clinical_filelist2))
#>  [1] "TCGA.ACC.sampleMap__ACC_clinicalMatrix.gz"  
#>  [2] "TCGA.BLCA.sampleMap__BLCA_clinicalMatrix.gz"
#>  [3] "TCGA.BRCA.sampleMap__BRCA_clinicalMatrix.gz"
#>  [4] "TCGA.CESC.sampleMap__CESC_clinicalMatrix.gz"
#>  [5] "TCGA.CHOL.sampleMap__CHOL_clinicalMatrix.gz"
#>  [6] "TCGA.COAD.sampleMap__COAD_clinicalMatrix.gz"
#>  [7] "TCGA.DLBC.sampleMap__DLBC_clinicalMatrix.gz"
#>  [8] "TCGA.ESCA.sampleMap__ESCA_clinicalMatrix.gz"
#>  [9] "TCGA.GBM.sampleMap__GBM_clinicalMatrix.gz"  
#> [10] "TCGA.HNSC.sampleMap__HNSC_clinicalMatrix.gz"
#> [11] "TCGA.KICH.sampleMap__KICH_clinicalMatrix.gz"
#> [12] "TCGA.KIRC.sampleMap__KIRC_clinicalMatrix.gz"
#> [13] "TCGA.KIRP.sampleMap__KIRP_clinicalMatrix.gz"
#> [14] "TCGA.LAML.sampleMap__LAML_clinicalMatrix.gz"
#> [15] "TCGA.LGG.sampleMap__LGG_clinicalMatrix.gz"  
#> [16] "TCGA.LIHC.sampleMap__LIHC_clinicalMatrix.gz"
#> [17] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
#> [18] "TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz"
#> [19] "TCGA.MESO.sampleMap__MESO_clinicalMatrix.gz"
#> [20] "TCGA.OV.sampleMap__OV_clinicalMatrix.gz"    
#> [21] "TCGA.PAAD.sampleMap__PAAD_clinicalMatrix.gz"
#> [22] "TCGA.PCPG.sampleMap__PCPG_clinicalMatrix.gz"
#> [23] "TCGA.PRAD.sampleMap__PRAD_clinicalMatrix.gz"
#> [24] "TCGA.READ.sampleMap__READ_clinicalMatrix.gz"
#> [25] "TCGA.SARC.sampleMap__SARC_clinicalMatrix.gz"
#> [26] "TCGA.SKCM.sampleMap__SKCM_clinicalMatrix.gz"
#> [27] "TCGA.STAD.sampleMap__STAD_clinicalMatrix.gz"
#> [28] "TCGA.TGCT.sampleMap__TGCT_clinicalMatrix.gz"
#> [29] "TCGA.THCA.sampleMap__THCA_clinicalMatrix.gz"
#> [30] "TCGA.THYM.sampleMap__THYM_clinicalMatrix.gz"
#> [31] "TCGA.UCEC.sampleMap__UCEC_clinicalMatrix.gz"
#> [32] "TCGA.UCS.sampleMap__UCS_clinicalMatrix.gz"  
#> [33] "TCGA.UVM.sampleMap__UVM_clinicalMatrix.gz"
sub("TCGA\\.(.*)\\.sampleMap.*", "\\1", names(Clinical_List)) -> project_code

TCGA_Clinical <- tibble()
# for loop
for (i in 1:length(project_code)) {
  clinical <- names(Clinical_List)[i]
  project <- project_code[i]
  df <- Clinical_List[[clinical]]
  col_exist <- select_cols %in% colnames(df)
  res <- tibble()
  if (!all(col_exist)) {
    res <- df[, select_cols[col_exist]]
    res[, select_cols[!col_exist]] <- NA
  } else {
    res <- df[, select_cols]
  res$Project <- project
  res %>% select(Project, select_cols) -> res
  TCGA_Clinical <- bind_rows(TCGA_Clinical, res)

rm(res, df, i, clinical, project, col_exist) # remove temp variables

View the clinical data, check variables.