Copy number alteration fingerprint predicts the clinical response of oxaliplatin-based chemotherapy in metastatic colorectal cancer


EQUAL CONTRIBUTION

J.W., J.W., Z.T. and T.W. contributed equally to this work as joint first authors.

X.L., X.L. and X.S.L contributed equally to this work as joint senior authors.

Introduction

Oxaliplatin-based chemotherapy is routinely used in metastatic colorectal cancer (mCRC) patient, however only about half of patients show clinical response after therapy. Accurate biomarkers for predicting oxaliplatin chemotherapy clinical response are still lacking. Here we developed and validated a tumor genomic DNA copy number alteration (CNA) based biomarker.

In total 271 mCRC patients who underwent primary tumor resection and postoperatively treated with oxaliplatin-based chemotherapy are included in this study. Tissue samples and their clinical data were obtained from the Fudan University Shanghai Cancer Center (FUSCC) and Tongji Hospital of Tongji University. The sample selection criteria are detailed here.

PART 1: Data preprocessing

To develop a stable, effective, and cost-efficient predictive biomarker for oxaliplatin-based chemotherapy response in metastatic colorectal cancer patients, we utilized shallow whole-genome sequencing(sWGS) data to extract copy number alteration(CNA) features. These features, combined with clinical characteristics, were used to construct a machine learning model, ultimately resulting in the development of a highly predictive model known as the CNA fingerprint.

1.1 Raw data processing

suppose i is the sample name. Using follow tools:

fastp

Quality control.

fastp -i $workdir/$fq1 -I $workdir/$fq2 \
-o ${i}_fp_1.fq.gz -O  ${i}_fp_2.fq.gz 
-h ${i}_fastp.html \
-w 10

BWA MEM

Mapping.

bwa mem -M -R "@RG\tID:$i\tLM:$i\tSM:$i\tPL:illumina\tPU:$i" 
-t 8 \
/path_to_hg38_genome/GRCh38.d1.vd1.fa \
$workdir/${i}_fp_1.fq.gz \
$workdir/${i}_fp_2.fq.gz > $workdir/${i}.sam

Samtools

Transform sam to bam file.

samtools view -@ 8 -bS $workdir/${i}.sam > $workdir/${i}_unsort.bam

Picard

Sort the bam file.

java -jar -Xmx12G -Djava.io.tmpdir=/path_to_tmp/tmp \
/path_to_picard-2.20.6-0/picard.jar SortSam \
I=$workdir/${i}_unsort.bam \
O=$workdir/${i}_sorted.bam \
SORT_ORDER=coordinate

1.2 CNA features calling

suppose you have got your bam file. Using follow tools:

QDNAseq

CBC segment,the defult genome is hg19,here we use hg38,and binsize as 100.

library(QDNAseq)
library(QDNAseq.hg38)
library(stringr)
library(ACE)
library(DNAcopy)

args <- commandArgs(trailingOnly = TRUE)
#args 1,sample name
#args 2,path to bam
#args 3,the output path

i <- args[1]
s <- paste0(i,"_sorted.bam")
#binsize: 100
bins <- getBinAnnotations(binSize=100,genome="hg38")
readCounts <- binReadCounts(bins,bamfile=paste0(args[2],"/",s))
readCountsFiltered <- applyFilters(readCounts,residual=TRUE, blacklist=TRUE)
readCountsFiltered <- estimateCorrection(readCountsFiltered)
copyNumbers <- correctBins(readCountsFiltered)
copyNumbersNormalized <- normalizeBins(copyNumbers)
copyNumbersSmooth <- smoothOutlierBins(copyNumbersNormalized)
copyNumbersSegmented <- segmentBins(copyNumbersSmooth, transformFun="log2")
copyNumbersSegmented <- normalizeSegmentedBins(copyNumbersSegmented)

save(copyNumbersSegmented,file="/your_path/copyNumbersSegmented.rda")

you can run like:

Rscript QDNAseq.R $sampleid $path/yourbamfile $outdir

ACE

Fit the cellularity and ploidy

sqm <- squaremodel(copyNumbersSegmented, QDNAseqobjectsample = 1, 
                   penalty = 0.65, penploidy = 0.65, 
                   ptop = 5.3, pbottom = 1.8, prows = 250,
                   cellularities = seq(50, 100, length.out = 50),
                   sgc = c("X", "Y"),exclude=c())

# Here we get the purity ploidy and error values of all the fits, referring to the official ACE document that states “the minimum error is not necessarily the best fit”, it is recommended to select combinations with higher cellular purity in the similar minimum error set, and at the same time select combinations with higher ploidy based on the a priori knowledge of the genomes of colon cancer.

mdf <- sqm$minimadf
mdf <- mdf[order(mdf$error,-mdf$cellularity),]

# Manual screening and determination ploidy and cellularity,for example:

ploidy = 7.6
cellularity = 0.8


# Absoult CNA calling
m<-ACEcall(copyNumbersSegmented, QDNAseqobjectsample = TRUE, scellularity = cellularity,
           ploidy = ploidy, standard =1, plot = FALSE,
           qcutoff = -3, subcutoff = 0.2, trncname = FALSE,
           cap = 12, bottom = 0, 
           onlyautosomes = TRUE
           )

segment <- data.frame(copyNumbersSegmented@featureData@data[1:nrow(m),1:3],segment_mean=m$segment_mean)
segment <- na.omit(segment)
segment$segment_mean <- round(segment$segment_mean)

Sigminer

CN features calling

here we got the CNA segment files,first we do a filter,delect all segments with num_marks <10,make the file colnames as: chr,start,end segval,sample

We extracted CNA features described in our previous studies(2021 Wang et.al, 2023 Tao et.al) and also global CNA status indicator: CNA burden, amplification/deletion of each chromosome.

data <- readRDS("../data/exampleSeg.rds")

CNF_call <- function(data,note = NULL){
  library(sigminer)
  CN <- data[,c("chromosome", "start", "end", "segVal","sample")]
  
  # Wang et al.
  cn <- read_copynumber(CN,
                        seg_cols = c("chromosome", "start", "end", "segVal"),
                        samp_col = "sample",
                        genome_build = "hg38", complement = FALSE)
  #
  cn_sum <- cn@summary.per.sample
  rownames(cn_sum) <- cn_sum$sample
  tally_w <- sig_tally(cn, method = "W", cores = 10)
  if(!is.null(note)){
     saveRDS(tally_w,file = paste0("./data/results/",note,"_tally_W.rds"))
  }
  
  cn_fet <- tally_w[["nmf_matrix"]]
  feature_m <- data.frame(cn_fet)
  colnames(feature_m) <- colnames(cn_fet)
  feature_m$sample <- rownames(cn_fet)
  # if only one sample,rownames(cn_fet)will be null
  colon_cnf <- merge(cn_sum,feature_m,by = "sample", all = T)
  
  # Yao et al.
  #HRD(cutoff 0.2)
  library(HRDCNA)
  score_yhz <- HRDprediction(data = feature_m)
  colon_cnf <- merge(colon_cnf,score_yhz,by = "sample", all = T)
  
  # Tao et al.
  tally_X_noLOH <- sig_tally(cn,
                             method = "X",
                             add_loh = FALSE,
                             cores = 10
  )
  if(!is.null(note)){
    saveRDS(tally_X_noLOH, file = paste0("./data/results/",note,"_tally_X_noLOH.rds"))
  }
  
  cn_fet <- tally_X_noLOH[["all_matrices"]][["simplified_matrix"]]
  feature_m <- data.frame(cn_fet)
  colnames(feature_m) <- colnames(cn_fet)
  feature_m$sample <- rownames(cn_fet)
  
  colon_cnf <- merge(colon_cnf,feature_m,by = "sample", all = T)
  
  # global CNA status(amp and del states of every chr)
  gCNA <- data.frame(matrix(nrow = 0,ncol = 44))
  coln <- c()
  for(i in 1:22) {
    coln <- c(coln, paste0("chr", i, "[AMP]"), paste0("chr", i, "[DEL]"))
  }
  chrname <- paste0("chr",1:22)
  
  for(i in unique(data$sample)){
    df <- data[data$sample==i,]
    onechr <- c(i)
    for(j in chrname){
      df1<- df[df$chromosome==j,]
      if(any(df1$segVal >2)){
        amp=1
      }else{
        amp=0
      }
      
      if(any(df1$segVal <2)){
        del=1
      }else{
        del=0
      }
      onechr <- c(onechr,amp,del)
    }
    onechr <- data.frame(t(onechr))
    colnames(onechr) <- c("sample",coln)
    
    gCNA <- rbind(gCNA,onechr)
  }
  colon_cnf <- merge(colon_cnf,gCNA,by="sample")
  return(colon_cnf)
}

features <- CNF_call(data,note = NULL)

saveRDS(features,file = "../data/examplefeatures.rds")

1.3 Feature distribution

TCGA data as an example, the distribution(W method and X method) of its features is demonstrated by function show_catalogue.

library(sigminer)
ace_tally_W <- readRDS("../data/ace_tally_W.rds")
tcga_tally_W <- readRDS("../data/tcga_tally_W.rds")
show_catalogue(t(ace_tally_W[, -1]), style = "cosmic", mode = "copynumber", x_label_angle = 90)
show_catalogue(t(tcga_tally_W[, -1]), style = "cosmic", mode = "copynumber", x_label_angle = 90)

ace_tally_X <- readRDS("../data/ace_tally_X.rds")
tcga_tally_X <- readRDS("../data/tcga_tally_X.rds")
show_catalogue(t(ace_tally_X[, -1]), style = "cosmic", mode = "copynumber", method = "T", x_label_angle = 90)
show_catalogue(t(tcga_tally_X[, -1]), style = "cosmic", mode = "copynumber", method = "T", x_label_angle = 90)

The CNA feature distribution in TCGA mCRCs.

The CNA feature distribution in Train cohort.

The CNA feature distribution in Train cohort.

The CNA feature distribution in Test1,Test2 and Test3 cohort can be presented in the same way.

PART 2: Model training

2.1 Internal benchmark

We performed internal benchmark to compare the performance of different CNA feature sets and 4 different machine learning models ( extreme gradient boosting (XGBoost), neural networks (NNET), random forest (RF), light gradient boosting machine (LGBM) ) in the training dataset.

The CNA information we compare here is the CNA signature reported by Drew et.al and Tao et.al.

Example calling process of CNA signature reported by Drews et.al.

# devtools::install_github("markowetzlab/CINSignatureQuantification", build_vignettes = TRUE, dependencies = TRUE)
# input format
# chromosome    start   end segVal  sample
# 1 61735   249224388   3.1 TCGA-BT-A20P

library(tidyverse)
library(CINSignatureQuantification)

data <- readRDS("../data/exampleSeg.rds")

mySigs = quantifyCNSignatures(swgs_segment.ace_train)
D_sig <- mySigs@activities[["rawAct0"]]

#Low segment counts: Features not computed for 8 of 121 samples - see `getSampleFeatures()`

Example fitting process of CNA signature reported by Tao et.al.

# download pcawg_cn_sigs_CN136_SA from https://github.com/XSLiuLab/Pan-cancer_CNA_signature/data
examplefeatures <- readRDS("../data/examplefeatures.rds")

data <- examplefeatures[,c(89:224)]
m <- t(data)
colnames(m) <- examplefeatures$sample

T_sig <- sig_fit(as.matrix(m),
                      sig = pcawg_cn_sigs_CN136_SA$Signature,
                      method="SA",
                      return_class="matrix"
)

2.2 Model Training

We perform model parameter tuning via grid search, considering the vast number of combinations, it would be advisable to first conduct parameter searches for a subset, then utilize the optimized subset to search for other parameters, rather than running all parameters simultaneously.

Here, we omit the step of parameter searching and detailed data for other models, just directly present all the considered parameters.

  • We use scale_pos_weight_value to address the issue of class imbalance. The weight is set as the sum of negative samples divided by the sum of positive samples.
library(xgboost)
library(caret)
library(Matrix)

train.mat <- readRDS("../data/train_mat.rds")
dtrain_mat <- train.mat
colnames(dtrain_mat)[1:(dim(dtrain_mat)[2] - 1)] <- paste0("feature", seq(1:(dim(dtrain_mat)[2] - 1)))
dtrain_input <- sparse.model.matrix(res ~ ., data = dtrain_mat)[, -1] %>%
  as.matrix()
train_label <- dtrain_mat$res == "response"
control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

scale_pos_weight_value <- sum(dtrain_mat3$res == "nonresponse") / sum(dtrain_mat3$res == "response")

param_grid <- expand.grid(
  max_depth = c(2:15),
  eta = c(1e-5, 1e-4, 0.001, 0.01, 0.1),
  max_delta_step = c(1:30),
  alpha = c(0, 0.1, 0.5),
  lambda = c(0, 1, 0.5),
  gamma = c(0, 1, 5, 10),
  scale_pos_weight = scale_pos_weight_value,
  colsample_bytree = seq(0.1, 1, 0.1),
  subsample = seq(0.5, 1, 0.1), 
  min_child_weight = c(1:10) 
)
paramlist <- apply(param_grid, 1, as.list)

for (i in seq_along(paramlist)) {
  x <- paramlist[[i]] 
  tryCatch(
    {
      set.seed(333) 
      xgb_last <- xgboost(
        data = dtrain_input,
        label = train_label,
        objective = "binary:logistic",
        params = x %>% as.list(), 
        nrounds = 1000,
        early_stopping_rounds = 20,
        eval_metric = "auc"
      )
      imp <- xgb.importance(model = xgb_last) %>%
        as.data.frame() %>%
        dplyr::arrange(desc(Gain))
      headimp <- head(imp, n = 20)
      headimp$Feature
      ## ---------------------------------------------------
      set.seed(333) 
      out <- xgb.cv(
        data = dtrain_input[, colnames(dtrain_input) %in% headimp$Feature] %>% as.matrix(),
        label = train_label,
        params = x %>% as.list(),
        nrounds = 1000,
        showsd = TRUE,
        booster = "gbtree",
        print_every_n = 5,
        early_stopping_rounds = 100,
        metrics = c("auc", "aucpr"),
        verbose = TRUE,
        nfold = 5,
        objective = "binary:logistic"
      )

      result <- out[["evaluation_log"]]
      result$best_nround <- out$best_iteration
      result$params <- out$params %>% list()
      result$params_input <- list(x)
      result$i <- i
      print(x)
      print(paste0("Here is ", i))
      if (max(out$evaluation_log$test_auc_mean) > 0.80) {
        saveRDS(
          result,
          file = paste0("../data/iteronehot_", i, "_tune_param.rds")
        )
        saveRDS(
          out,
          file = paste0("../data/iteronehot_", i, "_out.rds")
        )
      }
    },
    error = function(e) {
      message("Error occurred with parameters: ", x, ". Error message: ", e$message)
      list(error_occurred = TRUE, params = x)
    }
  )
}

2.3 5-fold CV

The best model is selected based on cross-validation results. The parameters of final model as followed.

cvmodel <- readRDS("../data/iteronehot_1587_out.rds")
print(paste0("Best iteration : ",cvmodel$best_iteration))
[1] "Best iteration : 11"
cvparam <- readRDS("../data/iteronehot_1587_tune_param.rds")
# params_input:12
cvparam[[12]][[11]]
$nrounds
[1] 100

$max_depth
[1] 3

$eta
[1] 0.1

$max_delta_step
[1] 8

$alpha
[1] 0

$lambda
[1] 1

$gamma
[1] 5

$scale_pos_weight
[1] 1.2

$colsample_bytree
[1] 0.7

$subsample
[1] 0.6

$min_child_weight
[1] 1

Attention, it has been tested that even with the same seed number, there may still be differences in feature importance output between different versions of the tool. To ensure result reproducibility, it is recommended to run this section using the same version, or alternatively, directly apply the final model provided by us.

library(gridExtra)
library(ggplot2)

CV5_evaluate <- cvmodel[["evaluation_log"]] %>%
  head(50)

## AUROC
# Fit a curve by calculating the starting and ending positions of the bar chart.
fit_y1 <- predict(loess(train_auc_mean ~ iter, data = CV5_evaluate))
fit_y2 <- predict(loess(test_auc_mean ~ iter, data = CV5_evaluate))

df_bar.auroc <- data.frame(
  x = CV5_evaluate$iter,
  y1 = fit_y1,
  y2 = fit_y2,
  bar_y1 = CV5_evaluate$train_auc_std,
  bar_y2 = CV5_evaluate$test_auc_std
)

iterauc_plot <- ggplot(CV5_evaluate, aes(x = iter)) +
  geom_smooth(aes(y = train_auc_mean, color = "Train"), method = "loess", se = FALSE, size = 0.5) +
  geom_smooth(aes(y = test_auc_mean, color = "Test"), method = "loess", se = FALSE, size = 0.5) +
  geom_point(aes(y = train_auc_mean), color = "grey", alpha = 0) +
  geom_point(aes(y = test_auc_mean), color = "grey", alpha = 0) +
  geom_segment(data = df_bar.auroc, aes(x = x, xend = x, y = y1 - bar_y1 / 2, yend = y1 + bar_y1 / 2), color = "grey", size = 0.5) +
  geom_segment(data = df_bar.auroc, aes(x = x, xend = x, y = y2 - bar_y2 / 2, yend = y2 + bar_y2 / 2), color = "grey", size = 0.5) +
  scale_color_manual(values = c("Train" = "#005691", "Test" = "#ff9a3c")) +
  labs(x = "Iterations", y = "Area Under the ROC Curve", color = "Lines") +
  theme_bw() +
  ylim(0.6, 1) +
  #scale_y_continuous(breaks = seq(0, 1, by = 0.2), limits = c(0, 1)) + 
  geom_vline(xintercept = 16, linetype = "dashed", color = "red") 

## AUCPR
fit_y1_pr <- predict(loess(train_aucpr_mean ~ iter, data = CV5_evaluate))
fit_y2_pr <- predict(loess(test_aucpr_mean ~ iter, data = CV5_evaluate))

df_bar.aucpr <- data.frame(
  x = CV5_evaluate$iter,
  y1 = fit_y1,
  y2 = fit_y2,
  bar_y1 = CV5_evaluate$train_aucpr_std,
  bar_y2 = CV5_evaluate$test_aucpr_std
)

iteraucpr_plot <- ggplot(CV5_evaluate, aes(x = iter)) +
  geom_smooth(aes(y = train_aucpr_mean, color = "Train"), method = "loess", se = FALSE, linewidth = 0.5) +
  geom_smooth(aes(y = test_aucpr_mean, color = "Test"), method = "loess", se = FALSE, linewidth = 0.5) +
  geom_point(aes(y = train_aucpr_mean), color = "grey", alpha = 0) +
  geom_point(aes(y = test_aucpr_mean), color = "grey", alpha = 0) +
  geom_segment(data = df_bar.aucpr, aes(x = x, xend = x, y = y1 - bar_y1 / 2, yend = y1 + bar_y1 / 2), color = "grey", size = 0.5) +
  geom_segment(data = df_bar.aucpr, aes(x = x, xend = x, y = y2 - bar_y2 / 2, yend = y2 + bar_y2 / 2), color = "grey", size = 0.5) +
  scale_color_manual(values = c("Train" = "#005691", "Test" = "#ff9a3c")) +
  labs(x = "Iterations", y = "Area Under the Precision-Recall Curve", color = "Lines") +
  theme_bw() +
  ylim(0.6, 1) +
  #scale_y_continuous(breaks = seq(0, 1, by = 0.2), limits = c(0, 1)) + 
  geom_vline(xintercept = 16, linetype = "dashed", color = "red") 

CV results. Red line indicated the best iteration. Grey line indicated the standard deviation.

grid.arrange(iterauc_plot, iteraucpr_plot, ncol = 2)

library(xgboost)

dtrain_input <- readRDS("../data/dtrain_input.rds")
train_label <- readRDS("../data/train_label.rds")

set.seed(333)
xgb_last_all <- xgboost(
  data = dtrain_input,
  label = train_label,
  objective = "binary:logistic",
  params = cvparam[[12]][[11]] %>% as.list(), 
  nrounds = 1000,
  early_stopping_rounds = 20,
  eval_metric = c("auc")
)
saveRDS(xgb_last_all, file = "../data/xgb_last_all.rds")

2.4 Feature selection and Final model

The feature importances in the model are ranked based on gains.

rm(list = ls())

library(tidyverse)
library(xgboost)
library(ggplot2)

xgb_last_all <- readRDS("../data/xgb_last_all.rds")

importance_all <- xgb.importance(model = xgb_last_all) %>%
  as.data.frame() %>%
  dplyr::arrange(desc(Gain)) %>%
  head(n = 20L) %>%
  dplyr::mutate(
    Feature = case_when(
      Feature == "feature84" ~ "CN[>8]",
      Feature == "feature15" ~ "cna_burden",
      Feature == "feature7" ~ "AFP",
      Feature == "feature12" ~ "chr[20]AMP",
      Feature == "feature70" ~ "NC50[>7]",
      Feature == "feature74" ~ "CNCP[3]",
      Feature == "feature81" ~ "CN[2]", 
      Feature == "feature5" ~ "CA19-9",
      Feature == "feature42" ~ "L:LL:9+:BB",
      Feature == "feature60" ~ "E:HH:2:BB",
      Feature == "feature75" ~ "CNCP[>4 & <=8]",
      Feature == "feature52" ~ "E:LL:9+:BB",
      Feature == "feature47" ~ "E:LL:3:AA",
      Feature == "feature59" ~ "E:HH:2:AA",
      Feature == "feature3有" ~ "Cancerous node",
      Feature == "feature53" ~ "E:LL:5-8:BB",
      Feature == "feature73" ~ "CNCP[1]",
      Feature == "feature38" ~ "L:HH:2:AA",
      Feature == "feature83" ~ "CN[>4 & <=8]",
      Feature == "feature16" ~ "SS[>8]",
      .default = Feature
    )
  )
[17:53:02] WARNING: src/learner.cc:553: 
  If you are loading a serialized model (like pickle in Python, RDS in R) generated by
  older XGBoost, please export the model by calling `Booster.save_model` from that version
  first, then load it back in current version. See:

    https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html

  for more details about differences between saving model and serializing.
import_allgg <- ggplot(importance_all, aes(x = Gain, y = reorder(Feature, Gain))) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Feature importance", y = "") +
  theme_bw() +
  theme(
    panel.grid.major = element_blank(), 
    panel.background = element_blank() 
  ) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

import_allgg

Next, we identify optimal number of features in prediction model. For combination of topN features (N range from 2 to 20), we calculated AUC from 50 times repeated five-fold cross-validation.

imp <- xgb.importance(model = xgb_last_all) %>%
  as.data.frame() %>%
  dplyr::arrange(desc(Gain))

rp_auc_df <- data.frame()
for(rp in 1:50){
  set.seed(rp)
  auc_df <- data.frame()
  for(i in 2:20){
    cv_results <- xgb.cv(
      data = dtrain_input[, imp$Feature[1:i]] %>% as.matrix(),
      label = train_label,
      objective = "binary:logistic",
      params = cvparam[[12]][[11]] %>% as.list(),
      nrounds = 100,
      nfold = 5,
      early_stopping_rounds = 20,
      metrics = "auc",
      verbose = TRUE,
      seed = rp
    )
    # Ignore data from the first 5 epoch to minimize error
    averageauc <- mean(tail(cv_results$evaluation_log$test_auc_mean, -5))
    averageaucsd <- mean(tail(cv_results$evaluation_log$test_auc_std, -5))
    
    df <- data.frame(rp=rp,rank=i,auc=averageauc,sd=averageaucsd)
    auc_df <- rbind(auc_df,df)
    print(i)
  }
  rp_auc_df <- rbind(rp_auc_df,auc_df)
  print(rp)
}

save(rp_auc_df,file = "../data/rp_auc_df.rda")

The feature combination with the highest average AUC values in the training dataset

load("../data/rp_auc_df.rda")

ggplot(rp_auc_df, aes(x = factor(rank), y = auc)) + 
  geom_violin(trim = FALSE, fill =  "#3fbac2") +
  geom_boxplot(width = 0.1, position = position_dodge(0.9), color = "#4d606e") +
  labs(title = "Mean AUC of 5 fold CV(50 times) for different fatures",
       x = "Rank",
       y = "RO AUC") +
  theme_bw()+
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

When N = 8, the model has the best average auc and sd. At this point, we get the final model.

set.seed(123)

xgb_last <- xgboost(
  data = dtrain_input[, imp$Feature[1:8]] %>% as.matrix(),
  label = train_label,
  objective = "binary:logistic",
  params = cvparam[[12]][[11]] %>% as.list(),
  nrounds = 1000,
  early_stopping_rounds = 100,
  eval_metric = c("auc","aucpr")
)
saveRDS(xgb_last, file = "../data/xgb_last.rds")

2.5 Distribution of important features

Distribution of important features in different datasets.

library(ggpubr)
library(patchwork)
library(ggsci)

feature <- c("CN[>8]","E:LL:9+:BB","L:LL:9+:BB","batch","lable")

train <- read.csv("../data/Train.csv",check.names = F) %>%
  select(all_of(feature))
train$batch <- "Train"

test <- read.csv("../data/all_test.csv",check.names = F) %>%
  select(all_of(feature))

df <- rbind(train,test)

df$lable <- ifelse(df$lable==1,"response","Non response")
df$batch <- factor(df$batch,levels = c("Train","Test1","Test2","Test3"))

plots <- list()
for (col in names(df)[1:3]) {
p<-ggplot(df,aes_string(x = "batch", y =  paste0("`", col, "`"), color = "lable"))+
  geom_boxplot(aes(fill=lable),
               alpha=0.1,outlier.shape = NA)+
  geom_jitter(position = position_jitterdodge(jitter.height=0.3,
                                              jitter.width = 0.3,
                                              dodge.width = 0.5),
    size = 0.4)+ 
  scale_color_manual(values = c("#1e56a0","#f95959"))+
  scale_fill_manual(values = c("#1e56a0","#f95959"))+
  theme_bw()+
  theme(panel.grid = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1))+
  stat_compare_means(aes(group = lable), 
                     method ="wilcox.test",
                              label="p.signif",
                              show.legend = F)+
  xlab("")

plots[[col]] <- p
}

combined_plot <- wrap_plots(plots, ncol = 3)+
  plot_layout(guides = "collect") & theme(legend.position = "top") 

combined_plot

PART 3: Model evaluation

Final Model Evaluation

The performance of CNA fingerprint on three independent test sets.

AUC,PRAUC

Defining figure plotting functions.

library(precrec)
library(dplyr)
library(ggplot2)
library(ggpubr)
library(pROC)

preform <- function(score,i){
  score <- score[which(!is.na(score[,i])),]
  score$lable <- factor(score$lable,levels = c(0,1))
  pre_obj1 <- mmdata(score[,i], score$lable,posclass = 1)
  pre_obj1 <- evalmod(pre_obj1)
  auc <- precrec::auc(pre_obj1)
  pre1_df <- fortify(pre_obj1)
  #if(i=="prediction"){i="OXACNAscore"}
  if(i=="Aneuploidy_score"){i="Aneuploidy Score"}
  #if(i=="ploidy"){i="WGD"}
  pre1_df$Dataset <- i
  pre1_df$auc <- round(auc$aucs[1],2)
  pre1_df$prc <- round(auc$aucs[2],2)
  return(pre1_df)
}

jyplot <- function(list,type){
  num <- data.frame(matrix(ncol = 3, nrow = length(list)))
  for (i in 1:length(list)) {
    num[i,1] <- list[[i]]$Dataset[1]
    num[i,2] <- list[[i]]$auc[1]
    num[i,3] <- list[[i]]$prc[1]
  }
  #
  performance_df <- Reduce(rbind, list)
  cols <- c("CNAfingerprint" = "#ea5455",
            "HRDCNAScore" = "#3fbac2",
            "scarHRD" = "#cabbe9",
            "Aneuploidy Score"="#769fcd")
  if(type=="auc"){
    roc <- performance_df[performance_df$curvetype == "ROC",]
    p <- ggplot(roc, aes(x=x, y=y, group = Dataset)) +
      theme_bw() +
      geom_line(aes(color = Dataset),linewidth=1.2) +
      xlab("1-Specificity") +
      ylab("Sensitivity") +
      theme(plot.title = element_text(hjust = 0.5),
            panel.grid=element_blank(),
            # line = element_line(color = "black", linewidth = 1.5,
            #                     linetype = 1, lineend = "butt"),
            axis.text.x  =  element_text(size=14,color = "black"),
            axis.text.y = element_text(size=14,color = "black"),
            axis.line = element_line(colour="black"),
            legend.position = "none",
            title=element_text(size=14),
      ) +
      scale_color_manual(values =  cols)+
      guides(color=guide_legend(title="Dataset"))+
      annotate("segment", x = 0, y = 0,xend = 1, yend = 1,
               color = "gray", size = 0.5, linetype = "dashed")+
      annotate("text",x = .6, y = .26,size=3, colour="#ea5455",
               label=paste(num[1,1],"RO AUC =",num[1,2])) +
      annotate("text",x = .6, y = .19,size=3, colour="#3fbac2",
               label=paste(num[2,1],"RO AUC =",num[2,2]))+
      annotate("text",x = .6, y = .12,size=3, colour="#cabbe9",
               label=paste(num[3,1],"RO AUC =",num[3,2]))+
      annotate("text",x = .6, y = .05,size=3, colour="#769fcd",
               label=paste(num[4,1],"RO AUC =",num[4,2]))
    
  }else if(type=="prc"){
    
    prc <- performance_df[performance_df$curvetype == "PRC",]
    
    p <- ggplot(prc, aes(x=x, y=y, group = Dataset)) +
      theme_bw() +
      geom_line(aes(color = Dataset),linewidth=1)+
      xlab("Recall") +
      ylab("Precision") +
      theme(plot.title = element_text(hjust = 0.5),
            panel.grid=element_blank(),
            # line = element_line(color = "black", linewidth = 1.5,
            #                     linetype = 1, lineend = "butt"),
            axis.text.x  =  element_text(size=14,color = "black"),
            axis.text.y = element_text(size=14,color = "black"),
            axis.line = element_line(colour="black"),
            legend.position = "none",
            title=element_text(size=14),
            legend.text = element_text(size=14),
            legend.title = element_text(size = 14))+
      coord_cartesian(xlim = c(0,1), ylim = c(0,1))+
      scale_color_manual(values =  cols)+
      annotate("segment", x = 0, y = 1,xend = 1, yend = 0, 
               color = "gray", size = 0.5, linetype = "dashed")+
      annotate("text",x = .4, y = .26,size=3, colour="#ea5455",
               label=paste(num[1,1],"PR AUC =",num[1,3])) +
      annotate("text",x = .4, y = .19,size=3, colour="#3fbac2",
               label=paste(num[2,1],"PR AUC =",num[2,3]))+
      annotate("text",x = .4, y = .12,size=3, colour="#cabbe9",
               label=paste(num[3,1],"PR AUC =",num[3,3]))+
      annotate("text",x = .4, y = .05,size=3, colour="#769fcd",
               label=paste(num[4,1],"PR AUC =",num[4,3]))
  }
  
  return(p)
}

score_turn <- function(data,type){
  if(type == "soft"){
    #data$ploidy <- max(data$ploidy)-data$ploidy
    data$HRDCNAScore <- 1-data$HRDCNAScore
    data$scarHRD <- max(data$scarHRD)-data$scarHRD
    data$Aneuploidy_score <- (max(data$Aneuploidy_score)-data$Aneuploidy_score)/39
  }
  if(type== "hard"){
    data$CNAfingerprint <- ifelse(data$CNAfingerprint>0.33,1,0)
    #data$ploidy <- ifelse(data$ploidy>=4,0,1)
    data$HRDCNAScore <- ifelse(data$HRDCNAScore>0.2,0,1)
    data$scarHRD <- (data$scarHRD- min(data$scarHRD))/(max(data$scarHRD)-min(data$scarHRD))
    data$scarHRD <- 1-data$scarHRD
    scarcut <- quantile(data$scarHRD,probs=0.75)
    data$scarHRD <- ifelse(data$scarHRD > scarcut,1,0)
    #
    data$Aneuploidy_score <- data$Aneuploidy_score/39
    as_cut <- quantile(data$Aneuploidy_score,probs=0.75)
    data$Aneuploidy_score <- ifelse(data$Aneuploidy_score >as_cut,0,1)
    data$`Tumor site` <- ifelse(data$`Tumor site`== 'Right side',0,1)
  }
  return(data)
}

Take the example of calculating the RO AUC and PR AUC for Test1 cohort.

data <- read.csv("../data/all_test.csv",check.names = F)

#
data <- data[which(data$batch=="Test1"),]

data_auc <- score_turn(data,"soft")

preform_list <- list()
for(i in c("CNAfingerprint","HRDCNAScore","scarHRD","Aneuploidy_score")){
  a <- preform(data_auc,i)
  preform_list <- c(preform_list, list(a))
}

#auc prc
library(patchwork)
p <- jyplot(preform_list,"auc")
p2 <- jyplot(preform_list,"prc")
p + p2

The ROAUC and PRAUC for all test cohorts.

for no-oxaliplatin cohort,we also get the AUC.

noOxa_cohort <- readRDS("../data/noOxa_cohort.rds")
data <- na.omit(noOxa_cohort)
data$lable <- ifelse(data$lable=="PD",0,1)

plot.roc(
  data$lable,
  data$CNAfingerprint,
  col = "#3d84a8",
  percent = TRUE,
  lwd = 2.5,
  print.auc = TRUE,
  print.auc.cex = 0.7,
  print.auc.pattern = paste0("CNA fingerprint",": %.1f%%"),
  print.auc.y = 35
)

PART 4: Biomarker comparison

The performance of CNA fingerprint in oxaliplatin response prediction was compared with: –HRD status –KRAS mutation –aneuploidy –primary tumor location.

4.1 HRD status and aneuploidy

We use R package HRDCNA(Using the threshold values 0.2 recommended by the tool guideline. see 1.2 CNA features calling, function CNF_call) and scarHRD to assess the HRD status score;

# library(devtools)
# install_bitbucket('sequenza_tools/sequenza')
# install_github('sztup/scarHRD',build_vignettes = TRUE)
# Note that this R package installation is detailed at https://github.com/sztup/scarHRD

library(scarHRD)

#input format
#SampleID   Chromosome  Start_position  End_position    total_cn    A_cn    B_cn    ploidy
#SamplePatient1       chr1          14574       952448        5    0    5

exampleSeg <- readRDS("../data/exampleSeg.rds")

# Assuming a ploidy of 5
ploidy=5
data <- data.frame(exampleSeg[,c(5,1:4)],
                   A_cn= exampleSeg$segVal,
                   B_cn = 0,
                   ploidy = ploidy)

colnames(data) <- c("SampleID","Chromosome","Start_position","End_position",
                    "total_cn","A_cn","B_cn","ploidy")

data %>%
  group_by(SampleID) %>%
  group_split() %>% 
  walk(~ {
    group_name <- unique(.x$SampleID)
    # save as csv
    write.table(.x, file = paste0("../data/scarHRD/",group_name, ".txt"), 
                row.names = FALSE,quote = F,sep = "\t")
  })

csv_files <- list.files(path = "../data/scarHRD/", 
                        pattern = "\\.txt$", full.names = TRUE)
score <- data.frame()
for(file in csv_files){
  sample <- gsub(".txt","",basename(file))
  scar_train <- scar_score(file,reference = "grch38", seqz=FALSE)
  df <- data.frame(scar_train,sample=sample)
  score <- rbind(score,df)
}


# "sample"  "HRD"   "Telomeric AI"  "LST"   "HRD-sum"
# "sample"  0   13  4   17

R package AneuploidyScore was used to calculate the aneuploidy score (total number of arm level gains/losses).

Note that this R package only supports hg19, so we first need to perform a genome version conversion

# devtools::install_github("quevedor2/aneuploidy_score", ref='master')
library(GenomicRanges)
library(liftOver)
library(AneuploidyScore)

cytoarm <- cytobandToArm(ucsc.hg19.cytoband)
ch <- import.chain('../data/hg38ToHg19.over.chain')

exampleSeg <- readRDS("../data/exampleSeg.rds")
# Assuming a ploidy of 5
ploidy=5

out <- data.frame()

for (i in unique(exampleSeg$sample)) {
  segdata <- exampleSeg[exampleSeg$sample==i,]
  #ploidy=pp[which(pp$sample==i),"polidy"]
  ploidy=5
  seg <- GRanges(
    #seqnames = paste0("chr",segdata$Chromosome),
    seqnames = segdata$chromosome,
    ranges = IRanges(start = segdata$start,end = segdata$end),
    strand = "*",
    TCN = segdata$segVal
  )
  hg19.seg <- liftOver(seg, ch)
  seg_caa <- getCAA(hg19.seg, cytoarm, tcn_col="TCN", classifyCN=TRUE,
                    ploidy=ploidy, threshold=0.5)
  caa <- reduceArms(seg_caa, caa_method=c('arm', 'seg'),
                    arm_ids = c('p', 'q'))
  
  result <- data.frame(sample=i,
                       t(caa[,1]),
                       aneuploidy_score = colSums(abs(caa), na.rm = TRUE))
  out <- rbind(out,result[1,])
}
#Removal of sex chromosomes
aneuploidy_score <- rowSums(abs(out[,2:45]),na.rm = TRUE)
aneuploidy_score <- data.frame(sample=out$sample,AS=aneuploidy_score)

4.2 Comparison and application

library(caret)

all_test <- read.csv("../data/all_test.csv",check.names = F)
# from change
score_turn <- function(data,type){
  if(type == "soft"){
    data$HRDCNAScore <- 1-data$HRDCNAScore
    data$scarHRD <- max(data$scarHRD)-data$scarHRD
    data$Aneuploidy_score <- (max(data$Aneuploidy_score)-data$Aneuploidy_score)/39
  }
  if(type== "hard"){
    data$CNAfingerprint <- ifelse(data$CNAfingerprint>0.33,1,0)
    data$HRDCNAScore <- ifelse(data$HRDCNAScore>0.2,0,1)
    data$scarHRD <- (data$scarHRD- min(data$scarHRD))/(max(data$scarHRD)-min(data$scarHRD))
    data$scarHRD <- 1-data$scarHRD
    scarcut <- quantile(data$scarHRD,probs=0.75)
    data$scarHRD <- ifelse(data$scarHRD > scarcut,1,0)
    #
    data$Aneuploidy_score <- data$Aneuploidy_score/39
    as_cut <- quantile(data$Aneuploidy_score,probs=0.75)
    data$Aneuploidy_score <- ifelse(data$Aneuploidy_score > as_cut,0,1)
    data$`Tumor site` <- ifelse(data$`Tumor site`== 'Right side',0,1)
  }
  return(data)
}


score <- c("CNAfingerprint","HRDCNAScore","scarHRD","Aneuploidy_score","Tumor site")

table <- data.frame(matrix(nrow = 12))

for (set in unique(all_test$batch)) {
  data <- all_test[which(all_test$batch==set),]
  data <- score_turn(data,type="hard")
  for (i in score) {
    df <- data[,c("lable",i)]
    colnames(df)[2] <- "score"
    cm <- caret::confusionMatrix(data=factor(df$score,levels = c(0,1)),
                         reference=factor(df$lable,levels = c(0,1)),
                         positive="1")
    
    list <- as.data.frame(cm$byClass)
    
    list["Accuracy",1] <- cm$overall[1]
    colnames(list)[1] <- paste0(set,"_",i)
    
    table <- cbind(table,list)
  }
}

table <- table[,-1]
saveRDS(table,file = "../data/allscore_CMtable.rds")

Heat maps

library(pheatmap)
allscore_CMtable <- readRDS("../data/allscore_CMtable.rds")

data <- t(allscore_CMtable[c(12,1,2,5:7),])

pheatmap(data,
         cluster_rows=F,
         cluster_cols = F,
         angle_col = "45",
         color = colorRampPalette(c("#3e92a3", "white", "#ff5335"))(100),
         border_color = "white",
         number_format = "%.2f",
         display_numbers = round(data, 2),
         legend=T
         )

4.3 Compare to KRAS mutation

Limited by sample size, we will observe the predictive effect of KRAS mutations on chemotherapy response in all samples annotated with KRAS mutations in the 271 sample.

A total of 95 patients had KRAS mutations and 117 patients did not detect any KRAS,BRAF or NRAS mutations

data <- read.csv("../data/mut.csv")
data$type <- ifelse(data$Response.to.chemotherapy=="PD",0,1)
data$mut <- ifelse(data$Hotspot.mutation=="WT",1,0)

cm <- caret::confusionMatrix(data=factor(data$type),
                             reference=factor(data$mut),positive="1")

print(paste0("Accuracy of KRASmutation is ",cm$overall[1]))
[1] "Accuracy of KRASmutation is 0.518867924528302"
#
cm$byClass
         Sensitivity          Specificity       Pos Pred Value 
           0.5128205            0.5263158            0.5714286 
      Neg Pred Value            Precision               Recall 
           0.4672897            0.5714286            0.5128205 
                  F1           Prevalence       Detection Rate 
           0.5405405            0.5518868            0.2830189 
Detection Prevalence    Balanced Accuracy 
           0.4952830            0.5195682 

PART 5: Overall survival

Exploring the Predictive Performance of CNA fingerprint for Patient Overall Survival.

Test3 cohort doesn’t have enough information for overall survival.

library(survival)
library(survminer)
library(tidyr)

Overall Survival of test1 and test2 cohort

library(dplyr)

test_all <- read.csv("../data/all_test.csv",check.names = F)

data <- test_all[,-1] %>%
  filter(batch=="Test1") %>%
  mutate(prediction = factor(ifelse(CNAfingerprint > 0.33,"High CNAfingerprint","Low CNAfingerprint"),
                             levels = c("Low CNAfingerprint","High CNAfingerprint")),
         OS = OS/30)

fit <- survfit(Surv(`OS`, `OS_event`) ~ prediction, data =na.omit(data[,c(21:23)]))

ggsurvplot(
  fit,
  data = data[,c(21:23)],
  size = 1,
  conf.int = TRUE,
  pval = TRUE,
  pval.method = TRUE,
  risk.table = TRUE,
  palette =c("#3d6cb9", "#bc5148"),
  risk.table.height = 0.25,
  ggtheme = theme_classic2(),
  xlab = "Months"
)

data <- test_all[,-1] %>%
  filter(batch=="Test2") %>%
  mutate(prediction = factor(ifelse(CNAfingerprint > 0.33,"High CNAfingerprint","Low CNAfingerprint"),
                             levels = c("Low CNAfingerprint","High CNAfingerprint")),
         OS = OS/30)

fit <- survfit(Surv(`OS`, `OS_event`) ~ prediction, data =na.omit(data[,c(21:23)]))

ggsurvplot(
  fit,
  data = data[,c(21:23)],
  size = 1,
  conf.int = TRUE,
  pval = TRUE,
  pval.method = TRUE,
  risk.table = TRUE,
  palette =c("#3d6cb9", "#bc5148"),
  risk.table.height = 0.25,
  ggtheme = theme_classic2(),
  xlab = "Months"
)

Overall Survival of no-oxaliplatin cohort

noOxa_cohort <- readRDS("../data/noOxa_cohort.rds")
data <- na.omit(noOxa_cohort)
data$group <- factor(ifelse(data$CNAfingerprint > 0.33,"High CNAfingerprint","Low CNAfingerprint"),
                     levels = c("Low CNAfingerprint","High CNAfingerprint"))

fit <- survfit(Surv(`OS`, `OS_event`) ~ group, data =data)


ggsurvplot(
  fit,
  data = data,
  size = 1,
  conf.int = TRUE,
  pval = TRUE,
  pval.method = TRUE,
  risk.table = TRUE,
  palette =c("#3d6cb9", "#bc5148"),
  risk.table.height = 0.25,
  ggtheme = theme_classic2(),
  xlab = "Months"
)

Acknowledgments

We thank Fudan University Shanghai Cancer Center and Tongji hospital of Tongji University for sample collection;

We thank ShanghaiTech University High Performance Computing Public Service Platform for computing services. We thank multi-omics facility, molecular and cell biology core facility of ShanghaiTech University for technical help.

LICENSE

If you want to reuse the code in this report, please note the license below should be followed.

The code is made available for non commercial research purposes only under the MIT. However, notwithstanding any provision of the MIT License, the software currently may not be used for commercial purposes without explicit written permission after contacting Jinyu Wang or Tiyu Tao or Xue-Song Liu .