Copy number alteration fingerprint predicts the clinical response of oxaliplatin-based chemotherapy in metastatic colorectal cancer

EQUAL CONTRIBUTION

J.W., J.W., Z.T. and T.W. contributed equally to this work as joint first authors.

X.L., X.L. and X.S.L contributed equally to this work as joint senior authors.

Introduction

Oxaliplatin-based chemotherapy is routinely used in metastatic colorectal cancer (mCRC) patient, however only about half of patients show clinical response after therapy. Accurate biomarkers for predicting oxaliplatin chemotherapy clinical response are still lacking. Here we developed and validated a tumor genomic DNA copy number alteration (CNA) based biomarker.

In total 271 mCRC patients who underwent primary tumor resection and postoperatively treated with oxaliplatin-based chemotherapy are included in this study. Tissue samples and their clinical data were obtained from the Fudan University Shanghai Cancer Center (FUSCC) and Tongji Hospital of Tongji University. The sample selection criteria are detailed here.

PART 1: Data preprocessing

To develop a stable, effective, and cost-efficient predictive biomarker for oxaliplatin-based chemotherapy response in metastatic colorectal cancer patients, we utilized shallow whole-genome sequencing(sWGS) data to extract copy number alteration(CNA) features. These features, combined with clinical characteristics, were used to construct a machine learning model, ultimately resulting in the development of a highly predictive model known as the CNA fingerprint.

1.1 Raw data processing

suppose i is the sample name. Using follow tools:

fastp

Quality control.

fastp -i $workdir/$fq1 -I $workdir/$fq2 \
-o ${i}_fp_1.fq.gz -O  ${i}_fp_2.fq.gz 
-h ${i}_fastp.html \
-w 10

BWA MEM

Mapping.

bwa mem -M -R "@RG\tID:$i\tLM:$i\tSM:$i\tPL:illumina\tPU:$i" 
-t 8 \
/path_to_hg38_genome/GRCh38.d1.vd1.fa \
$workdir/${i}_fp_1.fq.gz \
$workdir/${i}_fp_2.fq.gz > $workdir/${i}.sam

Samtools

Transform sam to bam file.

samtools view -@ 8 -bS $workdir/${i}.sam > $workdir/${i}_unsort.bam

Picard

Sort the bam file.

java -jar -Xmx12G -Djava.io.tmpdir=/path_to_tmp/tmp \
/path_to_picard-2.20.6-0/picard.jar SortSam \
I=$workdir/${i}_unsort.bam \
O=$workdir/${i}_sorted.bam \
SORT_ORDER=coordinate

1.2 CNA features calling

suppose you have got your bam file. Using follow tools:

QDNAseq

CBC segment,the defult genome is hg19,here we use hg38,and binsize as 100.

library(QDNAseq)
library(QDNAseq.hg38)
library(stringr)
library(ACE)
library(DNAcopy)

args <- commandArgs(trailingOnly = TRUE)
# args 1，sample name
# args 2，path to bam
# args 3，the output path

i <- args[1]
s <- paste0(i, "_sorted.bam")
# binsize: 100
bins <- getBinAnnotations(binSize = 100, genome = "hg38")
readCounts <- binReadCounts(bins, bamfile = paste0(args[2], "/", s))
readCountsFiltered <- applyFilters(readCounts, residual = TRUE, blacklist = TRUE)
readCountsFiltered <- estimateCorrection(readCountsFiltered)
copyNumbers <- correctBins(readCountsFiltered)
copyNumbersNormalized <- normalizeBins(copyNumbers)
copyNumbersSmooth <- smoothOutlierBins(copyNumbersNormalized)
copyNumbersSegmented <- segmentBins(copyNumbersSmooth, transformFun = "log2")
copyNumbersSegmented <- normalizeSegmentedBins(copyNumbersSegmented)

save(copyNumbersSegmented, file = "/your_path/copyNumbersSegmented.rda")

you can run like:

Rscript QDNAseq.R $sampleid $path/yourbamfile $outdir

ACE

Fit the cellularity and ploidy

sqm <- squaremodel(copyNumbersSegmented, QDNAseqobjectsample = 1, 
                   penalty = 0.65, penploidy = 0.65, 
                   ptop = 5.3, pbottom = 1.8, prows = 250,
                   cellularities = seq(50, 100, length.out = 50),
                   sgc = c("X", "Y"),exclude=c())

# Here we get the purity ploidy and error values of all the fits, referring to the official ACE document that states “the minimum error is not necessarily the best fit”, it is recommended to select combinations with higher cellular purity in the similar minimum error set, and at the same time select combinations with higher ploidy based on the a priori knowledge of the genomes of colon cancer.

mdf <- sqm$minimadf
mdf <- mdf[order(mdf$error,-mdf$cellularity),]

# Manual screening and determination ploidy and cellularity,for example:

ploidy = 7.6
cellularity = 0.8


# Absoult CNA calling
m<-ACEcall(copyNumbersSegmented, QDNAseqobjectsample = TRUE, scellularity = cellularity,
           ploidy = ploidy, standard =1, plot = FALSE,
           qcutoff = -3, subcutoff = 0.2, trncname = FALSE,
           cap = 12, bottom = 0, 
           onlyautosomes = TRUE
           )

segment <- data.frame(copyNumbersSegmented@featureData@data[1:nrow(m),1:3],segment_mean=m$segment_mean)
segment <- na.omit(segment)
segment$segment_mean <- round(segment$segment_mean)

Sigminer

CN features calling

here we got the CNA segment files,first we do a filter,delect all segments with num_marks <10,make the file colnames as: chr,start,end segval,sample

We extracted CNA features described in our previous studies(2021 Wang et.al, 2023 Tao et.al) and also global CNA status indicator: CNA burden (proportion of CNA in the genome18); Chromosomal amplification (AMP) or deletion (DEL) ratio; Chromosomal AMP/DEL level (diploid-referenced change magnitude).

data <- readRDS("../data/exampleSeg.rds")

CNF_call <- function(data, note = NULL) {
  library(sigminer)
  CN <- data[, c("chromosome", "start", "end", "segVal", "sample")]

  # Wang et al.
  cn <- read_copynumber(CN,
    seg_cols = c("chromosome", "start", "end", "segVal"),
    samp_col = "sample",
    genome_build = "hg38", complement = FALSE
  )
  #
  cn_sum <- cn@summary.per.sample
  rownames(cn_sum) <- cn_sum$sample
  tally_w <- sig_tally(cn, method = "W", cores = 10)
  if (!is.null(note)) {
    saveRDS(tally_w, file = paste0("./data/results/", note, "_tally_W.rds"))
  }

  cn_fet <- tally_w[["nmf_matrix"]]
  feature_m <- data.frame(cn_fet)
  colnames(feature_m) <- colnames(cn_fet)
  feature_m$sample <- rownames(cn_fet)
  # if only one sample,rownames(cn_fet)will be null
  colon_cnf <- merge(cn_sum, feature_m, by = "sample", all = T)

  # Yao et al.
  # HRD(cutoff 0.2)
  library(HRDCNA)
  score_yhz <- HRDprediction(data = feature_m)
  colon_cnf <- merge(colon_cnf, score_yhz, by = "sample", all = T)

  # Tao et al.
  tally_X_noLOH <- sig_tally(cn,
    method = "X",
    add_loh = FALSE,
    cores = 10
  )
  if (!is.null(note)) {
    saveRDS(tally_X_noLOH, file = paste0("./data/results/", note, "_tally_X_noLOH.rds"))
  }

  cn_fet <- tally_X_noLOH[["all_matrices"]][["simplified_matrix"]]
  feature_m <- data.frame(cn_fet)
  colnames(feature_m) <- colnames(cn_fet)
  feature_m$sample <- rownames(cn_fet)

  colon_cnf <- merge(colon_cnf, feature_m, by = "sample", all = T)

  # global CNA status(amp and del states of every chr)
  data$chromosome <- ifelse(startsWith(data$chromosome, "chr"),
    data$chromosome,
    paste0("chr", data$chromosome)
  )
  gCNA <- data.frame(matrix(nrow = 0, ncol = 88))
  coln <- c()
  for (i in 1:22) {
    coln <- c(
      coln, paste0("chr[", i, "]AMPratio"),
      paste0("chr[", i, "]AMPlevel"),
      paste0("chr[", i, "]DELratio"),
      paste0("chr[", i, "]DELlevel")
    )
  }
  chrname <- paste0("chr", 1:22)

  for (i in unique(data$sample)) {
    df <- data[data$sample == i, ]
    onechr <- c(i)
    for (j in chrname) {
      df1 <- df[df$chromosome == j, ]
      totallength <- sum(df1$end - df1$start)
      if (any(df1$segVal > 2)) {
        al <- df1[which(df1$segVal > 2), 3] - df1[which(df1$segVal > 2), 2]
        ampr <- sum(al) / totallength
        ampl <- sum(al * df1[which(df1$segVal > 2), 4]) / sum(al * 2)
      } else {
        ampr <- 0
        ampl <- 0
      }

      if (any(df1$segVal < 2)) {
        dl <- df1[which(df1$segVal < 2), 3] - df1[which(df1$segVal < 2), 2]
        delr <- sum(dl) / totallength
        dell <- sum(dl * df1[which(df1$segVal < 2), 4]) / sum(dl * 2)
      } else {
        delr <- 0
        dell <- 0
      }
      onechr <- c(
        onechr,
        round(ampr, 4),
        round(ampl, 4),
        round(delr, 4),
        round(dell, 4)
      )
    }
    onechr <- data.frame(t(onechr))
    colnames(onechr) <- c("sample", coln)

    gCNA <- rbind(gCNA, onechr)
  }
  colon_cnf <- merge(colon_cnf, gCNA, by = "sample")

  colon_cnf[, 2:312] <- lapply(colon_cnf[, 2:312], as.numeric)

  return(colon_cnf)
}
features <- CNF_call(data, note = NULL)

saveRDS(features, file = "../data/examplefeatures.rds")

1.3 Feature distribution

TCGA data as an example, the distribution(W method and X method) of its features is demonstrated by function show_catalogue.

library(sigminer)
ace_tally_W <- readRDS("../data/ace_tally_W.rds")
tcga_tally_W <- readRDS("../data/tcga_tally_W.rds")
show_catalogue(t(ace_tally_W[, -1]), style = "cosmic", mode = "copynumber", x_label_angle = 90)
show_catalogue(t(tcga_tally_W[, -1]), style = "cosmic", mode = "copynumber", x_label_angle = 90)

ace_tally_X <- readRDS("../data/ace_tally_X.rds")
tcga_tally_X <- readRDS("../data/tcga_tally_X.rds")
show_catalogue(t(ace_tally_X[, -1]), style = "cosmic", mode = "copynumber", method = "T", x_label_angle = 90)
show_catalogue(t(tcga_tally_X[, -1]), style = "cosmic", mode = "copynumber", method = "T", x_label_angle = 90)

The CNA feature distribution in TCGA mCRCs.

The CNA feature distribution( W method ) in Train cohort.

The CNA feature distribution( X method ) in Train cohort.

The CNA feature distribution in Test1,Test2 and Test3 cohort can be presented in the same way.

PART 2: Model training

2.1 Internal benchmark

We performed internal benchmark to compare the performance of different CNA feature sets and 5 different machine learning models (1) extreme gradient boosting (XGBoost), (2) neural networks (NNET), (3) random forest (RF), (4) Naive bayes, (5) light gradient boosting machine (LGBM) in the training dataset.

The CNA information we compare here is the CNA signature reported by Drew et.al

Example calling process of CNA signature reported by Drews et.al.

# devtools::install_github("markowetzlab/CINSignatureQuantification", build_vignettes = TRUE, dependencies = TRUE)
# input format
# chromosome    start   end segVal  sample
# 1 61735   249224388   3.1 TCGA-BT-A20P

library(tidyverse)
library(CINSignatureQuantification)

data <- readRDS("../data/exampleSeg.rds")

mySigs <- quantifyCNSignatures(swgs_segment.ace_train)
D_sig <- mySigs@activities[["rawAct0"]]

# Low segment counts: Features not computed for 8 of 121 samples - see `getSampleFeatures()`

All feature-model combinations were feature-selected and trained and then banchmarked by five-fold cross-validation. here we only show the training process of XGBoost in next section.

2.2 Model Training

We train the model with the help of mlr3. Create our own task:

library(mlr3verse)
library(ggplot2)

train_cnf <- readRDS("../data/train_cnf.rds")

data <- train_cnf %>%
  mutate(lable = ifelse(train_cnf$lable == "PD", 0, 1))
colnames(data)[1:(ncol(data) - 1)] <- paste0("feature", 1:(ncol(data) - 1))

# keep the feature names
featurename <- colnames(train_cnf)[1:(ncol(train_cnf) - 1)]
names(featurename) <- paste0("feature", 1:(ncol(data) - 2))
saveRDS(featurename, file = "../data/featurename.rds")

# task
tsk_train <- as_task_classif(data, target = "lable", positive = "1")

we use the property “importance” of XGBoost for perform initial feature selection.

set.seed(2025)
learner <- lrn("classif.xgboost", predict_type = "prob")
filter <- flt("importance", learner = learner)
filter$calculate(tsk_train)
diffeaure <- imp$feature

tsk_train$select(diffeaure)

Then, tsk_train keep 90 features, we use a grid search to determine the parameter: “nrounds”.

set.seed(2025)
learner <- lrn("classif.xgboost",
  predict_type = "prob", eta = 0.1,
  nrounds = to_tune(80, 250),
  nthread = 20
)

instance <- tune(
  tuner = mlr3tuning::tnr("grid_search", resolution = 170, batch_size = 10),
  task = tsk_train,
  learner = learner,
  resampling = rsmp("cv", folds = 5),
  measures = msr("classif.auc"),
  terminator = trm("stagnation", iters = 200, threshold = 0.05)
)

# results get nrounds = 132
instance$result_learner_param_vals

# other parameters
set.seed(2025)
learner <- lrn("classif.xgboost",
  predict_type = "prob", nrounds = 132,
  nthread = 20
)

search_space <- ps(
  eta = p_dbl(lower = .001, upper = .2),
  max_depth = p_int(lower = 1, upper = 10),
  # max_delta_step = p_int(lower = 1,upper = 10),
  alpha = p_dbl(lower = 0, upper = 0.5),
  lambda = p_dbl(lower = 0, upper = 1),
  gamma = p_int(lower = 0, upper = 6),
  min_child_weight = p_int(lower = 1, upper = 10),
  subsample = p_dbl(lower = .8, upper = 1)
  # colsample_bytree = p_dbl( lower = .5, upper = 1)
  # colsample_bylevel = p_dbl(lower = .5, upper = 1)
)

instance <- tune(
  tuner = mlr3tuning::tnr("random_search", batch_size = 10),
  task = tsk_train,
  learner = learner,
  search_space = search_space,
  resampling = rsmp("cv", folds = 5),
  measures = msr("classif.auc"),
  terminator = trm("stagnation", iters = 200, threshold = 0.01)
)

# Model parameter update
set.seed(2025)
learner$param_set$values <- instance$result_learner_param_vals

For other parameter mediation, given the large number of combinations, we use “random_search”.you can also use”grid_search”,but it would be advisable to first conduct parameter searches for a subset, then utilize the optimized subset to search for other parameters, rather than running all parameters simultaneously.

After we update the parameters,the model also performs feature selection,

importance <- as.data.table(learner$importance(), keep.rownames = TRUE)
colnames(importance) <- c("Feature", "Importance")
importance$Featurename <- featurename[importance$Feature]

Next, we adjust again based on these 7 features, the adjustment process is the same as above.

diffeaure <- importance$Feature
saveRDS(diffeaure, file = "../data/xgb_feature7.rds")
tsk_train$select(diffeaure)

set.seed(2025)
learner <- lrn("classif.xgboost",
  predict_type = "prob", eta = 0.1,
  nrounds = to_tune(80, 250),
  nthread = 20
)

instance <- tune(
  tuner = mlr3tuning::tnr("grid_search", resolution = 170, batch_size = 10),
  task = tsk_train,
  learner = learner,
  resampling = rsmp("cv", folds = 5),
  measures = msr("classif.auc"),
  terminator = trm("stagnation", iters = 200, threshold = 0.05)
)

# results get nrounds = 83
instance$result_learner_param_vals

# other parameters
set.seed(2025)
learner <- lrn("classif.xgboost",
  predict_type = "prob", nrounds = 83,
  nthread = 20
)

search_space <- ps(
  eta = p_dbl(lower = .001, upper = .2),
  max_depth = p_int(lower = 1, upper = 10),
  # max_delta_step = p_int(lower = 1,upper = 10),
  alpha = p_dbl(lower = 0, upper = 0.5),
  lambda = p_dbl(lower = 0, upper = 1),
  gamma = p_int(lower = 0, upper = 6),
  min_child_weight = p_int(lower = 1, upper = 10),
  subsample = p_dbl(lower = .8, upper = 1)
  # colsample_bytree = p_dbl( lower = .5, upper = 1)
  # colsample_bylevel = p_dbl(lower = .5, upper = 1)
)

instance <- tune(
  tuner = mlr3tuning::tnr("random_search", batch_size = 10),
  task = tsk_train,
  learner = learner,
  search_space = search_space,
  resampling = rsmp("cv", folds = 5),
  measures = msr("classif.auc"),
  terminator = trm("stagnation", iters = 200, threshold = 0.01)
)

2.4 Final model

Update the model again to get the final model.

# Model parameter update
set.seed(2025)
learner$param_set$values <- instance$result_learner_param_vals

learner$train(tsk_train)

saveRDS(learner, file = "../data/xgboostle.rds")

AUC and PR AUC of the model on the training set.

trainres <- learner$predict(tsk_train)

auc_value <- trainres$score(msr("classif.auc"))
autoplot(trainres, type = "roc") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
  geom_text(aes(x = 0.6, y = 0.2, label = paste("XGBoost AUC = ", round(auc_value, 4))),
    color = "black", size = 4
  ) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    panel.border = element_rect(color = "black", fill = NA, size = 1),
    plot.title = element_text(size = 10, face = "bold"),
    axis.title = element_text(size = 10),
    axis.text = element_text(size = 10),
    legend.position = "bottom"
  ) +
  labs(title = "XGBoost ROC Curve")


auc_value <- trainres$score(msr("classif.prauc"))
autoplot(trainres, type = "prc") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
  geom_text(aes(x = 0.6, y = 0.2, label = paste("XGBoost PRAUC = ", round(auc_value, 2))),
    color = "black", size = 4
  ) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    panel.border = element_rect(color = "black", fill = NA, size = 1),
    plot.title = element_text(size = 10, face = "bold"),
    axis.title = element_text(size = 10),
    axis.text = element_text(size = 10),
    legend.position = "bottom"
  ) +
  labs(title = "XGBoost PR Curve")

Feature importance.

importance <- as.data.table(learner$importance(), keep.rownames = TRUE)
colnames(importance) <- c("Feature", "Importance")
importance$Featurename <- featurename[importance$Feature]


ggplot(data = importance, aes(x = reorder(Featurename, Importance), y = Importance)) +
  geom_col(fill = "#0073C2FF") +
  coord_flip() +
  xlab("") +
  ylab("Importance") +
  ggtitle("Feature Importance") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
    axis.text.y = element_text(size = 12),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    panel.grid = element_blank(),
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1),
    axis.ticks = element_line(color = "black")
  )

2.5 cut-off

library(tidyverse)
library(cutpointr)

data <- readRDS("../data/train_score_lable.rds")

data$res <- ifelse(data$`Response to chemotherapy` == "PD", "Non response", "response")

set.seed(100)
opt_cut_manual <- cutpointr(
  data = data, xgbscore, res,
  direction = ">=", pos_class = "response", neg_class = "Non response",
  method = maximize_metric,
  # method = spec_constrain,
  metric = accuracy,
  # cutpoint = mean(data$prediction),
  boot_runs = 100
)
# cut-off
opt_cut_manual$optimal_cutpoint

[1] 0.5837688

plot_metric(opt_cut_manual) +
  theme_bw() +
  theme(panel.grid = element_blank()) +
  labs(title = "", x = "Threshold", y = "F1 score") +
  geom_vline(xintercept = 0.58, color = "orange", linetype = "dashed")

2.6 Distribution of important features

Distribution of important features in different datasets.

library(ggpubr)
library(patchwork)
library(ggsci)

allset_feature <- readRDS("../data/allset_feature.rds") %>% as.data.frame()
featurename <- readRDS("../data/featurename.rds")
allset_feature$batch <- factor(allset_feature$batch, levels = c("Train", "Test1", "Test2", "Test3"))

df <- allset_feature
df$lable <- factor(allset_feature$lable, levels = c(0, 1))

colnames(df)[1:7] <- featurename[colnames(df)[1:7]]

plots <- list()
for (col in names(df)[1:7]) {
  # p <- ggplot(df,aes_string(x = "batch", y =  paste0("`", col, "`"), color = "lable"))+
  p <- ggplot(df, aes(x = batch, y = !!sym(col), color = lable)) +
    geom_boxplot(aes(color = lable),
      alpha = 0.1, outlier.shape = NA
    ) +
    geom_jitter(
      position = position_jitterdodge(
        jitter.height = 0.1,
        jitter.width = 0.3,
        dodge.width = 0.5
      ),
      size = 0.4
    ) +
    scale_color_manual(values = c("#1e56a0", "#f95959")) +
    # scale_fill_manual(values = c("#1e56a0","#f95959"))+
    theme_bw() +
    theme(
      panel.grid = element_blank(),
      axis.text.x = element_text(angle = 45, hjust = 1)
    ) +
    stat_compare_means(aes(group = lable),
      method = "wilcox.test",
      label = "p.signif",
      show.legend = F
    ) +
    xlab("")

  plots[[col]] <- p
}

combined_plot <- wrap_plots(plots, ncol = ) +
  plot_layout(guides = "collect") & theme(legend.position = "bottom")

combined_plot

dev.off()

null device 
          1

PART 3: Model evaluation

Final Model Evaluation

The performance of CNA fingerprint on three independent test sets.

AUC,PRAUC

Defining figure plotting functions.

library(precrec)
library(dplyr)
library(ggplot2)
library(ggpubr)
library(pROC)

preform <- function(score, i) {
  score <- score[which(!is.na(score[, i])), ]
  score$lable <- factor(score$lable, levels = c(0, 1))
  pre_obj1 <- mmdata(score[, i], score$lable, posclass = 1)
  pre_obj1 <- evalmod(pre_obj1)
  auc <- precrec::auc(pre_obj1)
  pre1_df <- fortify(pre_obj1)
  # if(i=="prediction"){i="OXACNAscore"}
  if (i == "Aneuploidy_score") {
    i <- "Aneuploidy Score"
  }
  # if(i=="ploidy"){i="WGD"}
  pre1_df$Dataset <- i
  pre1_df$auc <- round(auc$aucs[1], 2)
  pre1_df$prc <- round(auc$aucs[2], 2)
  return(pre1_df)
}

jyplot <- function(list, type) {
  num <- data.frame(matrix(ncol = 3, nrow = length(list)))
  for (i in 1:length(list)) {
    num[i, 1] <- list[[i]]$Dataset[1]
    num[i, 2] <- list[[i]]$auc[1]
    num[i, 3] <- list[[i]]$prc[1]
  }
  #
  performance_df <- Reduce(rbind, list)
  cols <- c(
    "CNAfingerprint" = "#ea5455",
    "HRDCNAScore" = "#3fbac2",
    "scarHRD" = "#cabbe9",
    "Aneuploidy Score" = "#769fcd"
  )
  if (type == "auc") {
    roc <- performance_df[performance_df$curvetype == "ROC", ]
    p <- ggplot(roc, aes(x = x, y = y, group = Dataset)) +
      theme_bw() +
      geom_line(aes(color = Dataset), linewidth = 1.2) +
      xlab("1-Specificity") +
      ylab("Sensitivity") +
      theme(
        plot.title = element_text(hjust = 0.5),
        panel.grid = element_blank(),
        # line = element_line(color = "black", linewidth = 1.5,
        #                     linetype = 1, lineend = "butt"),
        axis.text.x = element_text(size = 14, color = "black"),
        axis.text.y = element_text(size = 14, color = "black"),
        axis.line = element_line(colour = "black"),
        legend.position = "none",
        title = element_text(size = 14),
      ) +
      scale_color_manual(values = cols) +
      guides(color = guide_legend(title = "Dataset")) +
      annotate("segment",
        x = 0, y = 0, xend = 1, yend = 1,
        color = "gray", size = 0.5, linetype = "dashed"
      ) +
      annotate("text",
        x = .6, y = .26, size = 3, colour = "#ea5455",
        label = paste(num[1, 1], "AUC =", num[1, 2])
      ) +
      annotate("text",
        x = .6, y = .19, size = 3, colour = "#3fbac2",
        label = paste(num[2, 1], "AUC =", num[2, 2])
      ) +
      annotate("text",
        x = .6, y = .12, size = 3, colour = "#cabbe9",
        label = paste(num[3, 1], "AUC =", num[3, 2])
      ) +
      annotate("text",
        x = .6, y = .05, size = 3, colour = "#769fcd",
        label = paste(num[4, 1], "AUC =", num[4, 2])
      )
  } else if (type == "prc") {
    prc <- performance_df[performance_df$curvetype == "PRC", ]

    p <- ggplot(prc, aes(x = x, y = y, group = Dataset)) +
      theme_bw() +
      geom_line(aes(color = Dataset), linewidth = 1) +
      xlab("Recall") +
      ylab("Precision") +
      theme(
        plot.title = element_text(hjust = 0.5),
        panel.grid = element_blank(),
        # line = element_line(color = "black", linewidth = 1.5,
        #                     linetype = 1, lineend = "butt"),
        axis.text.x = element_text(size = 14, color = "black"),
        axis.text.y = element_text(size = 14, color = "black"),
        axis.line = element_line(colour = "black"),
        legend.position = "none",
        title = element_text(size = 14),
        legend.text = element_text(size = 14),
        legend.title = element_text(size = 14)
      ) +
      coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
      scale_color_manual(values = cols) +
      annotate("segment",
        x = 0, y = 1, xend = 1, yend = 0,
        color = "gray", size = 0.5, linetype = "dashed"
      ) +
      annotate("text",
        x = .4, y = .26, size = 3, colour = "#ea5455",
        label = paste(num[1, 1], "PR AUC =", num[1, 3])
      ) +
      annotate("text",
        x = .4, y = .19, size = 3, colour = "#3fbac2",
        label = paste(num[2, 1], "PR AUC =", num[2, 3])
      ) +
      annotate("text",
        x = .4, y = .12, size = 3, colour = "#cabbe9",
        label = paste(num[3, 1], "PR AUC =", num[3, 3])
      ) +
      annotate("text",
        x = .4, y = .05, size = 3, colour = "#769fcd",
        label = paste(num[4, 1], "PR AUC =", num[4, 3])
      )
  }

  return(p)
}

score_turn <- function(data, type) {
  if (type == "soft") {
    # data$ploidy <- max(data$ploidy)-data$ploidy
    data$HRDCNAScore <- 1 - data$HRDCNAScore
    data$scarHRD <- max(data$scarHRD) - data$scarHRD
    data$Aneuploidy_score <- (max(data$Aneuploidy_score) - data$Aneuploidy_score) / 39
  }
  if (type == "hard") {
    data$CNAfingerprint <- ifelse(data$CNAfingerprint > 0.58, 1, 0)
    # data$ploidy <- ifelse(data$ploidy>=4,0,1)
    data$HRDCNAScore <- ifelse(data$HRDCNAScore > 0.2, 0, 1)
    data$scarHRD <- (data$scarHRD - min(data$scarHRD)) / (max(data$scarHRD) - min(data$scarHRD))
    data$scarHRD <- 1 - data$scarHRD
    scarcut <- quantile(data$scarHRD, probs = 0.75)
    data$scarHRD <- ifelse(data$scarHRD > scarcut, 1, 0)
    #
    data$Aneuploidy_score <- data$Aneuploidy_score / 39
    as_cut <- quantile(data$Aneuploidy_score, probs = 0.75)
    data$Aneuploidy_score <- ifelse(data$Aneuploidy_score > as_cut, 0, 1)
    data$`Tumor site` <- ifelse(data$`Tumor site` == "Right side", 0, 1)
  }
  return(data)
}

Take the example of calculating the RO AUC and PR AUC for Test1 cohort.

data <- read.csv("../data/all_test.csv", check.names = F)

#
data <- data[which(data$batch == "Test1"), ]

data_auc <- score_turn(data, "soft")

preform_list <- list()
for (i in c("CNAfingerprint", "HRDCNAScore", "scarHRD", "Aneuploidy_score")) {
  a <- preform(data_auc, i)
  preform_list <- c(preform_list, list(a))
}

# auc prc
library(patchwork)
p <- jyplot(preform_list, "auc")
p2 <- jyplot(preform_list, "prc")
p + p2

The ROAUC and PRAUC for all test cohorts.

for no-oxaliplatin cohort,we also get the AUC.

noOxa_cohort <- readRDS("../data/noOxa_cohort.rds")
data <- na.omit(noOxa_cohort)
data$lable <- ifelse(data$res == "PD", 0, 1)

plot.roc(
  data$lable,
  data$CNAfingerprint,
  col = "#3d84a8",
  percent = TRUE,
  lwd = 2.5,
  print.auc = TRUE,
  print.auc.cex = 0.7,
  print.auc.pattern = paste0("CNA fingerprint", ": %.1f%%"),
  print.auc.y = 35
)

PART 4: Biomarker comparison

The performance of CNA fingerprint in oxaliplatin response prediction was compared with: –HRD status –KRAS mutation –aneuploidy –primary tumor location.

4.1 HRD status and aneuploidy

We use R package HRDCNA(Using the threshold values 0.2 recommended by the tool guideline. see 1.2 CNA features calling, function CNF_call) and scarHRD to assess the HRD status score;

# library(devtools)
# install_bitbucket('sequenza_tools/sequenza')
# install_github('sztup/scarHRD',build_vignettes = TRUE)
# Note that this R package installation is detailed at https://github.com/sztup/scarHRD

library(scarHRD)

# input format
# SampleID  Chromosome  Start_position  End_position    total_cn    A_cn    B_cn    ploidy
# SamplePatient1       chr1          14574       952448        5    0    5

exampleSeg <- readRDS("../data/exampleSeg.rds")

# Assuming a ploidy of 5
ploidy <- 5
data <- data.frame(exampleSeg[, c(5, 1:4)],
  A_cn = exampleSeg$segVal,
  B_cn = 0,
  ploidy = ploidy
)

colnames(data) <- c(
  "SampleID", "Chromosome", "Start_position", "End_position",
  "total_cn", "A_cn", "B_cn", "ploidy"
)

data %>%
  group_by(SampleID) %>%
  group_split() %>%
  walk(~ {
    group_name <- unique(.x$SampleID)
    # save as csv
    write.table(.x,
      file = paste0("../data/scarHRD/", group_name, ".txt"),
      row.names = FALSE, quote = F, sep = "\t"
    )
  })

csv_files <- list.files(
  path = "../data/scarHRD/",
  pattern = "\\.txt$", full.names = TRUE
)
score <- data.frame()
for (file in csv_files) {
  sample <- gsub(".txt", "", basename(file))
  scar_train <- scar_score(file, reference = "grch38", seqz = FALSE)
  df <- data.frame(scar_train, sample = sample)
  score <- rbind(score, df)
}

R package AneuploidyScore was used to calculate the aneuploidy score (total number of arm level gains/losses).

Note that this R package only supports hg19, so we first need to perform a genome version conversion

# devtools::install_github("quevedor2/aneuploidy_score", ref='master')
library(GenomicRanges)
library(liftOver)
library(AneuploidyScore)

cytoarm <- cytobandToArm(ucsc.hg19.cytoband)
ch <- import.chain("../data/hg38ToHg19.over.chain")

exampleSeg <- readRDS("../data/exampleSeg.rds")
# Assuming a ploidy of 5
ploidy <- 5

out <- data.frame()

for (i in unique(exampleSeg$sample)) {
  segdata <- exampleSeg[exampleSeg$sample == i, ]
  # ploidy=pp[which(pp$sample==i),"polidy"]
  ploidy <- 5
  seg <- GRanges(
    # seqnames = paste0("chr",segdata$Chromosome),
    seqnames = segdata$chromosome,
    ranges = IRanges(start = segdata$start, end = segdata$end),
    strand = "*",
    TCN = segdata$segVal
  )
  hg19.seg <- liftOver(seg, ch)
  seg_caa <- getCAA(hg19.seg, cytoarm,
    tcn_col = "TCN", classifyCN = TRUE,
    ploidy = ploidy, threshold = 0.5
  )
  caa <- reduceArms(seg_caa,
    caa_method = c("arm", "seg"),
    arm_ids = c("p", "q")
  )

  result <- data.frame(
    sample = i,
    t(caa[, 1]),
    aneuploidy_score = colSums(abs(caa), na.rm = TRUE)
  )
  out <- rbind(out, result[1, ])
}
# Removal of sex chromosomes
aneuploidy_score <- rowSums(abs(out[, 2:45]), na.rm = TRUE)
aneuploidy_score <- data.frame(sample = out$sample, AS = aneuploidy_score)

4.2 Comparison and application

library(caret)

all_test <- read.csv("../data/all_test.csv", check.names = F)
# from change
score_turn <- function(data, type) {
  if (type == "soft") {
    data$HRDCNAScore <- 1 - data$HRDCNAScore
    data$scarHRD <- max(data$scarHRD) - data$scarHRD
    data$Aneuploidy_score <- (max(data$Aneuploidy_score) - data$Aneuploidy_score) / 39
  }
  if (type == "hard") {
    data$CNAfingerprint <- ifelse(data$CNAfingerprint > 0.58, 1, 0)
    data$HRDCNAScore <- ifelse(data$HRDCNAScore > 0.2, 0, 1)
    data$scarHRD <- (data$scarHRD - min(data$scarHRD)) / (max(data$scarHRD) - min(data$scarHRD))
    data$scarHRD <- 1 - data$scarHRD
    scarcut <- quantile(data$scarHRD, probs = 0.75)
    data$scarHRD <- ifelse(data$scarHRD > scarcut, 1, 0)
    #
    data$Aneuploidy_score <- data$Aneuploidy_score / 39
    as_cut <- quantile(data$Aneuploidy_score, probs = 0.75)
    data$Aneuploidy_score <- ifelse(data$Aneuploidy_score > as_cut, 0, 1)
    data$`Tumor site` <- ifelse(data$`Tumor site` == "Right side", 0, 1)
  }
  return(data)
}


score <- c("CNAfingerprint", "HRDCNAScore", "scarHRD", "Aneuploidy_score", "Tumor site")

table <- data.frame(matrix(nrow = 12))

for (set in unique(all_test$batch)) {
  data <- all_test[which(all_test$batch == set), ]
  data <- score_turn(data, type = "hard")
  for (i in score) {
    df <- data[, c("lable", i)]
    colnames(df)[2] <- "score"
    cm <- caret::confusionMatrix(
      data = factor(df$score, levels = c(0, 1)),
      reference = factor(df$lable, levels = c(0, 1)),
      positive = "1"
    )

    list <- as.data.frame(cm$byClass)

    list["Accuracy", 1] <- cm$overall[1]
    colnames(list)[1] <- paste0(set, "_", i)

    table <- cbind(table, list)
  }
}

table <- table[, -1]
saveRDS(table, file = "../data/allscore_CMtable.rds")

Heat maps

library(pheatmap)
allscore_CMtable <- readRDS("../data/allscore_CMtable.rds")

data <- t(allscore_CMtable[c(12, 1, 2, 5:7), ])

pheatmap(data,
  cluster_rows = F,
  cluster_cols = F,
  angle_col = "45",
  color = colorRampPalette(c("#3e92a3", "white", "#ff5335"))(100),
  border_color = "white",
  number_format = "%.2f",
  display_numbers = round(data, 2),
  legend = T
)

4.3 Compare to KRAS mutation

Limited by sample size, we will observe the predictive effect of KRAS mutations on chemotherapy response in all samples annotated with KRAS mutations in the 271 sample.

A total of 95 patients had KRAS mutations and 117 patients did not detect any KRAS,BRAF or NRAS mutations

data <- read.csv("../data/mut.csv")
data$type <- ifelse(data$Response.to.chemotherapy == "PD", 0, 1)
data$mut <- ifelse(data$Hotspot.mutation == "WT", 1, 0)

cm <- caret::confusionMatrix(
  data = factor(data$type),
  reference = factor(data$mut), positive = "1"
)

print(paste0("Accuracy of KRASmutation is ", cm$overall[1]))

[1] "Accuracy of KRASmutation is 0.518867924528302"

#
cm$byClass

         Sensitivity          Specificity       Pos Pred Value 
           0.5128205            0.5263158            0.5714286 
      Neg Pred Value            Precision               Recall 
           0.4672897            0.5714286            0.5128205 
                  F1           Prevalence       Detection Rate 
           0.5405405            0.5518868            0.2830189 
Detection Prevalence    Balanced Accuracy 
           0.4952830            0.5195682

PART 5: Overall survival

Exploring the Predictive Performance of CNA fingerprint for Patient Overall Survival.

Test3 cohort doesn’t have enough information for overall survival.

library(survival)
library(survminer)
library(tidyr)

rm(list = ls())

Overall Survival of test1 and test2 cohort

library(dplyr)

test_all <- read.csv("../data/all_test.csv", check.names = F)

data <- test_all %>%
  filter(batch == "Test1") %>%
  mutate(
    prediction = factor(ifelse(CNAfingerprint > 0.58, "Responder", "Non-responder"),
      levels = c("Non-responder", "Responder")
    ),
    OS = OS / 30
  )

fit <- survfit(Surv(`OS`, `OS_event`) ~ prediction, data = na.omit(data[, 13:15]))

ggsurvplot(
  fit,
  data = data[, 13:15],
  size = 1,
  conf.int = TRUE,
  pval = TRUE,
  pval.method = TRUE,
  risk.table = TRUE,
  palette = c("#3d6cb9", "#bc5148"),
  risk.table.height = 0.25,
  ggtheme = theme_classic2(),
  xlab = "Months"
)

data <- test_all %>%
  filter(batch == "Test2") %>%
  mutate(
    prediction = factor(ifelse(CNAfingerprint > 0.58, "Responder", "Non-responder"),
      levels = c("Non-responder", "Responder")
    ),
    OS = OS / 30
  )

fit <- survfit(Surv(`OS`, `OS_event`) ~ prediction, data = na.omit(data[, 13:15]))

ggsurvplot(
  fit,
  data = data[, 13:15],
  size = 1,
  conf.int = TRUE,
  pval = TRUE,
  pval.method = TRUE,
  risk.table = TRUE,
  palette = c("#3d6cb9", "#bc5148"),
  risk.table.height = 0.25,
  ggtheme = theme_classic2(),
  xlab = "Months"
)

Overall Survival of GSE36864 cohort

GSE36864, aCGH data of colorectal cancers of patients from 2 clinical trials (CAIRO, CAIRO2). 105 patients were treated with capecitabine first line (CAIRO arm A), 111 patients were treated with capecitabine and irinotecan first line (CAIRO arm B), and 133 patients were treated with capecitabine, oxaliplatin and bevacizumab (CAIRO2 arm A).

data <- readRDS("../data/GSE36864_clin.rds")

data$OS_event <- ifelse(data$`progression:ch1` == "yes", 1, 0)
colnames(data)[3] <- "OS"
data$OS <- as.numeric(data$OS) / 30
data$prediction <- ifelse(data$CNAfingerprint > 0.58, 1, 0)

Survival of patients with oxaliplatin-based regimens (GSE36864 cohort, CAIRO2 arm A).

df <- data[which(data$`study:ch1` == "CAIRO2"), ]

fit <- survfit(Surv(`OS`, `OS_event`) ~ prediction, data = na.omit(df))

ggsurvplot(
  fit,
  data = df,
  size = 1,
  conf.int = TRUE,
  pval = TRUE,
  pval.method = TRUE,
  risk.table = TRUE,
  palette = c("#3d6cb9", "#bc5148"),
  risk.table.height = 0.25,
  ggtheme = theme_classic2(),
  xlab = "Months"
)

Survival of patients without oxaliplatin-based regimens (GSE36864 cohort, CAIRO).

df <- data[which(data$`study:ch1` == "CAIRO"), ]

fit <- survfit(Surv(`OS`, `OS_event`) ~ prediction, data = na.omit(df))

ggsurvplot(
  fit,
  data = df,
  size = 1,
  conf.int = TRUE,
  pval = TRUE,
  pval.method = TRUE,
  risk.table = TRUE,
  palette = c("#3d6cb9", "#bc5148"),
  risk.table.height = 0.25,
  ggtheme = theme_classic2(),
  xlab = "Months"
)

Overall Survival of no-oxaliplatin cohort

noOxa_cohort <- readRDS("../data/noOxa_cohort.rds")
data <- na.omit(noOxa_cohort)
data$group <- factor(ifelse(data$CNAfingerprint > 0.58, "Responder", "Non-responder"),
  levels = c("Non-responder", "Responder")
)

fit <- survfit(Surv(`OS`, `OS_event`) ~ group, data = data)


ggsurvplot(
  fit,
  data = data,
  size = 1,
  conf.int = TRUE,
  pval = TRUE,
  pval.method = TRUE,
  risk.table = TRUE,
  palette = c("#3d6cb9", "#bc5148"),
  risk.table.height = 0.25,
  ggtheme = theme_classic2(),
  xlab = "Months"
)

Acknowledgments

We thank Fudan University Shanghai Cancer Center and Tongji hospital of Tongji University for sample collection;

We thank ShanghaiTech University High Performance Computing Public Service Platform for computing services. We thank multi-omics facility, molecular and cell biology core facility of ShanghaiTech University for technical help.

LICENSE

If you want to reuse the code in this report, please note the license below should be followed.

The code is made available for non commercial research purposes only under the MIT. However, notwithstanding any provision of the MIT License, the software currently may not be used for commercial purposes without explicit written permission after contacting Jinyu Wangwangjy10@shanghaitech.edu.cn or Tiyu Taotaozy@shanghaitech.edu.cn or Xue-Song Liu liuxs@shanghaitech.edu.cn.