Copy number alteration fingerprint predicts the clinical response of oxaliplatin-based chemotherapy in metastatic colorectal cancer
EQUAL CONTRIBUTION
J.W., J.W., Z.T. and T.W. contributed equally to this work as joint first authors.
X.L., X.L. and X.S.L contributed equally to this work as joint senior authors.
Introduction
Oxaliplatin-based chemotherapy is routinely used in metastatic colorectal cancer (mCRC) patient, however only about half of patients show clinical response after therapy. Accurate biomarkers for predicting oxaliplatin chemotherapy clinical response are still lacking. Here we developed and validated a tumor genomic DNA copy number alteration (CNA) based biomarker.
In total 271 mCRC patients who underwent primary tumor resection and postoperatively treated with oxaliplatin-based chemotherapy are included in this study. Tissue samples and their clinical data were obtained from the Fudan University Shanghai Cancer Center (FUSCC) and Tongji Hospital of Tongji University. The sample selection criteria are detailed here.
PART 1: Data preprocessing
To develop a stable, effective, and cost-efficient predictive biomarker for oxaliplatin-based chemotherapy response in metastatic colorectal cancer patients, we utilized shallow whole-genome sequencing(sWGS) data to extract copy number alteration(CNA) features. These features, combined with clinical characteristics, were used to construct a machine learning model, ultimately resulting in the development of a highly predictive model known as the CNA fingerprint.
1.1 Raw data processing
suppose i is the sample name. Using follow tools:
fastp
Quality control.
BWA MEM
Mapping.
Samtools
Transform sam to bam file.
1.2 CNA features calling
suppose you have got your bam file. Using follow tools:
QDNAseq
CBC segment,the defult genome is hg19,here we use hg38,and binsize as 100.
library(QDNAseq)
library(QDNAseq.hg38)
library(stringr)
library(ACE)
library(DNAcopy)
args <- commandArgs(trailingOnly = TRUE)
# args 1,sample name
# args 2,path to bam
# args 3,the output path
i <- args[1]
s <- paste0(i, "_sorted.bam")
# binsize: 100
bins <- getBinAnnotations(binSize = 100, genome = "hg38")
readCounts <- binReadCounts(bins, bamfile = paste0(args[2], "/", s))
readCountsFiltered <- applyFilters(readCounts, residual = TRUE, blacklist = TRUE)
readCountsFiltered <- estimateCorrection(readCountsFiltered)
copyNumbers <- correctBins(readCountsFiltered)
copyNumbersNormalized <- normalizeBins(copyNumbers)
copyNumbersSmooth <- smoothOutlierBins(copyNumbersNormalized)
copyNumbersSegmented <- segmentBins(copyNumbersSmooth, transformFun = "log2")
copyNumbersSegmented <- normalizeSegmentedBins(copyNumbersSegmented)
save(copyNumbersSegmented, file = "/your_path/copyNumbersSegmented.rda")
you can run like:
ACE
Fit the cellularity and ploidy
sqm <- squaremodel(copyNumbersSegmented, QDNAseqobjectsample = 1,
penalty = 0.65, penploidy = 0.65,
ptop = 5.3, pbottom = 1.8, prows = 250,
cellularities = seq(50, 100, length.out = 50),
sgc = c("X", "Y"),exclude=c())
# Here we get the purity ploidy and error values of all the fits, referring to the official ACE document that states “the minimum error is not necessarily the best fit”, it is recommended to select combinations with higher cellular purity in the similar minimum error set, and at the same time select combinations with higher ploidy based on the a priori knowledge of the genomes of colon cancer.
mdf <- sqm$minimadf
mdf <- mdf[order(mdf$error,-mdf$cellularity),]
# Manual screening and determination ploidy and cellularity,for example:
ploidy = 7.6
cellularity = 0.8
# Absoult CNA calling
m<-ACEcall(copyNumbersSegmented, QDNAseqobjectsample = TRUE, scellularity = cellularity,
ploidy = ploidy, standard =1, plot = FALSE,
qcutoff = -3, subcutoff = 0.2, trncname = FALSE,
cap = 12, bottom = 0,
onlyautosomes = TRUE
)
segment <- data.frame(copyNumbersSegmented@featureData@data[1:nrow(m),1:3],segment_mean=m$segment_mean)
segment <- na.omit(segment)
segment$segment_mean <- round(segment$segment_mean)
Sigminer
CN features calling
here we got the CNA segment files,first we do a filter,delect all segments with num_marks <10,make the file colnames as: chr,start,end segval,sample
We extracted CNA features described in our previous studies(2021 Wang et.al, 2023 Tao et.al) and also global CNA status indicator: CNA burden (proportion of CNA in the genome18); Chromosomal amplification (AMP) or deletion (DEL) ratio; Chromosomal AMP/DEL level (diploid-referenced change magnitude).
data <- readRDS("../data/exampleSeg.rds")
CNF_call <- function(data, note = NULL) {
library(sigminer)
CN <- data[, c("chromosome", "start", "end", "segVal", "sample")]
# Wang et al.
cn <- read_copynumber(CN,
seg_cols = c("chromosome", "start", "end", "segVal"),
samp_col = "sample",
genome_build = "hg38", complement = FALSE
)
#
cn_sum <- cn@summary.per.sample
rownames(cn_sum) <- cn_sum$sample
tally_w <- sig_tally(cn, method = "W", cores = 10)
if (!is.null(note)) {
saveRDS(tally_w, file = paste0("./data/results/", note, "_tally_W.rds"))
}
cn_fet <- tally_w[["nmf_matrix"]]
feature_m <- data.frame(cn_fet)
colnames(feature_m) <- colnames(cn_fet)
feature_m$sample <- rownames(cn_fet)
# if only one sample,rownames(cn_fet)will be null
colon_cnf <- merge(cn_sum, feature_m, by = "sample", all = T)
# Yao et al.
# HRD(cutoff 0.2)
library(HRDCNA)
score_yhz <- HRDprediction(data = feature_m)
colon_cnf <- merge(colon_cnf, score_yhz, by = "sample", all = T)
# Tao et al.
tally_X_noLOH <- sig_tally(cn,
method = "X",
add_loh = FALSE,
cores = 10
)
if (!is.null(note)) {
saveRDS(tally_X_noLOH, file = paste0("./data/results/", note, "_tally_X_noLOH.rds"))
}
cn_fet <- tally_X_noLOH[["all_matrices"]][["simplified_matrix"]]
feature_m <- data.frame(cn_fet)
colnames(feature_m) <- colnames(cn_fet)
feature_m$sample <- rownames(cn_fet)
colon_cnf <- merge(colon_cnf, feature_m, by = "sample", all = T)
# global CNA status(amp and del states of every chr)
data$chromosome <- ifelse(startsWith(data$chromosome, "chr"),
data$chromosome,
paste0("chr", data$chromosome)
)
gCNA <- data.frame(matrix(nrow = 0, ncol = 88))
coln <- c()
for (i in 1:22) {
coln <- c(
coln, paste0("chr[", i, "]AMPratio"),
paste0("chr[", i, "]AMPlevel"),
paste0("chr[", i, "]DELratio"),
paste0("chr[", i, "]DELlevel")
)
}
chrname <- paste0("chr", 1:22)
for (i in unique(data$sample)) {
df <- data[data$sample == i, ]
onechr <- c(i)
for (j in chrname) {
df1 <- df[df$chromosome == j, ]
totallength <- sum(df1$end - df1$start)
if (any(df1$segVal > 2)) {
al <- df1[which(df1$segVal > 2), 3] - df1[which(df1$segVal > 2), 2]
ampr <- sum(al) / totallength
ampl <- sum(al * df1[which(df1$segVal > 2), 4]) / sum(al * 2)
} else {
ampr <- 0
ampl <- 0
}
if (any(df1$segVal < 2)) {
dl <- df1[which(df1$segVal < 2), 3] - df1[which(df1$segVal < 2), 2]
delr <- sum(dl) / totallength
dell <- sum(dl * df1[which(df1$segVal < 2), 4]) / sum(dl * 2)
} else {
delr <- 0
dell <- 0
}
onechr <- c(
onechr,
round(ampr, 4),
round(ampl, 4),
round(delr, 4),
round(dell, 4)
)
}
onechr <- data.frame(t(onechr))
colnames(onechr) <- c("sample", coln)
gCNA <- rbind(gCNA, onechr)
}
colon_cnf <- merge(colon_cnf, gCNA, by = "sample")
colon_cnf[, 2:312] <- lapply(colon_cnf[, 2:312], as.numeric)
return(colon_cnf)
}
features <- CNF_call(data, note = NULL)
saveRDS(features, file = "../data/examplefeatures.rds")
1.3 Feature distribution
TCGA data as an example, the distribution(W method and X method) of its features is demonstrated by function show_catalogue.
library(sigminer)
ace_tally_W <- readRDS("../data/ace_tally_W.rds")
tcga_tally_W <- readRDS("../data/tcga_tally_W.rds")
show_catalogue(t(ace_tally_W[, -1]), style = "cosmic", mode = "copynumber", x_label_angle = 90)
show_catalogue(t(tcga_tally_W[, -1]), style = "cosmic", mode = "copynumber", x_label_angle = 90)
ace_tally_X <- readRDS("../data/ace_tally_X.rds")
tcga_tally_X <- readRDS("../data/tcga_tally_X.rds")
show_catalogue(t(ace_tally_X[, -1]), style = "cosmic", mode = "copynumber", method = "T", x_label_angle = 90)
show_catalogue(t(tcga_tally_X[, -1]), style = "cosmic", mode = "copynumber", method = "T", x_label_angle = 90)
The CNA feature distribution in TCGA mCRCs.
The CNA feature distribution( W method ) in Train cohort.
The CNA feature distribution( X method ) in Train cohort.
The CNA feature distribution in Test1,Test2 and Test3 cohort can be presented in the same way.
PART 2: Model training
2.1 Internal benchmark
We performed internal benchmark to compare the performance of different CNA feature sets and 5 different machine learning models (1) extreme gradient boosting (XGBoost), (2) neural networks (NNET), (3) random forest (RF), (4) Naive bayes, (5) light gradient boosting machine (LGBM) in the training dataset.
The CNA information we compare here is the CNA signature reported by Drew et.al
Example calling process of CNA signature reported by Drews et.al.
# devtools::install_github("markowetzlab/CINSignatureQuantification", build_vignettes = TRUE, dependencies = TRUE)
# input format
# chromosome start end segVal sample
# 1 61735 249224388 3.1 TCGA-BT-A20P
library(tidyverse)
library(CINSignatureQuantification)
data <- readRDS("../data/exampleSeg.rds")
mySigs <- quantifyCNSignatures(swgs_segment.ace_train)
D_sig <- mySigs@activities[["rawAct0"]]
# Low segment counts: Features not computed for 8 of 121 samples - see `getSampleFeatures()`
All feature-model combinations were feature-selected and trained and then banchmarked by five-fold cross-validation. here we only show the training process of XGBoost in next section.
2.2 Model Training
We train the model with the help of mlr3. Create our own task:
library(mlr3verse)
library(ggplot2)
train_cnf <- readRDS("../data/train_cnf.rds")
data <- train_cnf %>%
mutate(lable = ifelse(train_cnf$lable == "PD", 0, 1))
colnames(data)[1:(ncol(data) - 1)] <- paste0("feature", 1:(ncol(data) - 1))
# keep the feature names
featurename <- colnames(train_cnf)[1:(ncol(train_cnf) - 1)]
names(featurename) <- paste0("feature", 1:(ncol(data) - 2))
saveRDS(featurename, file = "../data/featurename.rds")
# task
tsk_train <- as_task_classif(data, target = "lable", positive = "1")
we use the property “importance” of XGBoost for perform initial feature selection.
set.seed(2025)
learner <- lrn("classif.xgboost", predict_type = "prob")
filter <- flt("importance", learner = learner)
filter$calculate(tsk_train)
diffeaure <- imp$feature
tsk_train$select(diffeaure)
Then, tsk_train keep 90 features, we use a grid search to determine the parameter: “nrounds”.
set.seed(2025)
learner <- lrn("classif.xgboost",
predict_type = "prob", eta = 0.1,
nrounds = to_tune(80, 250),
nthread = 20
)
instance <- tune(
tuner = mlr3tuning::tnr("grid_search", resolution = 170, batch_size = 10),
task = tsk_train,
learner = learner,
resampling = rsmp("cv", folds = 5),
measures = msr("classif.auc"),
terminator = trm("stagnation", iters = 200, threshold = 0.05)
)
# results get nrounds = 132
instance$result_learner_param_vals
# other parameters
set.seed(2025)
learner <- lrn("classif.xgboost",
predict_type = "prob", nrounds = 132,
nthread = 20
)
search_space <- ps(
eta = p_dbl(lower = .001, upper = .2),
max_depth = p_int(lower = 1, upper = 10),
# max_delta_step = p_int(lower = 1,upper = 10),
alpha = p_dbl(lower = 0, upper = 0.5),
lambda = p_dbl(lower = 0, upper = 1),
gamma = p_int(lower = 0, upper = 6),
min_child_weight = p_int(lower = 1, upper = 10),
subsample = p_dbl(lower = .8, upper = 1)
# colsample_bytree = p_dbl( lower = .5, upper = 1)
# colsample_bylevel = p_dbl(lower = .5, upper = 1)
)
instance <- tune(
tuner = mlr3tuning::tnr("random_search", batch_size = 10),
task = tsk_train,
learner = learner,
search_space = search_space,
resampling = rsmp("cv", folds = 5),
measures = msr("classif.auc"),
terminator = trm("stagnation", iters = 200, threshold = 0.01)
)
# Model parameter update
set.seed(2025)
learner$param_set$values <- instance$result_learner_param_vals
For other parameter mediation, given the large number of combinations, we use “random_search”.you can also use”grid_search”,but it would be advisable to first conduct parameter searches for a subset, then utilize the optimized subset to search for other parameters, rather than running all parameters simultaneously.
After we update the parameters,the model also performs feature selection,
importance <- as.data.table(learner$importance(), keep.rownames = TRUE)
colnames(importance) <- c("Feature", "Importance")
importance$Featurename <- featurename[importance$Feature]
Next, we adjust again based on these 7 features, the adjustment process is the same as above.
diffeaure <- importance$Feature
saveRDS(diffeaure, file = "../data/xgb_feature7.rds")
tsk_train$select(diffeaure)
set.seed(2025)
learner <- lrn("classif.xgboost",
predict_type = "prob", eta = 0.1,
nrounds = to_tune(80, 250),
nthread = 20
)
instance <- tune(
tuner = mlr3tuning::tnr("grid_search", resolution = 170, batch_size = 10),
task = tsk_train,
learner = learner,
resampling = rsmp("cv", folds = 5),
measures = msr("classif.auc"),
terminator = trm("stagnation", iters = 200, threshold = 0.05)
)
# results get nrounds = 83
instance$result_learner_param_vals
# other parameters
set.seed(2025)
learner <- lrn("classif.xgboost",
predict_type = "prob", nrounds = 83,
nthread = 20
)
search_space <- ps(
eta = p_dbl(lower = .001, upper = .2),
max_depth = p_int(lower = 1, upper = 10),
# max_delta_step = p_int(lower = 1,upper = 10),
alpha = p_dbl(lower = 0, upper = 0.5),
lambda = p_dbl(lower = 0, upper = 1),
gamma = p_int(lower = 0, upper = 6),
min_child_weight = p_int(lower = 1, upper = 10),
subsample = p_dbl(lower = .8, upper = 1)
# colsample_bytree = p_dbl( lower = .5, upper = 1)
# colsample_bylevel = p_dbl(lower = .5, upper = 1)
)
instance <- tune(
tuner = mlr3tuning::tnr("random_search", batch_size = 10),
task = tsk_train,
learner = learner,
search_space = search_space,
resampling = rsmp("cv", folds = 5),
measures = msr("classif.auc"),
terminator = trm("stagnation", iters = 200, threshold = 0.01)
)
2.4 Final model
Update the model again to get the final model.
# Model parameter update
set.seed(2025)
learner$param_set$values <- instance$result_learner_param_vals
learner$train(tsk_train)
saveRDS(learner, file = "../data/xgboostle.rds")
AUC and PR AUC of the model on the training set.
trainres <- learner$predict(tsk_train)
auc_value <- trainres$score(msr("classif.auc"))
autoplot(trainres, type = "roc") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
geom_text(aes(x = 0.6, y = 0.2, label = paste("XGBoost AUC = ", round(auc_value, 4))),
color = "black", size = 4
) +
theme_minimal() +
theme(
panel.grid = element_blank(),
panel.border = element_rect(color = "black", fill = NA, size = 1),
plot.title = element_text(size = 10, face = "bold"),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom"
) +
labs(title = "XGBoost ROC Curve")
auc_value <- trainres$score(msr("classif.prauc"))
autoplot(trainres, type = "prc") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
geom_text(aes(x = 0.6, y = 0.2, label = paste("XGBoost PRAUC = ", round(auc_value, 2))),
color = "black", size = 4
) +
theme_minimal() +
theme(
panel.grid = element_blank(),
panel.border = element_rect(color = "black", fill = NA, size = 1),
plot.title = element_text(size = 10, face = "bold"),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom"
) +
labs(title = "XGBoost PR Curve")
Feature importance.
importance <- as.data.table(learner$importance(), keep.rownames = TRUE)
colnames(importance) <- c("Feature", "Importance")
importance$Featurename <- featurename[importance$Feature]
ggplot(data = importance, aes(x = reorder(Featurename, Importance), y = Importance)) +
geom_col(fill = "#0073C2FF") +
coord_flip() +
xlab("") +
ylab("Importance") +
ggtitle("Feature Importance") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
axis.text.y = element_text(size = 12),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(size = 14),
panel.grid = element_blank(),
panel.border = element_rect(color = "black", fill = NA, linewidth = 1),
axis.ticks = element_line(color = "black")
)
2.5 cut-off
library(tidyverse)
library(cutpointr)
data <- readRDS("../data/train_score_lable.rds")
data$res <- ifelse(data$`Response to chemotherapy` == "PD", "Non response", "response")
set.seed(100)
opt_cut_manual <- cutpointr(
data = data, xgbscore, res,
direction = ">=", pos_class = "response", neg_class = "Non response",
method = maximize_metric,
# method = spec_constrain,
metric = accuracy,
# cutpoint = mean(data$prediction),
boot_runs = 100
)
# cut-off
opt_cut_manual$optimal_cutpoint
[1] 0.5837688
plot_metric(opt_cut_manual) +
theme_bw() +
theme(panel.grid = element_blank()) +
labs(title = "", x = "Threshold", y = "F1 score") +
geom_vline(xintercept = 0.58, color = "orange", linetype = "dashed")
2.6 Distribution of important features
Distribution of important features in different datasets.
library(ggpubr)
library(patchwork)
library(ggsci)
allset_feature <- readRDS("../data/allset_feature.rds") %>% as.data.frame()
featurename <- readRDS("../data/featurename.rds")
allset_feature$batch <- factor(allset_feature$batch, levels = c("Train", "Test1", "Test2", "Test3"))
df <- allset_feature
df$lable <- factor(allset_feature$lable, levels = c(0, 1))
colnames(df)[1:7] <- featurename[colnames(df)[1:7]]
plots <- list()
for (col in names(df)[1:7]) {
# p <- ggplot(df,aes_string(x = "batch", y = paste0("`", col, "`"), color = "lable"))+
p <- ggplot(df, aes(x = batch, y = !!sym(col), color = lable)) +
geom_boxplot(aes(color = lable),
alpha = 0.1, outlier.shape = NA
) +
geom_jitter(
position = position_jitterdodge(
jitter.height = 0.1,
jitter.width = 0.3,
dodge.width = 0.5
),
size = 0.4
) +
scale_color_manual(values = c("#1e56a0", "#f95959")) +
# scale_fill_manual(values = c("#1e56a0","#f95959"))+
theme_bw() +
theme(
panel.grid = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
stat_compare_means(aes(group = lable),
method = "wilcox.test",
label = "p.signif",
show.legend = F
) +
xlab("")
plots[[col]] <- p
}
combined_plot <- wrap_plots(plots, ncol = ) +
plot_layout(guides = "collect") & theme(legend.position = "bottom")
combined_plot
null device
1
PART 3: Model evaluation
Final Model Evaluation
The performance of CNA fingerprint on three independent test sets.
AUC,PRAUC
Defining figure plotting functions.
library(precrec)
library(dplyr)
library(ggplot2)
library(ggpubr)
library(pROC)
preform <- function(score, i) {
score <- score[which(!is.na(score[, i])), ]
score$lable <- factor(score$lable, levels = c(0, 1))
pre_obj1 <- mmdata(score[, i], score$lable, posclass = 1)
pre_obj1 <- evalmod(pre_obj1)
auc <- precrec::auc(pre_obj1)
pre1_df <- fortify(pre_obj1)
# if(i=="prediction"){i="OXACNAscore"}
if (i == "Aneuploidy_score") {
i <- "Aneuploidy Score"
}
# if(i=="ploidy"){i="WGD"}
pre1_df$Dataset <- i
pre1_df$auc <- round(auc$aucs[1], 2)
pre1_df$prc <- round(auc$aucs[2], 2)
return(pre1_df)
}
jyplot <- function(list, type) {
num <- data.frame(matrix(ncol = 3, nrow = length(list)))
for (i in 1:length(list)) {
num[i, 1] <- list[[i]]$Dataset[1]
num[i, 2] <- list[[i]]$auc[1]
num[i, 3] <- list[[i]]$prc[1]
}
#
performance_df <- Reduce(rbind, list)
cols <- c(
"CNAfingerprint" = "#ea5455",
"HRDCNAScore" = "#3fbac2",
"scarHRD" = "#cabbe9",
"Aneuploidy Score" = "#769fcd"
)
if (type == "auc") {
roc <- performance_df[performance_df$curvetype == "ROC", ]
p <- ggplot(roc, aes(x = x, y = y, group = Dataset)) +
theme_bw() +
geom_line(aes(color = Dataset), linewidth = 1.2) +
xlab("1-Specificity") +
ylab("Sensitivity") +
theme(
plot.title = element_text(hjust = 0.5),
panel.grid = element_blank(),
# line = element_line(color = "black", linewidth = 1.5,
# linetype = 1, lineend = "butt"),
axis.text.x = element_text(size = 14, color = "black"),
axis.text.y = element_text(size = 14, color = "black"),
axis.line = element_line(colour = "black"),
legend.position = "none",
title = element_text(size = 14),
) +
scale_color_manual(values = cols) +
guides(color = guide_legend(title = "Dataset")) +
annotate("segment",
x = 0, y = 0, xend = 1, yend = 1,
color = "gray", size = 0.5, linetype = "dashed"
) +
annotate("text",
x = .6, y = .26, size = 3, colour = "#ea5455",
label = paste(num[1, 1], "AUC =", num[1, 2])
) +
annotate("text",
x = .6, y = .19, size = 3, colour = "#3fbac2",
label = paste(num[2, 1], "AUC =", num[2, 2])
) +
annotate("text",
x = .6, y = .12, size = 3, colour = "#cabbe9",
label = paste(num[3, 1], "AUC =", num[3, 2])
) +
annotate("text",
x = .6, y = .05, size = 3, colour = "#769fcd",
label = paste(num[4, 1], "AUC =", num[4, 2])
)
} else if (type == "prc") {
prc <- performance_df[performance_df$curvetype == "PRC", ]
p <- ggplot(prc, aes(x = x, y = y, group = Dataset)) +
theme_bw() +
geom_line(aes(color = Dataset), linewidth = 1) +
xlab("Recall") +
ylab("Precision") +
theme(
plot.title = element_text(hjust = 0.5),
panel.grid = element_blank(),
# line = element_line(color = "black", linewidth = 1.5,
# linetype = 1, lineend = "butt"),
axis.text.x = element_text(size = 14, color = "black"),
axis.text.y = element_text(size = 14, color = "black"),
axis.line = element_line(colour = "black"),
legend.position = "none",
title = element_text(size = 14),
legend.text = element_text(size = 14),
legend.title = element_text(size = 14)
) +
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
scale_color_manual(values = cols) +
annotate("segment",
x = 0, y = 1, xend = 1, yend = 0,
color = "gray", size = 0.5, linetype = "dashed"
) +
annotate("text",
x = .4, y = .26, size = 3, colour = "#ea5455",
label = paste(num[1, 1], "PR AUC =", num[1, 3])
) +
annotate("text",
x = .4, y = .19, size = 3, colour = "#3fbac2",
label = paste(num[2, 1], "PR AUC =", num[2, 3])
) +
annotate("text",
x = .4, y = .12, size = 3, colour = "#cabbe9",
label = paste(num[3, 1], "PR AUC =", num[3, 3])
) +
annotate("text",
x = .4, y = .05, size = 3, colour = "#769fcd",
label = paste(num[4, 1], "PR AUC =", num[4, 3])
)
}
return(p)
}
score_turn <- function(data, type) {
if (type == "soft") {
# data$ploidy <- max(data$ploidy)-data$ploidy
data$HRDCNAScore <- 1 - data$HRDCNAScore
data$scarHRD <- max(data$scarHRD) - data$scarHRD
data$Aneuploidy_score <- (max(data$Aneuploidy_score) - data$Aneuploidy_score) / 39
}
if (type == "hard") {
data$CNAfingerprint <- ifelse(data$CNAfingerprint > 0.58, 1, 0)
# data$ploidy <- ifelse(data$ploidy>=4,0,1)
data$HRDCNAScore <- ifelse(data$HRDCNAScore > 0.2, 0, 1)
data$scarHRD <- (data$scarHRD - min(data$scarHRD)) / (max(data$scarHRD) - min(data$scarHRD))
data$scarHRD <- 1 - data$scarHRD
scarcut <- quantile(data$scarHRD, probs = 0.75)
data$scarHRD <- ifelse(data$scarHRD > scarcut, 1, 0)
#
data$Aneuploidy_score <- data$Aneuploidy_score / 39
as_cut <- quantile(data$Aneuploidy_score, probs = 0.75)
data$Aneuploidy_score <- ifelse(data$Aneuploidy_score > as_cut, 0, 1)
data$`Tumor site` <- ifelse(data$`Tumor site` == "Right side", 0, 1)
}
return(data)
}
Take the example of calculating the RO AUC and PR AUC for Test1 cohort.
data <- read.csv("../data/all_test.csv", check.names = F)
#
data <- data[which(data$batch == "Test1"), ]
data_auc <- score_turn(data, "soft")
preform_list <- list()
for (i in c("CNAfingerprint", "HRDCNAScore", "scarHRD", "Aneuploidy_score")) {
a <- preform(data_auc, i)
preform_list <- c(preform_list, list(a))
}
# auc prc
library(patchwork)
p <- jyplot(preform_list, "auc")
p2 <- jyplot(preform_list, "prc")
p + p2
The ROAUC and PRAUC for all test cohorts.
for no-oxaliplatin cohort,we also get the AUC.
noOxa_cohort <- readRDS("../data/noOxa_cohort.rds")
data <- na.omit(noOxa_cohort)
data$lable <- ifelse(data$res == "PD", 0, 1)
plot.roc(
data$lable,
data$CNAfingerprint,
col = "#3d84a8",
percent = TRUE,
lwd = 2.5,
print.auc = TRUE,
print.auc.cex = 0.7,
print.auc.pattern = paste0("CNA fingerprint", ": %.1f%%"),
print.auc.y = 35
)
PART 4: Biomarker comparison
The performance of CNA fingerprint in oxaliplatin response prediction was compared with: –HRD status –KRAS mutation –aneuploidy –primary tumor location.
4.1 HRD status and aneuploidy
We use R package HRDCNA(Using the threshold values 0.2 recommended by the tool guideline. see 1.2 CNA features calling, function CNF_call) and scarHRD to assess the HRD status score;
# library(devtools)
# install_bitbucket('sequenza_tools/sequenza')
# install_github('sztup/scarHRD',build_vignettes = TRUE)
# Note that this R package installation is detailed at https://github.com/sztup/scarHRD
library(scarHRD)
# input format
# SampleID Chromosome Start_position End_position total_cn A_cn B_cn ploidy
# SamplePatient1 chr1 14574 952448 5 0 5
exampleSeg <- readRDS("../data/exampleSeg.rds")
# Assuming a ploidy of 5
ploidy <- 5
data <- data.frame(exampleSeg[, c(5, 1:4)],
A_cn = exampleSeg$segVal,
B_cn = 0,
ploidy = ploidy
)
colnames(data) <- c(
"SampleID", "Chromosome", "Start_position", "End_position",
"total_cn", "A_cn", "B_cn", "ploidy"
)
data %>%
group_by(SampleID) %>%
group_split() %>%
walk(~ {
group_name <- unique(.x$SampleID)
# save as csv
write.table(.x,
file = paste0("../data/scarHRD/", group_name, ".txt"),
row.names = FALSE, quote = F, sep = "\t"
)
})
csv_files <- list.files(
path = "../data/scarHRD/",
pattern = "\\.txt$", full.names = TRUE
)
score <- data.frame()
for (file in csv_files) {
sample <- gsub(".txt", "", basename(file))
scar_train <- scar_score(file, reference = "grch38", seqz = FALSE)
df <- data.frame(scar_train, sample = sample)
score <- rbind(score, df)
}
R package AneuploidyScore was used to calculate the aneuploidy score (total number of arm level gains/losses).
Note that this R package only supports hg19, so we first need to perform a genome version conversion
# devtools::install_github("quevedor2/aneuploidy_score", ref='master')
library(GenomicRanges)
library(liftOver)
library(AneuploidyScore)
cytoarm <- cytobandToArm(ucsc.hg19.cytoband)
ch <- import.chain("../data/hg38ToHg19.over.chain")
exampleSeg <- readRDS("../data/exampleSeg.rds")
# Assuming a ploidy of 5
ploidy <- 5
out <- data.frame()
for (i in unique(exampleSeg$sample)) {
segdata <- exampleSeg[exampleSeg$sample == i, ]
# ploidy=pp[which(pp$sample==i),"polidy"]
ploidy <- 5
seg <- GRanges(
# seqnames = paste0("chr",segdata$Chromosome),
seqnames = segdata$chromosome,
ranges = IRanges(start = segdata$start, end = segdata$end),
strand = "*",
TCN = segdata$segVal
)
hg19.seg <- liftOver(seg, ch)
seg_caa <- getCAA(hg19.seg, cytoarm,
tcn_col = "TCN", classifyCN = TRUE,
ploidy = ploidy, threshold = 0.5
)
caa <- reduceArms(seg_caa,
caa_method = c("arm", "seg"),
arm_ids = c("p", "q")
)
result <- data.frame(
sample = i,
t(caa[, 1]),
aneuploidy_score = colSums(abs(caa), na.rm = TRUE)
)
out <- rbind(out, result[1, ])
}
# Removal of sex chromosomes
aneuploidy_score <- rowSums(abs(out[, 2:45]), na.rm = TRUE)
aneuploidy_score <- data.frame(sample = out$sample, AS = aneuploidy_score)
4.2 Comparison and application
library(caret)
all_test <- read.csv("../data/all_test.csv", check.names = F)
# from change
score_turn <- function(data, type) {
if (type == "soft") {
data$HRDCNAScore <- 1 - data$HRDCNAScore
data$scarHRD <- max(data$scarHRD) - data$scarHRD
data$Aneuploidy_score <- (max(data$Aneuploidy_score) - data$Aneuploidy_score) / 39
}
if (type == "hard") {
data$CNAfingerprint <- ifelse(data$CNAfingerprint > 0.58, 1, 0)
data$HRDCNAScore <- ifelse(data$HRDCNAScore > 0.2, 0, 1)
data$scarHRD <- (data$scarHRD - min(data$scarHRD)) / (max(data$scarHRD) - min(data$scarHRD))
data$scarHRD <- 1 - data$scarHRD
scarcut <- quantile(data$scarHRD, probs = 0.75)
data$scarHRD <- ifelse(data$scarHRD > scarcut, 1, 0)
#
data$Aneuploidy_score <- data$Aneuploidy_score / 39
as_cut <- quantile(data$Aneuploidy_score, probs = 0.75)
data$Aneuploidy_score <- ifelse(data$Aneuploidy_score > as_cut, 0, 1)
data$`Tumor site` <- ifelse(data$`Tumor site` == "Right side", 0, 1)
}
return(data)
}
score <- c("CNAfingerprint", "HRDCNAScore", "scarHRD", "Aneuploidy_score", "Tumor site")
table <- data.frame(matrix(nrow = 12))
for (set in unique(all_test$batch)) {
data <- all_test[which(all_test$batch == set), ]
data <- score_turn(data, type = "hard")
for (i in score) {
df <- data[, c("lable", i)]
colnames(df)[2] <- "score"
cm <- caret::confusionMatrix(
data = factor(df$score, levels = c(0, 1)),
reference = factor(df$lable, levels = c(0, 1)),
positive = "1"
)
list <- as.data.frame(cm$byClass)
list["Accuracy", 1] <- cm$overall[1]
colnames(list)[1] <- paste0(set, "_", i)
table <- cbind(table, list)
}
}
table <- table[, -1]
saveRDS(table, file = "../data/allscore_CMtable.rds")
Heat maps
library(pheatmap)
allscore_CMtable <- readRDS("../data/allscore_CMtable.rds")
data <- t(allscore_CMtable[c(12, 1, 2, 5:7), ])
pheatmap(data,
cluster_rows = F,
cluster_cols = F,
angle_col = "45",
color = colorRampPalette(c("#3e92a3", "white", "#ff5335"))(100),
border_color = "white",
number_format = "%.2f",
display_numbers = round(data, 2),
legend = T
)
4.3 Compare to KRAS mutation
Limited by sample size, we will observe the predictive effect of KRAS mutations on chemotherapy response in all samples annotated with KRAS mutations in the 271 sample.
A total of 95 patients had KRAS mutations and 117 patients did not detect any KRAS,BRAF or NRAS mutations
data <- read.csv("../data/mut.csv")
data$type <- ifelse(data$Response.to.chemotherapy == "PD", 0, 1)
data$mut <- ifelse(data$Hotspot.mutation == "WT", 1, 0)
cm <- caret::confusionMatrix(
data = factor(data$type),
reference = factor(data$mut), positive = "1"
)
print(paste0("Accuracy of KRASmutation is ", cm$overall[1]))
[1] "Accuracy of KRASmutation is 0.518867924528302"
Sensitivity Specificity Pos Pred Value
0.5128205 0.5263158 0.5714286
Neg Pred Value Precision Recall
0.4672897 0.5714286 0.5128205
F1 Prevalence Detection Rate
0.5405405 0.5518868 0.2830189
Detection Prevalence Balanced Accuracy
0.4952830 0.5195682
PART 5: Overall survival
Exploring the Predictive Performance of CNA fingerprint for Patient Overall Survival.
Test3 cohort doesn’t have enough information for overall survival.
Overall Survival of test1 and test2 cohort
library(dplyr)
test_all <- read.csv("../data/all_test.csv", check.names = F)
data <- test_all %>%
filter(batch == "Test1") %>%
mutate(
prediction = factor(ifelse(CNAfingerprint > 0.58, "Responder", "Non-responder"),
levels = c("Non-responder", "Responder")
),
OS = OS / 30
)
fit <- survfit(Surv(`OS`, `OS_event`) ~ prediction, data = na.omit(data[, 13:15]))
ggsurvplot(
fit,
data = data[, 13:15],
size = 1,
conf.int = TRUE,
pval = TRUE,
pval.method = TRUE,
risk.table = TRUE,
palette = c("#3d6cb9", "#bc5148"),
risk.table.height = 0.25,
ggtheme = theme_classic2(),
xlab = "Months"
)
data <- test_all %>%
filter(batch == "Test2") %>%
mutate(
prediction = factor(ifelse(CNAfingerprint > 0.58, "Responder", "Non-responder"),
levels = c("Non-responder", "Responder")
),
OS = OS / 30
)
fit <- survfit(Surv(`OS`, `OS_event`) ~ prediction, data = na.omit(data[, 13:15]))
ggsurvplot(
fit,
data = data[, 13:15],
size = 1,
conf.int = TRUE,
pval = TRUE,
pval.method = TRUE,
risk.table = TRUE,
palette = c("#3d6cb9", "#bc5148"),
risk.table.height = 0.25,
ggtheme = theme_classic2(),
xlab = "Months"
)
Overall Survival of GSE36864 cohort
GSE36864, aCGH data of colorectal cancers of patients from 2 clinical trials (CAIRO, CAIRO2). 105 patients were treated with capecitabine first line (CAIRO arm A), 111 patients were treated with capecitabine and irinotecan first line (CAIRO arm B), and 133 patients were treated with capecitabine, oxaliplatin and bevacizumab (CAIRO2 arm A).
data <- readRDS("../data/GSE36864_clin.rds")
data$OS_event <- ifelse(data$`progression:ch1` == "yes", 1, 0)
colnames(data)[3] <- "OS"
data$OS <- as.numeric(data$OS) / 30
data$prediction <- ifelse(data$CNAfingerprint > 0.58, 1, 0)
Survival of patients with oxaliplatin-based regimens (GSE36864 cohort, CAIRO2 arm A).
df <- data[which(data$`study:ch1` == "CAIRO2"), ]
fit <- survfit(Surv(`OS`, `OS_event`) ~ prediction, data = na.omit(df))
ggsurvplot(
fit,
data = df,
size = 1,
conf.int = TRUE,
pval = TRUE,
pval.method = TRUE,
risk.table = TRUE,
palette = c("#3d6cb9", "#bc5148"),
risk.table.height = 0.25,
ggtheme = theme_classic2(),
xlab = "Months"
)
Survival of patients without oxaliplatin-based regimens (GSE36864 cohort, CAIRO).
df <- data[which(data$`study:ch1` == "CAIRO"), ]
fit <- survfit(Surv(`OS`, `OS_event`) ~ prediction, data = na.omit(df))
ggsurvplot(
fit,
data = df,
size = 1,
conf.int = TRUE,
pval = TRUE,
pval.method = TRUE,
risk.table = TRUE,
palette = c("#3d6cb9", "#bc5148"),
risk.table.height = 0.25,
ggtheme = theme_classic2(),
xlab = "Months"
)
Overall Survival of no-oxaliplatin cohort
noOxa_cohort <- readRDS("../data/noOxa_cohort.rds")
data <- na.omit(noOxa_cohort)
data$group <- factor(ifelse(data$CNAfingerprint > 0.58, "Responder", "Non-responder"),
levels = c("Non-responder", "Responder")
)
fit <- survfit(Surv(`OS`, `OS_event`) ~ group, data = data)
ggsurvplot(
fit,
data = data,
size = 1,
conf.int = TRUE,
pval = TRUE,
pval.method = TRUE,
risk.table = TRUE,
palette = c("#3d6cb9", "#bc5148"),
risk.table.height = 0.25,
ggtheme = theme_classic2(),
xlab = "Months"
)
Acknowledgments
We thank Fudan University Shanghai Cancer Center and Tongji hospital of Tongji University for sample collection;
We thank ShanghaiTech University High Performance Computing Public Service Platform for computing services. We thank multi-omics facility, molecular and cell biology core facility of ShanghaiTech University for technical help.
LICENSE
If you want to reuse the code in this report, please note the license below should be followed.
The code is made available for non commercial research purposes only under the MIT. However, notwithstanding any provision of the MIT License, the software currently may not be used for commercial purposes without explicit written permission after contacting Jinyu Wangwangjy10@shanghaitech.edu.cn or Tiyu Taotaozy@shanghaitech.edu.cn or Xue-Song Liu liuxs@shanghaitech.edu.cn.