Copy number alteration fingerprint predicts the clinical response of oxaliplatin-based chemotherapy in metastatic colorectal cancer
EQUAL CONTRIBUTION
J.W., J.W., Z.T. and T.W. contributed equally to this work as joint first authors.
X.L., X.L. and X.S.L contributed equally to this work as joint senior authors.
Introduction
Oxaliplatin-based chemotherapy is routinely used in metastatic colorectal cancer (mCRC) patient, however only about half of patients show clinical response after therapy. Accurate biomarkers for predicting oxaliplatin chemotherapy clinical response are still lacking. Here we developed and validated a tumor genomic DNA copy number alteration (CNA) based biomarker.
In total 271 mCRC patients who underwent primary tumor resection and postoperatively treated with oxaliplatin-based chemotherapy are included in this study. Tissue samples and their clinical data were obtained from the Fudan University Shanghai Cancer Center (FUSCC) and Tongji Hospital of Tongji University. The sample selection criteria are detailed here.
PART 1: Data preprocessing
To develop a stable, effective, and cost-efficient predictive biomarker for oxaliplatin-based chemotherapy response in metastatic colorectal cancer patients, we utilized shallow whole-genome sequencing(sWGS) data to extract copy number alteration(CNA) features. These features, combined with clinical characteristics, were used to construct a machine learning model, ultimately resulting in the development of a highly predictive model known as the CNA fingerprint.
1.1 Raw data processing
suppose i is the sample name. Using follow tools:
fastp
Quality control.
BWA MEM
Mapping.
Samtools
Transform sam to bam file.
1.2 CNA features calling
suppose you have got your bam file. Using follow tools:
QDNAseq
CBC segment,the defult genome is hg19,here we use hg38,and binsize as 100.
library(QDNAseq)
library(QDNAseq.hg38)
library(stringr)
library(ACE)
library(DNAcopy)
args <- commandArgs(trailingOnly = TRUE)
#args 1,sample name
#args 2,path to bam
#args 3,the output path
i <- args[1]
s <- paste0(i,"_sorted.bam")
#binsize: 100
bins <- getBinAnnotations(binSize=100,genome="hg38")
readCounts <- binReadCounts(bins,bamfile=paste0(args[2],"/",s))
readCountsFiltered <- applyFilters(readCounts,residual=TRUE, blacklist=TRUE)
readCountsFiltered <- estimateCorrection(readCountsFiltered)
copyNumbers <- correctBins(readCountsFiltered)
copyNumbersNormalized <- normalizeBins(copyNumbers)
copyNumbersSmooth <- smoothOutlierBins(copyNumbersNormalized)
copyNumbersSegmented <- segmentBins(copyNumbersSmooth, transformFun="log2")
copyNumbersSegmented <- normalizeSegmentedBins(copyNumbersSegmented)
save(copyNumbersSegmented,file="/your_path/copyNumbersSegmented.rda")
you can run like:
ACE
Fit the cellularity and ploidy
sqm <- squaremodel(copyNumbersSegmented, QDNAseqobjectsample = 1,
penalty = 0.65, penploidy = 0.65,
ptop = 5.3, pbottom = 1.8, prows = 250,
cellularities = seq(50, 100, length.out = 50),
sgc = c("X", "Y"),exclude=c())
# Here we get the purity ploidy and error values of all the fits, referring to the official ACE document that states “the minimum error is not necessarily the best fit”, it is recommended to select combinations with higher cellular purity in the similar minimum error set, and at the same time select combinations with higher ploidy based on the a priori knowledge of the genomes of colon cancer.
mdf <- sqm$minimadf
mdf <- mdf[order(mdf$error,-mdf$cellularity),]
# Manual screening and determination ploidy and cellularity,for example:
ploidy = 7.6
cellularity = 0.8
# Absoult CNA calling
m<-ACEcall(copyNumbersSegmented, QDNAseqobjectsample = TRUE, scellularity = cellularity,
ploidy = ploidy, standard =1, plot = FALSE,
qcutoff = -3, subcutoff = 0.2, trncname = FALSE,
cap = 12, bottom = 0,
onlyautosomes = TRUE
)
segment <- data.frame(copyNumbersSegmented@featureData@data[1:nrow(m),1:3],segment_mean=m$segment_mean)
segment <- na.omit(segment)
segment$segment_mean <- round(segment$segment_mean)
Sigminer
CN features calling
here we got the CNA segment files,first we do a filter,delect all segments with num_marks <10,make the file colnames as: chr,start,end segval,sample
We extracted CNA features described in our previous studies(2021 Wang et.al, 2023 Tao et.al) and also global CNA status indicator: CNA burden, amplification/deletion of each chromosome.
data <- readRDS("../data/exampleSeg.rds")
CNF_call <- function(data,note = NULL){
library(sigminer)
CN <- data[,c("chromosome", "start", "end", "segVal","sample")]
# Wang et al.
cn <- read_copynumber(CN,
seg_cols = c("chromosome", "start", "end", "segVal"),
samp_col = "sample",
genome_build = "hg38", complement = FALSE)
#
cn_sum <- cn@summary.per.sample
rownames(cn_sum) <- cn_sum$sample
tally_w <- sig_tally(cn, method = "W", cores = 10)
if(!is.null(note)){
saveRDS(tally_w,file = paste0("./data/results/",note,"_tally_W.rds"))
}
cn_fet <- tally_w[["nmf_matrix"]]
feature_m <- data.frame(cn_fet)
colnames(feature_m) <- colnames(cn_fet)
feature_m$sample <- rownames(cn_fet)
# if only one sample,rownames(cn_fet)will be null
colon_cnf <- merge(cn_sum,feature_m,by = "sample", all = T)
# Yao et al.
#HRD(cutoff 0.2)
library(HRDCNA)
score_yhz <- HRDprediction(data = feature_m)
colon_cnf <- merge(colon_cnf,score_yhz,by = "sample", all = T)
# Tao et al.
tally_X_noLOH <- sig_tally(cn,
method = "X",
add_loh = FALSE,
cores = 10
)
if(!is.null(note)){
saveRDS(tally_X_noLOH, file = paste0("./data/results/",note,"_tally_X_noLOH.rds"))
}
cn_fet <- tally_X_noLOH[["all_matrices"]][["simplified_matrix"]]
feature_m <- data.frame(cn_fet)
colnames(feature_m) <- colnames(cn_fet)
feature_m$sample <- rownames(cn_fet)
colon_cnf <- merge(colon_cnf,feature_m,by = "sample", all = T)
# global CNA status(amp and del states of every chr)
gCNA <- data.frame(matrix(nrow = 0,ncol = 44))
coln <- c()
for(i in 1:22) {
coln <- c(coln, paste0("chr", i, "[AMP]"), paste0("chr", i, "[DEL]"))
}
chrname <- paste0("chr",1:22)
for(i in unique(data$sample)){
df <- data[data$sample==i,]
onechr <- c(i)
for(j in chrname){
df1<- df[df$chromosome==j,]
if(any(df1$segVal >2)){
amp=1
}else{
amp=0
}
if(any(df1$segVal <2)){
del=1
}else{
del=0
}
onechr <- c(onechr,amp,del)
}
onechr <- data.frame(t(onechr))
colnames(onechr) <- c("sample",coln)
gCNA <- rbind(gCNA,onechr)
}
colon_cnf <- merge(colon_cnf,gCNA,by="sample")
return(colon_cnf)
}
features <- CNF_call(data,note = NULL)
saveRDS(features,file = "../data/examplefeatures.rds")
1.3 Feature distribution
TCGA data as an example, the distribution(W method and X method) of its features is demonstrated by function show_catalogue.
library(sigminer)
ace_tally_W <- readRDS("../data/ace_tally_W.rds")
tcga_tally_W <- readRDS("../data/tcga_tally_W.rds")
show_catalogue(t(ace_tally_W[, -1]), style = "cosmic", mode = "copynumber", x_label_angle = 90)
show_catalogue(t(tcga_tally_W[, -1]), style = "cosmic", mode = "copynumber", x_label_angle = 90)
ace_tally_X <- readRDS("../data/ace_tally_X.rds")
tcga_tally_X <- readRDS("../data/tcga_tally_X.rds")
show_catalogue(t(ace_tally_X[, -1]), style = "cosmic", mode = "copynumber", method = "T", x_label_angle = 90)
show_catalogue(t(tcga_tally_X[, -1]), style = "cosmic", mode = "copynumber", method = "T", x_label_angle = 90)
The CNA feature distribution in TCGA mCRCs.
The CNA feature distribution in Train cohort.
The CNA feature distribution in Train cohort.
The CNA feature distribution in Test1,Test2 and Test3 cohort can be presented in the same way.
PART 2: Model training
2.1 Internal benchmark
We performed internal benchmark to compare the performance of different CNA feature sets and 4 different machine learning models ( extreme gradient boosting (XGBoost), neural networks (NNET), random forest (RF), light gradient boosting machine (LGBM) ) in the training dataset.
The CNA information we compare here is the CNA signature reported by Drew et.al and Tao et.al.
Example calling process of CNA signature reported by Drews et.al.
# devtools::install_github("markowetzlab/CINSignatureQuantification", build_vignettes = TRUE, dependencies = TRUE)
# input format
# chromosome start end segVal sample
# 1 61735 249224388 3.1 TCGA-BT-A20P
library(tidyverse)
library(CINSignatureQuantification)
data <- readRDS("../data/exampleSeg.rds")
mySigs = quantifyCNSignatures(swgs_segment.ace_train)
D_sig <- mySigs@activities[["rawAct0"]]
#Low segment counts: Features not computed for 8 of 121 samples - see `getSampleFeatures()`
Example fitting process of CNA signature reported by Tao et.al.
# download pcawg_cn_sigs_CN136_SA from https://github.com/XSLiuLab/Pan-cancer_CNA_signature/data
examplefeatures <- readRDS("../data/examplefeatures.rds")
data <- examplefeatures[,c(89:224)]
m <- t(data)
colnames(m) <- examplefeatures$sample
T_sig <- sig_fit(as.matrix(m),
sig = pcawg_cn_sigs_CN136_SA$Signature,
method="SA",
return_class="matrix"
)
2.2 Model Training
We perform model parameter tuning via grid search, considering the vast number of combinations, it would be advisable to first conduct parameter searches for a subset, then utilize the optimized subset to search for other parameters, rather than running all parameters simultaneously.
Here, we omit the step of parameter searching and detailed data for other models, just directly present all the considered parameters.
- We use
scale_pos_weight_value
to address the issue of class imbalance. The weight is set as the sum of negative samples divided by the sum of positive samples.
library(xgboost)
library(caret)
library(Matrix)
train.mat <- readRDS("../data/train_mat.rds")
dtrain_mat <- train.mat
colnames(dtrain_mat)[1:(dim(dtrain_mat)[2] - 1)] <- paste0("feature", seq(1:(dim(dtrain_mat)[2] - 1)))
dtrain_input <- sparse.model.matrix(res ~ ., data = dtrain_mat)[, -1] %>%
as.matrix()
train_label <- dtrain_mat$res == "response"
control <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
scale_pos_weight_value <- sum(dtrain_mat3$res == "nonresponse") / sum(dtrain_mat3$res == "response")
param_grid <- expand.grid(
max_depth = c(2:15),
eta = c(1e-5, 1e-4, 0.001, 0.01, 0.1),
max_delta_step = c(1:30),
alpha = c(0, 0.1, 0.5),
lambda = c(0, 1, 0.5),
gamma = c(0, 1, 5, 10),
scale_pos_weight = scale_pos_weight_value,
colsample_bytree = seq(0.1, 1, 0.1),
subsample = seq(0.5, 1, 0.1),
min_child_weight = c(1:10)
)
paramlist <- apply(param_grid, 1, as.list)
for (i in seq_along(paramlist)) {
x <- paramlist[[i]]
tryCatch(
{
set.seed(333)
xgb_last <- xgboost(
data = dtrain_input,
label = train_label,
objective = "binary:logistic",
params = x %>% as.list(),
nrounds = 1000,
early_stopping_rounds = 20,
eval_metric = "auc"
)
imp <- xgb.importance(model = xgb_last) %>%
as.data.frame() %>%
dplyr::arrange(desc(Gain))
headimp <- head(imp, n = 20)
headimp$Feature
## ---------------------------------------------------
set.seed(333)
out <- xgb.cv(
data = dtrain_input[, colnames(dtrain_input) %in% headimp$Feature] %>% as.matrix(),
label = train_label,
params = x %>% as.list(),
nrounds = 1000,
showsd = TRUE,
booster = "gbtree",
print_every_n = 5,
early_stopping_rounds = 100,
metrics = c("auc", "aucpr"),
verbose = TRUE,
nfold = 5,
objective = "binary:logistic"
)
result <- out[["evaluation_log"]]
result$best_nround <- out$best_iteration
result$params <- out$params %>% list()
result$params_input <- list(x)
result$i <- i
print(x)
print(paste0("Here is ", i))
if (max(out$evaluation_log$test_auc_mean) > 0.80) {
saveRDS(
result,
file = paste0("../data/iteronehot_", i, "_tune_param.rds")
)
saveRDS(
out,
file = paste0("../data/iteronehot_", i, "_out.rds")
)
}
},
error = function(e) {
message("Error occurred with parameters: ", x, ". Error message: ", e$message)
list(error_occurred = TRUE, params = x)
}
)
}
2.3 5-fold CV
The best model is selected based on cross-validation results. The parameters of final model as followed.
cvmodel <- readRDS("../data/iteronehot_1587_out.rds")
print(paste0("Best iteration : ",cvmodel$best_iteration))
[1] "Best iteration : 11"
$nrounds
[1] 100
$max_depth
[1] 3
$eta
[1] 0.1
$max_delta_step
[1] 8
$alpha
[1] 0
$lambda
[1] 1
$gamma
[1] 5
$scale_pos_weight
[1] 1.2
$colsample_bytree
[1] 0.7
$subsample
[1] 0.6
$min_child_weight
[1] 1
Attention, it has been tested that even with the same seed number, there may still be differences in feature importance output between different versions of the tool. To ensure result reproducibility, it is recommended to run this section using the same version, or alternatively, directly apply the final model provided by us.
library(gridExtra)
library(ggplot2)
CV5_evaluate <- cvmodel[["evaluation_log"]] %>%
head(50)
## AUROC
# Fit a curve by calculating the starting and ending positions of the bar chart.
fit_y1 <- predict(loess(train_auc_mean ~ iter, data = CV5_evaluate))
fit_y2 <- predict(loess(test_auc_mean ~ iter, data = CV5_evaluate))
df_bar.auroc <- data.frame(
x = CV5_evaluate$iter,
y1 = fit_y1,
y2 = fit_y2,
bar_y1 = CV5_evaluate$train_auc_std,
bar_y2 = CV5_evaluate$test_auc_std
)
iterauc_plot <- ggplot(CV5_evaluate, aes(x = iter)) +
geom_smooth(aes(y = train_auc_mean, color = "Train"), method = "loess", se = FALSE, size = 0.5) +
geom_smooth(aes(y = test_auc_mean, color = "Test"), method = "loess", se = FALSE, size = 0.5) +
geom_point(aes(y = train_auc_mean), color = "grey", alpha = 0) +
geom_point(aes(y = test_auc_mean), color = "grey", alpha = 0) +
geom_segment(data = df_bar.auroc, aes(x = x, xend = x, y = y1 - bar_y1 / 2, yend = y1 + bar_y1 / 2), color = "grey", size = 0.5) +
geom_segment(data = df_bar.auroc, aes(x = x, xend = x, y = y2 - bar_y2 / 2, yend = y2 + bar_y2 / 2), color = "grey", size = 0.5) +
scale_color_manual(values = c("Train" = "#005691", "Test" = "#ff9a3c")) +
labs(x = "Iterations", y = "Area Under the ROC Curve", color = "Lines") +
theme_bw() +
ylim(0.6, 1) +
#scale_y_continuous(breaks = seq(0, 1, by = 0.2), limits = c(0, 1)) +
geom_vline(xintercept = 16, linetype = "dashed", color = "red")
## AUCPR
fit_y1_pr <- predict(loess(train_aucpr_mean ~ iter, data = CV5_evaluate))
fit_y2_pr <- predict(loess(test_aucpr_mean ~ iter, data = CV5_evaluate))
df_bar.aucpr <- data.frame(
x = CV5_evaluate$iter,
y1 = fit_y1,
y2 = fit_y2,
bar_y1 = CV5_evaluate$train_aucpr_std,
bar_y2 = CV5_evaluate$test_aucpr_std
)
iteraucpr_plot <- ggplot(CV5_evaluate, aes(x = iter)) +
geom_smooth(aes(y = train_aucpr_mean, color = "Train"), method = "loess", se = FALSE, linewidth = 0.5) +
geom_smooth(aes(y = test_aucpr_mean, color = "Test"), method = "loess", se = FALSE, linewidth = 0.5) +
geom_point(aes(y = train_aucpr_mean), color = "grey", alpha = 0) +
geom_point(aes(y = test_aucpr_mean), color = "grey", alpha = 0) +
geom_segment(data = df_bar.aucpr, aes(x = x, xend = x, y = y1 - bar_y1 / 2, yend = y1 + bar_y1 / 2), color = "grey", size = 0.5) +
geom_segment(data = df_bar.aucpr, aes(x = x, xend = x, y = y2 - bar_y2 / 2, yend = y2 + bar_y2 / 2), color = "grey", size = 0.5) +
scale_color_manual(values = c("Train" = "#005691", "Test" = "#ff9a3c")) +
labs(x = "Iterations", y = "Area Under the Precision-Recall Curve", color = "Lines") +
theme_bw() +
ylim(0.6, 1) +
#scale_y_continuous(breaks = seq(0, 1, by = 0.2), limits = c(0, 1)) +
geom_vline(xintercept = 16, linetype = "dashed", color = "red")
CV results. Red line indicated the best iteration. Grey line indicated the standard deviation.
library(xgboost)
dtrain_input <- readRDS("../data/dtrain_input.rds")
train_label <- readRDS("../data/train_label.rds")
set.seed(333)
xgb_last_all <- xgboost(
data = dtrain_input,
label = train_label,
objective = "binary:logistic",
params = cvparam[[12]][[11]] %>% as.list(),
nrounds = 1000,
early_stopping_rounds = 20,
eval_metric = c("auc")
)
saveRDS(xgb_last_all, file = "../data/xgb_last_all.rds")
2.4 Feature selection and Final model
The feature importances in the model are ranked based on gains.
rm(list = ls())
library(tidyverse)
library(xgboost)
library(ggplot2)
xgb_last_all <- readRDS("../data/xgb_last_all.rds")
importance_all <- xgb.importance(model = xgb_last_all) %>%
as.data.frame() %>%
dplyr::arrange(desc(Gain)) %>%
head(n = 20L) %>%
dplyr::mutate(
Feature = case_when(
Feature == "feature84" ~ "CN[>8]",
Feature == "feature15" ~ "cna_burden",
Feature == "feature7" ~ "AFP",
Feature == "feature12" ~ "chr[20]AMP",
Feature == "feature70" ~ "NC50[>7]",
Feature == "feature74" ~ "CNCP[3]",
Feature == "feature81" ~ "CN[2]",
Feature == "feature5" ~ "CA19-9",
Feature == "feature42" ~ "L:LL:9+:BB",
Feature == "feature60" ~ "E:HH:2:BB",
Feature == "feature75" ~ "CNCP[>4 & <=8]",
Feature == "feature52" ~ "E:LL:9+:BB",
Feature == "feature47" ~ "E:LL:3:AA",
Feature == "feature59" ~ "E:HH:2:AA",
Feature == "feature3有" ~ "Cancerous node",
Feature == "feature53" ~ "E:LL:5-8:BB",
Feature == "feature73" ~ "CNCP[1]",
Feature == "feature38" ~ "L:HH:2:AA",
Feature == "feature83" ~ "CN[>4 & <=8]",
Feature == "feature16" ~ "SS[>8]",
.default = Feature
)
)
[17:53:02] WARNING: src/learner.cc:553:
If you are loading a serialized model (like pickle in Python, RDS in R) generated by
older XGBoost, please export the model by calling `Booster.save_model` from that version
first, then load it back in current version. See:
https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
for more details about differences between saving model and serializing.
import_allgg <- ggplot(importance_all, aes(x = Gain, y = reorder(Feature, Gain))) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(x = "Feature importance", y = "") +
theme_bw() +
theme(
panel.grid.major = element_blank(),
panel.background = element_blank()
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
import_allgg
Next, we identify optimal number of features in prediction model. For combination of topN features (N range from 2 to 20), we calculated AUC from 50 times repeated five-fold cross-validation.
imp <- xgb.importance(model = xgb_last_all) %>%
as.data.frame() %>%
dplyr::arrange(desc(Gain))
rp_auc_df <- data.frame()
for(rp in 1:50){
set.seed(rp)
auc_df <- data.frame()
for(i in 2:20){
cv_results <- xgb.cv(
data = dtrain_input[, imp$Feature[1:i]] %>% as.matrix(),
label = train_label,
objective = "binary:logistic",
params = cvparam[[12]][[11]] %>% as.list(),
nrounds = 100,
nfold = 5,
early_stopping_rounds = 20,
metrics = "auc",
verbose = TRUE,
seed = rp
)
# Ignore data from the first 5 epoch to minimize error
averageauc <- mean(tail(cv_results$evaluation_log$test_auc_mean, -5))
averageaucsd <- mean(tail(cv_results$evaluation_log$test_auc_std, -5))
df <- data.frame(rp=rp,rank=i,auc=averageauc,sd=averageaucsd)
auc_df <- rbind(auc_df,df)
print(i)
}
rp_auc_df <- rbind(rp_auc_df,auc_df)
print(rp)
}
save(rp_auc_df,file = "../data/rp_auc_df.rda")
The feature combination with the highest average AUC values in the training dataset
load("../data/rp_auc_df.rda")
ggplot(rp_auc_df, aes(x = factor(rank), y = auc)) +
geom_violin(trim = FALSE, fill = "#3fbac2") +
geom_boxplot(width = 0.1, position = position_dodge(0.9), color = "#4d606e") +
labs(title = "Mean AUC of 5 fold CV(50 times) for different fatures",
x = "Rank",
y = "RO AUC") +
theme_bw()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
When N = 8, the model has the best average auc and sd. At this point, we get the final model.
set.seed(123)
xgb_last <- xgboost(
data = dtrain_input[, imp$Feature[1:8]] %>% as.matrix(),
label = train_label,
objective = "binary:logistic",
params = cvparam[[12]][[11]] %>% as.list(),
nrounds = 1000,
early_stopping_rounds = 100,
eval_metric = c("auc","aucpr")
)
saveRDS(xgb_last, file = "../data/xgb_last.rds")
2.5 Distribution of important features
Distribution of important features in different datasets.
library(ggpubr)
library(patchwork)
library(ggsci)
feature <- c("CN[>8]","E:LL:9+:BB","L:LL:9+:BB","batch","lable")
train <- read.csv("../data/Train.csv",check.names = F) %>%
select(all_of(feature))
train$batch <- "Train"
test <- read.csv("../data/all_test.csv",check.names = F) %>%
select(all_of(feature))
df <- rbind(train,test)
df$lable <- ifelse(df$lable==1,"response","Non response")
df$batch <- factor(df$batch,levels = c("Train","Test1","Test2","Test3"))
plots <- list()
for (col in names(df)[1:3]) {
p<-ggplot(df,aes_string(x = "batch", y = paste0("`", col, "`"), color = "lable"))+
geom_boxplot(aes(fill=lable),
alpha=0.1,outlier.shape = NA)+
geom_jitter(position = position_jitterdodge(jitter.height=0.3,
jitter.width = 0.3,
dodge.width = 0.5),
size = 0.4)+
scale_color_manual(values = c("#1e56a0","#f95959"))+
scale_fill_manual(values = c("#1e56a0","#f95959"))+
theme_bw()+
theme(panel.grid = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1))+
stat_compare_means(aes(group = lable),
method ="wilcox.test",
label="p.signif",
show.legend = F)+
xlab("")
plots[[col]] <- p
}
combined_plot <- wrap_plots(plots, ncol = 3)+
plot_layout(guides = "collect") & theme(legend.position = "top")
combined_plot
PART 3: Model evaluation
Final Model Evaluation
The performance of CNA fingerprint on three independent test sets.
AUC,PRAUC
Defining figure plotting functions.
library(precrec)
library(dplyr)
library(ggplot2)
library(ggpubr)
library(pROC)
preform <- function(score,i){
score <- score[which(!is.na(score[,i])),]
score$lable <- factor(score$lable,levels = c(0,1))
pre_obj1 <- mmdata(score[,i], score$lable,posclass = 1)
pre_obj1 <- evalmod(pre_obj1)
auc <- precrec::auc(pre_obj1)
pre1_df <- fortify(pre_obj1)
#if(i=="prediction"){i="OXACNAscore"}
if(i=="Aneuploidy_score"){i="Aneuploidy Score"}
#if(i=="ploidy"){i="WGD"}
pre1_df$Dataset <- i
pre1_df$auc <- round(auc$aucs[1],2)
pre1_df$prc <- round(auc$aucs[2],2)
return(pre1_df)
}
jyplot <- function(list,type){
num <- data.frame(matrix(ncol = 3, nrow = length(list)))
for (i in 1:length(list)) {
num[i,1] <- list[[i]]$Dataset[1]
num[i,2] <- list[[i]]$auc[1]
num[i,3] <- list[[i]]$prc[1]
}
#
performance_df <- Reduce(rbind, list)
cols <- c("CNAfingerprint" = "#ea5455",
"HRDCNAScore" = "#3fbac2",
"scarHRD" = "#cabbe9",
"Aneuploidy Score"="#769fcd")
if(type=="auc"){
roc <- performance_df[performance_df$curvetype == "ROC",]
p <- ggplot(roc, aes(x=x, y=y, group = Dataset)) +
theme_bw() +
geom_line(aes(color = Dataset),linewidth=1.2) +
xlab("1-Specificity") +
ylab("Sensitivity") +
theme(plot.title = element_text(hjust = 0.5),
panel.grid=element_blank(),
# line = element_line(color = "black", linewidth = 1.5,
# linetype = 1, lineend = "butt"),
axis.text.x = element_text(size=14,color = "black"),
axis.text.y = element_text(size=14,color = "black"),
axis.line = element_line(colour="black"),
legend.position = "none",
title=element_text(size=14),
) +
scale_color_manual(values = cols)+
guides(color=guide_legend(title="Dataset"))+
annotate("segment", x = 0, y = 0,xend = 1, yend = 1,
color = "gray", size = 0.5, linetype = "dashed")+
annotate("text",x = .6, y = .26,size=3, colour="#ea5455",
label=paste(num[1,1],"RO AUC =",num[1,2])) +
annotate("text",x = .6, y = .19,size=3, colour="#3fbac2",
label=paste(num[2,1],"RO AUC =",num[2,2]))+
annotate("text",x = .6, y = .12,size=3, colour="#cabbe9",
label=paste(num[3,1],"RO AUC =",num[3,2]))+
annotate("text",x = .6, y = .05,size=3, colour="#769fcd",
label=paste(num[4,1],"RO AUC =",num[4,2]))
}else if(type=="prc"){
prc <- performance_df[performance_df$curvetype == "PRC",]
p <- ggplot(prc, aes(x=x, y=y, group = Dataset)) +
theme_bw() +
geom_line(aes(color = Dataset),linewidth=1)+
xlab("Recall") +
ylab("Precision") +
theme(plot.title = element_text(hjust = 0.5),
panel.grid=element_blank(),
# line = element_line(color = "black", linewidth = 1.5,
# linetype = 1, lineend = "butt"),
axis.text.x = element_text(size=14,color = "black"),
axis.text.y = element_text(size=14,color = "black"),
axis.line = element_line(colour="black"),
legend.position = "none",
title=element_text(size=14),
legend.text = element_text(size=14),
legend.title = element_text(size = 14))+
coord_cartesian(xlim = c(0,1), ylim = c(0,1))+
scale_color_manual(values = cols)+
annotate("segment", x = 0, y = 1,xend = 1, yend = 0,
color = "gray", size = 0.5, linetype = "dashed")+
annotate("text",x = .4, y = .26,size=3, colour="#ea5455",
label=paste(num[1,1],"PR AUC =",num[1,3])) +
annotate("text",x = .4, y = .19,size=3, colour="#3fbac2",
label=paste(num[2,1],"PR AUC =",num[2,3]))+
annotate("text",x = .4, y = .12,size=3, colour="#cabbe9",
label=paste(num[3,1],"PR AUC =",num[3,3]))+
annotate("text",x = .4, y = .05,size=3, colour="#769fcd",
label=paste(num[4,1],"PR AUC =",num[4,3]))
}
return(p)
}
score_turn <- function(data,type){
if(type == "soft"){
#data$ploidy <- max(data$ploidy)-data$ploidy
data$HRDCNAScore <- 1-data$HRDCNAScore
data$scarHRD <- max(data$scarHRD)-data$scarHRD
data$Aneuploidy_score <- (max(data$Aneuploidy_score)-data$Aneuploidy_score)/39
}
if(type== "hard"){
data$CNAfingerprint <- ifelse(data$CNAfingerprint>0.33,1,0)
#data$ploidy <- ifelse(data$ploidy>=4,0,1)
data$HRDCNAScore <- ifelse(data$HRDCNAScore>0.2,0,1)
data$scarHRD <- (data$scarHRD- min(data$scarHRD))/(max(data$scarHRD)-min(data$scarHRD))
data$scarHRD <- 1-data$scarHRD
scarcut <- quantile(data$scarHRD,probs=0.75)
data$scarHRD <- ifelse(data$scarHRD > scarcut,1,0)
#
data$Aneuploidy_score <- data$Aneuploidy_score/39
as_cut <- quantile(data$Aneuploidy_score,probs=0.75)
data$Aneuploidy_score <- ifelse(data$Aneuploidy_score >as_cut,0,1)
data$`Tumor site` <- ifelse(data$`Tumor site`== 'Right side',0,1)
}
return(data)
}
Take the example of calculating the RO AUC and PR AUC for Test1 cohort.
data <- read.csv("../data/all_test.csv",check.names = F)
#
data <- data[which(data$batch=="Test1"),]
data_auc <- score_turn(data,"soft")
preform_list <- list()
for(i in c("CNAfingerprint","HRDCNAScore","scarHRD","Aneuploidy_score")){
a <- preform(data_auc,i)
preform_list <- c(preform_list, list(a))
}
#auc prc
library(patchwork)
p <- jyplot(preform_list,"auc")
p2 <- jyplot(preform_list,"prc")
p + p2
The ROAUC and PRAUC for all test cohorts.
for no-oxaliplatin cohort,we also get the AUC.
noOxa_cohort <- readRDS("../data/noOxa_cohort.rds")
data <- na.omit(noOxa_cohort)
data$lable <- ifelse(data$lable=="PD",0,1)
plot.roc(
data$lable,
data$CNAfingerprint,
col = "#3d84a8",
percent = TRUE,
lwd = 2.5,
print.auc = TRUE,
print.auc.cex = 0.7,
print.auc.pattern = paste0("CNA fingerprint",": %.1f%%"),
print.auc.y = 35
)
PART 4: Biomarker comparison
The performance of CNA fingerprint in oxaliplatin response prediction was compared with: –HRD status –KRAS mutation –aneuploidy –primary tumor location.
4.1 HRD status and aneuploidy
We use R package HRDCNA(Using the threshold values 0.2 recommended by the tool guideline. see 1.2 CNA features calling, function CNF_call) and scarHRD to assess the HRD status score;
# library(devtools)
# install_bitbucket('sequenza_tools/sequenza')
# install_github('sztup/scarHRD',build_vignettes = TRUE)
# Note that this R package installation is detailed at https://github.com/sztup/scarHRD
library(scarHRD)
#input format
#SampleID Chromosome Start_position End_position total_cn A_cn B_cn ploidy
#SamplePatient1 chr1 14574 952448 5 0 5
exampleSeg <- readRDS("../data/exampleSeg.rds")
# Assuming a ploidy of 5
ploidy=5
data <- data.frame(exampleSeg[,c(5,1:4)],
A_cn= exampleSeg$segVal,
B_cn = 0,
ploidy = ploidy)
colnames(data) <- c("SampleID","Chromosome","Start_position","End_position",
"total_cn","A_cn","B_cn","ploidy")
data %>%
group_by(SampleID) %>%
group_split() %>%
walk(~ {
group_name <- unique(.x$SampleID)
# save as csv
write.table(.x, file = paste0("../data/scarHRD/",group_name, ".txt"),
row.names = FALSE,quote = F,sep = "\t")
})
csv_files <- list.files(path = "../data/scarHRD/",
pattern = "\\.txt$", full.names = TRUE)
score <- data.frame()
for(file in csv_files){
sample <- gsub(".txt","",basename(file))
scar_train <- scar_score(file,reference = "grch38", seqz=FALSE)
df <- data.frame(scar_train,sample=sample)
score <- rbind(score,df)
}
# "sample" "HRD" "Telomeric AI" "LST" "HRD-sum"
# "sample" 0 13 4 17
R package AneuploidyScore was used to calculate the aneuploidy score (total number of arm level gains/losses).
Note that this R package only supports hg19, so we first need to perform a genome version conversion
# devtools::install_github("quevedor2/aneuploidy_score", ref='master')
library(GenomicRanges)
library(liftOver)
library(AneuploidyScore)
cytoarm <- cytobandToArm(ucsc.hg19.cytoband)
ch <- import.chain('../data/hg38ToHg19.over.chain')
exampleSeg <- readRDS("../data/exampleSeg.rds")
# Assuming a ploidy of 5
ploidy=5
out <- data.frame()
for (i in unique(exampleSeg$sample)) {
segdata <- exampleSeg[exampleSeg$sample==i,]
#ploidy=pp[which(pp$sample==i),"polidy"]
ploidy=5
seg <- GRanges(
#seqnames = paste0("chr",segdata$Chromosome),
seqnames = segdata$chromosome,
ranges = IRanges(start = segdata$start,end = segdata$end),
strand = "*",
TCN = segdata$segVal
)
hg19.seg <- liftOver(seg, ch)
seg_caa <- getCAA(hg19.seg, cytoarm, tcn_col="TCN", classifyCN=TRUE,
ploidy=ploidy, threshold=0.5)
caa <- reduceArms(seg_caa, caa_method=c('arm', 'seg'),
arm_ids = c('p', 'q'))
result <- data.frame(sample=i,
t(caa[,1]),
aneuploidy_score = colSums(abs(caa), na.rm = TRUE))
out <- rbind(out,result[1,])
}
#Removal of sex chromosomes
aneuploidy_score <- rowSums(abs(out[,2:45]),na.rm = TRUE)
aneuploidy_score <- data.frame(sample=out$sample,AS=aneuploidy_score)
4.2 Comparison and application
library(caret)
all_test <- read.csv("../data/all_test.csv",check.names = F)
# from change
score_turn <- function(data,type){
if(type == "soft"){
data$HRDCNAScore <- 1-data$HRDCNAScore
data$scarHRD <- max(data$scarHRD)-data$scarHRD
data$Aneuploidy_score <- (max(data$Aneuploidy_score)-data$Aneuploidy_score)/39
}
if(type== "hard"){
data$CNAfingerprint <- ifelse(data$CNAfingerprint>0.33,1,0)
data$HRDCNAScore <- ifelse(data$HRDCNAScore>0.2,0,1)
data$scarHRD <- (data$scarHRD- min(data$scarHRD))/(max(data$scarHRD)-min(data$scarHRD))
data$scarHRD <- 1-data$scarHRD
scarcut <- quantile(data$scarHRD,probs=0.75)
data$scarHRD <- ifelse(data$scarHRD > scarcut,1,0)
#
data$Aneuploidy_score <- data$Aneuploidy_score/39
as_cut <- quantile(data$Aneuploidy_score,probs=0.75)
data$Aneuploidy_score <- ifelse(data$Aneuploidy_score > as_cut,0,1)
data$`Tumor site` <- ifelse(data$`Tumor site`== 'Right side',0,1)
}
return(data)
}
score <- c("CNAfingerprint","HRDCNAScore","scarHRD","Aneuploidy_score","Tumor site")
table <- data.frame(matrix(nrow = 12))
for (set in unique(all_test$batch)) {
data <- all_test[which(all_test$batch==set),]
data <- score_turn(data,type="hard")
for (i in score) {
df <- data[,c("lable",i)]
colnames(df)[2] <- "score"
cm <- caret::confusionMatrix(data=factor(df$score,levels = c(0,1)),
reference=factor(df$lable,levels = c(0,1)),
positive="1")
list <- as.data.frame(cm$byClass)
list["Accuracy",1] <- cm$overall[1]
colnames(list)[1] <- paste0(set,"_",i)
table <- cbind(table,list)
}
}
table <- table[,-1]
saveRDS(table,file = "../data/allscore_CMtable.rds")
Heat maps
library(pheatmap)
allscore_CMtable <- readRDS("../data/allscore_CMtable.rds")
data <- t(allscore_CMtable[c(12,1,2,5:7),])
pheatmap(data,
cluster_rows=F,
cluster_cols = F,
angle_col = "45",
color = colorRampPalette(c("#3e92a3", "white", "#ff5335"))(100),
border_color = "white",
number_format = "%.2f",
display_numbers = round(data, 2),
legend=T
)
4.3 Compare to KRAS mutation
Limited by sample size, we will observe the predictive effect of KRAS mutations on chemotherapy response in all samples annotated with KRAS mutations in the 271 sample.
A total of 95 patients had KRAS mutations and 117 patients did not detect any KRAS,BRAF or NRAS mutations
data <- read.csv("../data/mut.csv")
data$type <- ifelse(data$Response.to.chemotherapy=="PD",0,1)
data$mut <- ifelse(data$Hotspot.mutation=="WT",1,0)
cm <- caret::confusionMatrix(data=factor(data$type),
reference=factor(data$mut),positive="1")
print(paste0("Accuracy of KRASmutation is ",cm$overall[1]))
[1] "Accuracy of KRASmutation is 0.518867924528302"
Sensitivity Specificity Pos Pred Value
0.5128205 0.5263158 0.5714286
Neg Pred Value Precision Recall
0.4672897 0.5714286 0.5128205
F1 Prevalence Detection Rate
0.5405405 0.5518868 0.2830189
Detection Prevalence Balanced Accuracy
0.4952830 0.5195682
PART 5: Overall survival
Exploring the Predictive Performance of CNA fingerprint for Patient Overall Survival.
Test3 cohort doesn’t have enough information for overall survival.
Overall Survival of test1 and test2 cohort
library(dplyr)
test_all <- read.csv("../data/all_test.csv",check.names = F)
data <- test_all[,-1] %>%
filter(batch=="Test1") %>%
mutate(prediction = factor(ifelse(CNAfingerprint > 0.33,"High CNAfingerprint","Low CNAfingerprint"),
levels = c("Low CNAfingerprint","High CNAfingerprint")),
OS = OS/30)
fit <- survfit(Surv(`OS`, `OS_event`) ~ prediction, data =na.omit(data[,c(21:23)]))
ggsurvplot(
fit,
data = data[,c(21:23)],
size = 1,
conf.int = TRUE,
pval = TRUE,
pval.method = TRUE,
risk.table = TRUE,
palette =c("#3d6cb9", "#bc5148"),
risk.table.height = 0.25,
ggtheme = theme_classic2(),
xlab = "Months"
)
data <- test_all[,-1] %>%
filter(batch=="Test2") %>%
mutate(prediction = factor(ifelse(CNAfingerprint > 0.33,"High CNAfingerprint","Low CNAfingerprint"),
levels = c("Low CNAfingerprint","High CNAfingerprint")),
OS = OS/30)
fit <- survfit(Surv(`OS`, `OS_event`) ~ prediction, data =na.omit(data[,c(21:23)]))
ggsurvplot(
fit,
data = data[,c(21:23)],
size = 1,
conf.int = TRUE,
pval = TRUE,
pval.method = TRUE,
risk.table = TRUE,
palette =c("#3d6cb9", "#bc5148"),
risk.table.height = 0.25,
ggtheme = theme_classic2(),
xlab = "Months"
)
Overall Survival of no-oxaliplatin cohort
noOxa_cohort <- readRDS("../data/noOxa_cohort.rds")
data <- na.omit(noOxa_cohort)
data$group <- factor(ifelse(data$CNAfingerprint > 0.33,"High CNAfingerprint","Low CNAfingerprint"),
levels = c("Low CNAfingerprint","High CNAfingerprint"))
fit <- survfit(Surv(`OS`, `OS_event`) ~ group, data =data)
ggsurvplot(
fit,
data = data,
size = 1,
conf.int = TRUE,
pval = TRUE,
pval.method = TRUE,
risk.table = TRUE,
palette =c("#3d6cb9", "#bc5148"),
risk.table.height = 0.25,
ggtheme = theme_classic2(),
xlab = "Months"
)
Acknowledgments
We thank Fudan University Shanghai Cancer Center and Tongji hospital of Tongji University for sample collection;
We thank ShanghaiTech University High Performance Computing Public Service Platform for computing services. We thank multi-omics facility, molecular and cell biology core facility of ShanghaiTech University for technical help.
LICENSE
If you want to reuse the code in this report, please note the license below should be followed.
The code is made available for non commercial research purposes only under the MIT. However, notwithstanding any provision of the MIT License, the software currently may not be used for commercial purposes without explicit written permission after contacting Jinyu Wangwangjy10@shanghaitech.edu.cn or Tiyu Taotaozy@shanghaitech.edu.cn or Xue-Song Liu liuxs@shanghaitech.edu.cn.