TLimmuno2: predicting MHC class II antigen immunogenicity through transfer learning
This report is written to help readers understand what and how we did in this project. Please read the formal manuscript <TLimmuno2: predicting MHC class II antigen immunogenicity through transfer learning> for more details.
This document is compiled from an Rmarkdown file which contains all code or description necessary to (auto-)reproduce the analysis for the accompanying project. Each section below describes a different component of the analysis and all numbers and figures are generated directly from the underlying data on compilation.
Dependencies
This project is depended on Python software, Python packages, R software, R packages and other tools :
Python:
- pandas - flexible and easy to use data analysis and manipulation tool
- numpy - a fundamental package for scientific computing with Python
- tensorflow - a open source platform for machine learning
- sklearn - machine learning in Python
- matplotlib - plot fig in python.
- seaborn - visualization library based on matplotlib
- sys,os,re - some base Python packages
R:
- reticulate - R Interface to Python
- NeoEnrichment - do neoantigen enrichment for this project.
- tidyverse - tidy data
- data.table - read and tidy data
- survival - survival analysis
- survminer - plot survival fit
- DT - show data table as a table in html
- patchwork - arrange figures into complex compound figures
- knitr, rmdformats - use to compile this file
Shell:
- netMHCIIpan(IEDB) - use to predict peptide binding affinity for MHCII molecules(download form IEBD, contain BA and EL method)
- IEDB tools - predict the allele independent CD4 T cell immunogenicity at population level
- MixMHC2pred - a pan-allele predictor of MHC class II ligands and epitopes
- Repitope - a structured framework of quantitative prediction of immunogenicity
- annovar - annotate genetic variants detected from diverse genomes
Summary figure

summary figure
Basic information
Here are some common scripts used in our project:
Of note, some step was done on Linux server provided by High Performance Computing(HPC) Service of ShanghaiTech University in order to reduce run time. So some scripts was sunmitted by using Portable Bash Script (PBS) scripts, we will mark them in the header by using
#PBS
. Also, there are mixture of R and Python, if readers want to reproduce this work, please focus on the key code lines.
1.PBS script format: we change name
,
out file path
and python file
in different
work in order to submit it into HPC. File submitted by PBS will not be
compiled by this report.
#PBS -N <name>
#PBS -o <out file path>
#PBS -e <out file path>
#PBS -l nodes=1:ppn=1
#PBS -l walltime=20000:00:00
#PBS -S /bin/bash
#PBS -q pub_gpu
if [ -f "/public/slst/home/wanggsh/miniconda3/etc/profile.d/conda.sh" ]; then
. "/public/slst/home/wanggsh/miniconda3/etc/profile.d/conda.sh"
else
export PATH="/public/slst/home/wanggsh/miniconda3/bin:$PATH"
fi
conda activate tensorflow
python <Python file>
2.BLOSUM62 encoder: we use BLOSUM62 matrix to encoder peptides and MHC pseudo-sequence into numeric. Blosum62 Matrix were download from NCBI website. The animo acid alphabet is ACDEFGHIKLMNPQRSTVWYX, X is the general wildcard character. You can find these content in USCS and Mathworks. The following is the script:
#python
import pandas as pd
import numpy as np
= pd.read_csv(r"~/Data/MHCII/BLOSUM62.csv",comment="#")
Blosum62_matrix = list("ARNDCQEGHILKMFPSTWYVX")
Protein_alphabet = Blosum62_matrix[Protein_alphabet]
Blosum62_matrix = Blosum62_matrix.loc[Protein_alphabet]
Blosum62_matrix
Blosum62_matrix
def blosum62(peptide,maxlen):
= np.empty((maxlen,21))
encoder if len(peptide) <=maxlen:
= peptide + "X"*(maxlen-len(peptide))
peptide for i in range(len(peptide)):
= list(peptide)[i]
pep = Blosum62_matrix[pep]
coder = coder
encoder[i] return encoder.flatten()
if __name__ == "__main__":
blosum62(peptide,maxlen)
3.Creating a batch of PBS job:In some steps, the data is to large,
even the HPC need a lot of time to finish. So we create a batch of PBS
work to speed up the analysis process. These script were marked by using
PBS_batch
in the header.File submitted by PBS will not be
compiled by this report.
#python
import os
= <file path>
file_path file = os.listdir(file_path)
= <pbs path>
pbs_path for i in file:
= r"""#PBS -N {0}
pbs_text #PBS -o <out file path>
#PBS -e <out file path>
#PBS -l nodes=1:ppn=20
#PBS -l walltime=20000:00:00
#PBS -S /bin/bash
#PBS -q pub_gpu
if [ -f "/public/slst/home/wanggsh/miniconda3/etc/profile.d/conda.sh" ]; then
. "/public/slst/home/wanggsh/miniconda3/etc/profile.d/conda.sh"
else
export PATH="/public/slst/home/wanggsh/miniconda3/bin:$PATH"
fi
conda activate tensorflow
python <Python file>
""".format(i)
"echo \"{0}\" > {1}/{2}.pbs".format(pbs_text,pbs_path,i))
os.system("qsub {0}/{1}.pbs".format(pbs_path,i))
os.system(print("{} submitted".format(i))
4.Background sequence: TLimmuno2 outputs a continuous variable between 0 and 1 and also percentile rank. we download all human peptide sequence from UniPort database(2022.3) and randomly sampling peptides. We generated a set of 90,000 13-21-mer random natural peptides (10,000 of each length). For each peptide-MHC II pair, percentile ranks were computed based on background sequence, smaller rank indicates stronger immunogenicity. The result maybe different as the reason of random select.
#python
import pandas as pd
import numpy as np
import random
#uniport data
=open("../data/uniport_human.fasta")
f=[]
lsfor line in f:
if not line.startswith('>'):
'\n',''))
ls.append(line.replace(
f.close()= pd.DataFrame(ls)
uniport "length"] = uniport[0].map(len)
uniport[bool = uniport["length"]>30
= uniport[bool]
uniport bool = uniport[0].apply(lambda x: set(list(x)).issubset(set(list("ARNDCQEGHILKMFPSTWYVX"))))
= uniport[bool]
uniport = uniport[0].values
uniport_random def random_select(x,length,num,allele):
= []
df all = []
for i in range(num):
= random.randint(0,len(x)-1)
protein_num = x[protein_num]
protein = random.randint(0,len(protein)-length)
peptide_num = protein[peptide_num:peptide_num+length]
peptide
df.append(peptide)all.append(allele)
return df,all
= []
Pep for i in range(13,21):
all = random_select(uniport_random,i,10000,"ALL")
pep,
Pep.extend(pep)= pd.DataFrame({"pep":Pep})
IMM_bg_pep #IMM_bg_pep.to_csv("../data/IMM_bg_pep.csv")
5.ROC significant: we calculated the p-value between the two ROC’s by Delong’s test.
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as st
from sklearn import metrics
class DelongTest():
def __init__(self,preds1,preds2,label,threshold=0.05):
'''
preds1:the output of model1
preds2:the output of model2
label :the actual label
'''
self._preds1=preds1
self._preds2=preds2
self._label=label
self.threshold=threshold
self.show_result()
def _auc(self,X, Y)->float:
return 1/(len(X)*len(Y)) * sum([self._kernel(x, y) for x in X for y in Y])
def _kernel(self,X, Y)->float:
'''
Mann-Whitney statistic
'''
return .5 if Y==X else int(Y < X)
def _structural_components(self,X, Y)->list:
= [1/len(Y) * sum([self._kernel(x, y) for y in Y]) for x in X]
V10 = [1/len(X) * sum([self._kernel(x, y) for x in X]) for y in Y]
V01 return V10, V01
def _get_S_entry(self,V_A, V_B, auc_A, auc_B)->float:
return 1/(len(V_A)-1) * sum([(a-auc_A)*(b-auc_B) for a,b in zip(V_A, V_B)])
def _z_score(self,var_A, var_B, covar_AB, auc_A, auc_B):
return (auc_A - auc_B)/((var_A + var_B - 2*covar_AB )**(.5)+ 1e-8)
def _group_preds_by_label(self,preds, actual)->list:
= [p for (p, a) in zip(preds, actual) if a]
X = [p for (p, a) in zip(preds, actual) if not a]
Y return X, Y
def compute_z_p(self):
= self._group_preds_by_label(self._preds1, self._label)
X_A, Y_A = self._group_preds_by_label(self._preds2, self._label)
X_B, Y_B
= self._structural_components(X_A, Y_A)
V_A10, V_A01 = self._structural_components(X_B, Y_B)
V_B10, V_B01
= self._auc(X_A, Y_A)
auc_A = self._auc(X_B, Y_B)
auc_B
# Compute entries of covariance matrix S (covar_AB = covar_BA)
= (self._get_S_entry(V_A10, V_A10, auc_A, auc_A) * 1/len(V_A10)+ self._get_S_entry(V_A01, V_A01, auc_A, auc_A) * 1/len(V_A01))
var_A = (self._get_S_entry(V_B10, V_B10, auc_B, auc_B) * 1/len(V_B10)+ self._get_S_entry(V_B01, V_B01, auc_B, auc_B) * 1/len(V_B01))
var_B = (self._get_S_entry(V_A10, V_B10, auc_A, auc_B) * 1/len(V_A10)+ self._get_S_entry(V_A01, V_B01, auc_A, auc_B) * 1/len(V_A01))
covar_AB
# Two tailed test
= self._z_score(var_A, var_B, covar_AB, auc_A, auc_B)
z = st.norm.sf(abs(z))*2
p
return z,p
def show_result(self):
=self.compute_z_p()
z,preturn z,p
#print(f"z score = {z:.5f};\np value = {p:.5f};")
#if p < self.threshold :print("There is a significant difference")
#else: print("There is NO significant difference")
The construction process of the BA model
This part focuses on Binding affinity (BA) model construction. it contains the following sections:
- Training data - Data download and process
- BA model - constructed BA model by LSTM
- BA model perforemance - Compared the performance of BA model with other models on independent dataset.
Training data
Data download
The same binding affinity data used for training NetMHCIIpan were used to train our model,you can find the data in IEDB, the details are described as below:
1.data processing: combine & drop duplicates
#R
<- "../data/NetMHCIIpan_train/"
path <- function(name,path) {
read_file = tibble()
df for (i in 1:5) {
<- data.table::fread(str_c(path,name,i,".txt"),data.table = F,header = F)
x <- rbind(df,x)
df
}return(df)
}<- read_file("train_BA",path)
train_BA <- read_file("test_BA",path)
test_BA
<- data.table::fread("../data/pseudosequence.2016.all.X.dat",data.table = F,header = F)#pseudo seq for MHC
pseudo_sequence <- rbind(train_BA,test_BA)
BA <- BA %>% filter(str_detect(V3,"D")) %>% select(V1,V2,V3)
BA_filtered colnames(pseudo_sequence) <- c("MHC","sequence")
colnames(BA_filtered) <- c("peptide","IC50","MHC")
<- left_join(BA_filtered,pseudo_sequence)
BA_filtered_MHC #> Joining, by = "MHC"
#na.omit(BA_filter_MHC)
<- unique(BA_filtered_MHC)
BA_filtered_MHC dim(BA_filtered_MHC)
#> [1] 107008 4
length(table(unique(BA_filtered_MHC$MHC)))
#> [1] 71
These data comprise 107,008 measurements of peptide–MHC binding affinity covering 71 types of class II MHCs from humans.
2.change normalized IC50 to binary label: IC50 is normalized by log50k transformed binding affinity (i.e. 1 - log50k( aff nM)) method, and a threshold of 500 nM is used. This means that peptides with log50k transformed binding affinity values greater than 0.426 are classified as binders(true label: 1)
#R
"label"] = ifelse(BA_filtered_MHC$IC50>0.426,1,0) #set positive and negative
BA_filtered_MHC[write_csv(BA_filtered_MHC,"../data/BA_MHC_label.csv")
<- unique(BA_filtered_MHC$MHC)
BA_allele write_lines(BA_allele,"../data/BA_allele.csv")
Show data feature
#python
= pd.read_csv("../data/BA_MHC_label.csv")
BA "length"] = BA["peptide"].map(len)
BA[= BA[BA["label"] == 1]
BA_positive = BA[BA["label"] == 0]
BA_negative
def MHC_dis_plot(Data,color,colname):
= Data[colname].value_counts().values
value = Data[colname].value_counts().index
index = value/sum(value)
norm_value = list(index[0:15])
a = list(norm_value[0:15])
b "other ({})".format(len(value)-15))
a.append(sum(norm_value[15:]))
b.append(= plt.subplots(figsize = (8,7))
fig,ax = color)
plt.bar(a,b,color 0,0.45])
plt.ylim([0.00,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40])
plt.yticks([= 90) plt.xticks(rotation
We show the distribution of allele in different labels:
#python
"green","MHC")
MHC_dis_plot(BA_positive,
plt.tight_layout()"../figure/BA_MHC_positive.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
#python
"red","MHC")
MHC_dis_plot(BA_negative,
plt.tight_layout()"../figure/BA_MHC_negative.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
The converted affinity value of negative samples is around 0.2, while that of positive samples is around 0.6:
#python
= BA,x="label",y="IC50")
sns.boxplot(data -0.1,1.1])
plt.ylim([#> (-0.1, 1.1)
"Label")
plt.xlabel("Binding affinity")
plt.ylabel(0,1.2,0.2))
plt.yticks(np.arange(#> ([<matplotlib.axis.YTick object at 0x28e989970>, <matplotlib.axis.YTick object at 0x28e989670>, <matplotlib.axis.YTick object at 0x28eddcd00>, <matplotlib.axis.YTick object at 0x28ebdeeb0>, <matplotlib.axis.YTick object at 0x28e991940>, <matplotlib.axis.YTick object at 0x28e991df0>], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
0,1),labels = ["Negative","Positive"])
plt.xticks((#> ([<matplotlib.axis.XTick object at 0x28ebe3850>, <matplotlib.axis.XTick object at 0x28ebe3670>], [Text(0, 0, 'Negative'), Text(1, 0, 'Positive')])
"../figure/BA_label_boxplot.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
15mer peptides account for the largest proportion of Binding affinity data.
#python
= list(BA_positive["length"].value_counts().values)
pos_len_value = BA_positive["length"].value_counts().index
pos_len_index = BA_positive["length"].value_counts().values/sum(BA_positive["length"].value_counts())
norm_pos_len_value = list(BA_negative["length"].value_counts().values)
neg_len_value = BA_negative["length"].value_counts().index
neg_len_index = BA_negative["length"].value_counts().values/sum(BA_negative["length"].value_counts())
norm_neg_len_value = plt.subplots()
fig,ax = 0.35
width = pos_len_index
x - width/2, norm_pos_len_value,width,label='immunogenic',color = "green")
ax.bar(pos_len_index #> <BarContainer object of 9 artists>
+ width/2, norm_neg_len_value, width, label='noimmunogenic',color = "red")
ax.bar(neg_len_index #> <BarContainer object of 9 artists>
'proportion')
ax.set_ylabel('length')
ax.set_xlabel('Distribution of Length')
ax.set_title(
ax.set_xticks(x)
ax.legend()"../figure/length_dis_label.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
BA model training
Unlike NetMHCIIpan4.0, we use the LSTM layer to construct our model. Antigen sequences and MHC pseudo-sequences are used as inputs. We used two LSTM layers with output sizes of 128,64 on top of the antigen and MHC inputs. The LSTM outputs of antigen and MHC are concatenated in the same layer to form a 128-dimensional vector. This layer is followed by three dense layers with 100, 50, and 10 neurons activated by RELU, and the last output layer is a single neuron dense layer activated by sigmoid.
The input of this model is the peptides sequence and the MHC sequence. The output is whether the peptides bind to the MHC molecule or not. We used 5-fold cross-validation and the training results are shown in training histrory.
Structure
Here are the script that are used to train the BA model, this job is done on HPC:
#PBS
#python
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split,KFold
from sklearn.metrics import auc,precision_recall_curve,roc_curve,confusion_matrix
import sys
"/public/slst/home/wanggsh/Project/MHCII/utils")
sys.path.append(from NN_data import NN_Data
from Blosum62 import blosum62
import matplotlib.pyplot as plt
from collections import Counter
import os
= 2
version
= pd.read_csv("/public/slst/home/wanggsh/Data/MHCII/netMHCpanII4.0_train_data/BA_data/BA_MHC_label.csv")
BA = BA.drop_duplicates()
BA = BA.reset_index(drop=True)
BA #pep
"pep_blosum"] = BA["peptide"].apply(blosum62,args=(21,))
BA["MHC_blosum"] = BA["sequence"].apply(blosum62,args=(34,))
BA[= np.empty((len(BA),21,21))
peptide for i in range(len(BA)):
= BA["pep_blosum"][i].reshape((21,21))
peptide[i] #mhc
= np.empty((len(BA),34,21))
MHC for i in range(len(BA)):
= BA["MHC_blosum"][i].reshape((34,21))
MHC[i] #score
= BA["score"].values
labels
#class weight
= np.bincount(BA["score"])
neg, pos = neg + pos
total print('Examples:\n Total: {}\n Positive: {} ({:.2f}% of total)\n'.format(
100 * pos / total))
total, pos, = (1 / neg) * (total / 2.0)
weight_for_0 = (1 / pos) * (total / 2.0)
weight_for_1 = {0: weight_for_0, 1: weight_for_1}
class_weight print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))
#split test
= train_test_split(peptide,MHC,labels,test_size=0.1,random_state=202205,stratify=labels)#random_state=202205
pep_Train,pep_test,MHC_Train,MHC_test,label_Train,label_test print(Counter(label_Train))
print(Counter(label_test))
def ELmodel():
= tf.keras.Input(shape = (21,21))
inputA = tf.keras.Input(shape = (34,21))
inputB
= tf.keras.layers.LSTM(128,return_sequences=True)(inputA)
x = tf.keras.layers.LSTM(64,return_sequences=True)(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.Model(inputs = inputA,outputs = x)
x
= tf.keras.layers.LSTM(128,return_sequences=True)(inputB)
y = tf.keras.layers.LSTM(64,return_sequences=True)(y)
y = tf.keras.layers.Flatten()(y)
y = tf.keras.Model(inputs = inputB,outputs = y)
y
= tf.keras.layers.concatenate([x.output,y.output])
combined
= tf.keras.layers.Dense(100,activation="relu")(combined)
z = tf.keras.layers.Dense(50,activation = "relu")(z)
z = tf.keras.layers.Dense(10,activation = "relu")(z)
z = tf.keras.layers.Dense(1,activation = "sigmoid")(z)
z
= tf.keras.Model(inputs = [inputA,inputB],outputs = z)
model
return model
= tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
callback = keras.losses.BinaryCrossentropy()
Loss = keras.optimizers.Adam(learning_rate = 0.00008)
Optimizer = [tf.keras.metrics.AUC(),tf.keras.metrics.BinaryAccuracy(),tf.keras.metrics.AUC(curve = "PR"),tf.keras.metrics.Precision()]
Metrics = 512
Batch_size= 100
Epochs= 2 #reduce information output
Verbose
print("Cross Validation")
= KFold(n_splits=5,shuffle = True)
kf = {}
ROC = {}
PR = 1
i for train_index,val_index in kf.split(label_Train):
= pep_Train[train_index],MHC_Train[train_index],label_Train[train_index]
pep_train,MHC_train,label_train = pep_Train[val_index],MHC_Train[val_index],label_Train[val_index]
pep_val,MHC_val,label_val print(Counter(label_train))
print(Counter(label_val))
= ELmodel()
model compile(
model.= Loss,
loss = Optimizer,
optimizer = Metrics)
metrics #callback
= model.fit([pep_Train,MHC_Train],label_Train,
history =Batch_size,epochs=Epochs,
batch_size= ([pep_val,MHC_val],label_val),
validation_data = callback,verbose = Verbose,
callbacks =class_weight)
class_weight= pd.DataFrame(history.history)
His_df "/public/slst/home/wanggsh/Project/Saved_model/BA_model_history/corss_history{0}.csv".format(i))
His_df.to_csv(= model.predict([pep_val,MHC_val])
prediction = roc_curve(label_val,prediction,pos_label=1)
fpr,tpr,thresholds = auc(fpr,tpr)
area_mine "flod{}".format(i)] = [fpr,tpr,thresholds,area_mine]
ROC[print("=============={} auc data finish==========".format(i))
= precision_recall_curve(label_val,prediction,pos_label=1)
precision,recall,thresholds = auc(recall,precision)
area_PR "flod{}".format(i)] = [recall,precision,thresholds,area_PR]
PR[print("=============={} pr data finish==========".format(i))
= i+1
i
#----plot cross fig
= pd.DataFrame(ROC)
ROC_df = plt.figure()
fig = 2
lw = fig.add_subplot(111)
ax 0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot([for i in range(5):
0,i],ROC_df.iloc[1,i],label='flod{0} (area = {1:.4f})'.format(i,ROC_df.iloc[3,i]))
ax.plot(ROC_df.iloc[0,1.05])
ax.set_ylim([0,1])
ax.set_xlim(['False Positive Rate')
ax.set_xlabel('True Positive Rate')
ax.set_ylabel('Receiver operating characteristic example')
ax.set_title(="lower right")
ax.legend(loc"/public/slst/home/wanggsh/Project/Saved_model/Model_fig/BA_model/BA_ROC_cross.pdf",dpi = 300)
plt.savefig(= pd.DataFrame(PR)
PR_df
plt.figure()= 2
lw for i in range(5):
0,i],PR_df.iloc[1,i],lw=lw, label = 'flod{0} (area = {1:.4f})'.format(i,PR_df.iloc[3,i]))
plt.plot(PR_df.iloc[0.0, 1.0])
plt.xlim([0.0, 1.05])
plt.ylim(['Recall')
plt.xlabel('Precision')
plt.ylabel('PR curve')
plt.title(="lower right")
plt.legend(loc"/public/slst/home/wanggsh/Project/Saved_model/Model_fig/BA_model/BA_PR_cross.pdf",dpi = 300)
plt.savefig(print("start model training")
= ELmodel()
model
compile(
model.= Loss,
loss = Optimizer,
optimizer = Metrics)
metrics #callback
= model.fit([pep_Train,MHC_Train],label_Train,
history =Batch_size,epochs=Epochs,
batch_size= ([pep_test,MHC_test],label_test),
validation_data = callback,verbose = Verbose,
callbacks =class_weight)
class_weightr"/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
model.save(
print("start plt fig")
= pd.DataFrame(history.history)
His_df "/public/slst/home/wanggsh/Project/Saved_model/BA_model_history/history.csv")
His_df.to_csv(
= model.predict([pep_test,MHC_test])
prediction = model.predict([pep_Train,MHC_Train])
Train_pre = roc_curve(label_test,prediction,pos_label=1)
fpr,tpr,thresholds = roc_curve(label_Train,Train_pre,pos_label=1)
train_fpr,train_tpr,_ = auc(fpr,tpr)
area_mine = auc(train_fpr,train_tpr)
area_mine_train print(area_mine)
print(area_mine_train)
= precision_recall_curve(label_test,prediction,pos_label=1)
precision,recall,thresholds = precision_recall_curve(label_Train,Train_pre,pos_label=1)
precision_train,recall_train,_ = auc(recall,precision)
area_PR = auc(recall_train,precision_train)
area_PR_train
= plt.figure()
fig = 2
lw = fig.add_subplot(111)
ax 0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot([='Train AUC = (area = {0:.4f})'.format(area_mine_train))
ax.plot(train_fpr,train_tpr,label='Test AUC = (area = {0:.4f})'.format(area_mine))
ax.plot(fpr,tpr,label0,1.05])
ax.set_ylim([0,1])
ax.set_xlim(['False Positive Rate')
ax.set_xlabel('True Positive Rate')
ax.set_ylabel('Receiver operating characteristic example')
ax.set_title(="lower right")
ax.legend(loc"/public/slst/home/wanggsh/Project/Saved_model/Model_fig/BA_model/ROC.pdf",dpi = 300)
plt.savefig(
plt.figure()= 2
lw =lw, label = 'Train AUC = (area = {0:.4f})'.format(area_PR_train))
plt.plot(recall_train,precision_train,lw=lw, label = 'Test AUC = (area = {0:.4f})'.format(area_PR))
plt.plot(recall,precision,lw0.0, 1.0])
plt.xlim([0.0, 1.05])
plt.ylim(['Recall')
plt.xlabel('Precision')
plt.ylabel('PR curve example')
plt.title(="lower right")
plt.legend(loc
"/public/slst/home/wanggsh/Project/Saved_model/Model_fig/BA_model/PR.pdf",dpi = 300)
plt.savefig(
print("finish")
Train history fig
We use 5-fold cross validation to validate our model. Cross validation: The training results of the model are very stable, the average AUC of ROC is 0.97, and the average AUC of PR is 0.96.

BA corss ROC

BA cross PR
held out dataset:

BA ROC

BA PR
BA model perforemance
We compare the performance of our BA with other models on an independent dataset. we compare netMHCIIpan 4.0 BA and mixMHC2pred.
Benchmark data preprecess
We collected independent datasets Automated from Andreatta et al. and drop duplicated data with training dataset. we also change binding affinity into binary label by log50k transformed method.
#R
<- read_table("../data/mhcii_initial_benchmark_datasets.tsv")
Dataset <- Dataset %>% filter(measurement_type == "binary")
Dataset_binary <- Dataset %>% filter(measurement_type == "ic50")
Dataset_ic50 <- function(x){
fun if(x<=500){y=1}
else{y=0}
return(y)
}<- sapply(Dataset_ic50$measurement_value, fun)
quativate $measurement_value <- quativate
Dataset_ic50<- rbind(Dataset_binary,Dataset_ic50)
Dataset_final <- read_table("../data/pseudosequence.2016.all.X.dat",col_names = c("allele","sequence"))
pseduo_seq <- function(x){
fun if (str_detect(x,"\\/")){
= str_remove_all(x,"\\*")
x = str_remove_all(x,"\\:")
x = str_replace(x,"\\/","-")
x
}else{
= str_replace(x,"HLA-","")
x = str_replace(x,"\\*","_")
x = str_replace(x,"\\:","")
x
}return(x)
}$allele1 = sapply(Dataset_final$allele,fun)
Dataset_final<- left_join(Dataset_final,pseduo_seq, by = c("allele1" = "allele"))
Dataset_final <- na.omit(Dataset_final)
Dataset_final $length <- sapply(Dataset_final$peptide_sequence,nchar)
Dataset_final<- Dataset_final %>% filter(length>=13&length<=21)
Dataset_final <- read_csv("../data/BA_MHC_label.csv")
BA <- anti_join(Dataset_final,BA,by = c("peptide_sequence" = "peptide","allele1"="MHC")) #Delete the data contained in the training dataset
Automated_filter write_csv(Dataset_final,"../data/Automated_benchmark.csv")
write_csv(Automated_filter,"../data/Automated_benchmark_filter.csv")
The Automated dataset are shown below:
::datatable(Automated_filter,rownames = FALSE) DT
Result
The process of using the different tools is shown below, all processes are done on HPC.
BAmodel:
#python
import tensorflow as tf
import pandas as pd
import numpy as np
= tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model")
model = pd.read_feather("/public/slst/home/wanggsh/Data/MHCII/Benchmark/Automated_benchmark_filter.feather")
Data = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII/netMHCpanII4.0_train_data/BA_data/BA_allele.csv",header=None,names=["allele"])
BA_allele bool = Data["Allele Name1"].map(lambda X : X in BA_allele["allele"].values)
= Data[bool]
Data print(len(Data))
print("Data finish")
= Data.reset_index(drop = True)
Data = np.empty((len(Data),21,21))
peptide for i in range(len(Data)):
= Data["pep_blosum"][i].reshape((21,21))
peptide[i] #mhc
= np.empty((len(Data),34,21))
MHC for i in range(len(Data)):
= Data["MHC_blosum"][i].reshape((34,21))
MHC[i]
= Data['label'].values
label = model.predict([peptide,MHC]).flatten()
Y "prediction"] = Y
Data[
= Data["Allele Name1"].unique()
allele = Data
BA_result "/public/slst/home/wanggsh/Data/MHCII/Benchmark/BA_model/BA_Automated_prediction_result.csv") BA_result.to_csv(
NetMHCIIpan:
#python
import pandas as pd
import os
import sys
= "/public/slst/home/wanggsh/Data/MHCII/Benchmark"
result_file_path = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII/Benchmark/Automated_benchmark_filter.csv")
EL = range(len(EL))
EL.index = EL["Allele Name"].unique()
allele = pd.DataFrame()
PD = "python /public/slst/home/wanggsh/biosoft/mhc_ii/mhc_II_binding.py netmhciipan_ba"
comd = "{0}/pep.csv".format(result_file_path)
file_path
for i in allele:
= EL[EL["Allele Name"] == i]
aEL = aEL["length"].unique()
length for l in length:
= aEL[EL["length"] == l]
lEL "Description"].to_csv("{0}".format(file_path),header = False,index = False)
lEL["{0} {1} {2} {3} > {4}/result.txt".format(comd,i,file_path,l,result_file_path))
os.system(= pd.read_table("{0}/result.txt".format(result_file_path))
result = result.append(PD)
PD"{0}/netMHCIIpan_Auto_filter_ba_result.csv".format(result_file_path))
PD.to_csv(print("finish")
mixMHC2pred:
#python
import pandas as pd
import os
import sys
import re
= "/public/slst/home/wanggsh/Data/MHCII/Benchmark"
result_file_path = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII/Benchmark/Automated_benchmark_filter.csv")
EL def allele_change(x):
if len(x) < 10:
= x[0:7]+"_"+x[7:]
x else :
= re.sub("HLA-","",x)
x = x[0:4]+"_"+x[4:6]+"_"+x[6:13]+"_"+x[13:15]+"_"+x[15:]
x = re.sub("-","__",x)
x return x
"new_allele"] = EL["Allele Name1"].map(allele_change)
EL["/public/slst/home/wanggsh/Data/MHCII/Benchmark/mixMHC2pred_Automated.csv")
EL.to_csv(= EL["new_allele"].unique()
allele = pd.DataFrame()
PD = "/public/slst/home/wanggsh/biosoft/MixMHC2pred-master/MixMHC2pred_unix --no_context"
comd = "{0}/pep.csv".format(result_file_path)
file_path = "{0}/result.txt".format(result_file_path)
result
for i in allele:
= EL[EL["new_allele"] == i]
aEL "Description"].to_csv("{0}".format(file_path),header = False,index = False)
aEL["{0} -i {1} -o {2} -a {3}".format(comd,file_path,result,i))
os.system(= pd.read_table("{0}".format(result),comment="#")
result_file = pd.concat([result_file,PD])
PD
"{0}/mixMHC2pred_Auto_filter_ba_result.csv".format(result_file_path))
PD.to_csv(print("finish")
Plt ROC fig
The area under the curve (AUC) of the receiver operating characteristic (ROC) is used as the evaluation index for model comparations:
#python
= pd.read_csv("../data/Automated_benchmark.csv")
Automate = pd.read_csv("../data/BA_Automated_prediction_result.csv")
BA_result = roc_curve(BA_result["label"],BA_result["prediction"],pos_label=1)
BA_fpr,BA_tpr,_ = auc(BA_fpr,BA_tpr)
BA_roc = pd.read_csv("../data/netMHCIIpan_Auto_filter_ba_result.csv")
netMHC_result = netMHC_result[["allele","percentile_rank","peptide","ic50"]]
netMHC_result = Automate[["allele","allele1","peptide_sequence","measurement_value"]]
Automate = pd.merge(netMHC_result,Automate,left_on=["allele","peptide"],right_on=["allele","peptide_sequence"]).drop_duplicates()
netMHC_result = pd.read_csv("../data/BA_allele.csv",header=None,names=["allele"])
BA_allele bool = netMHC_result["allele1"].map(lambda x : x in BA_allele["allele"].values)
= netMHC_result[bool]
netMHC_result = roc_curve(netMHC_result["measurement_value"],1-netMHC_result["percentile_rank"],pos_label=1)
net_fpr,net_tpr,_ = auc(net_fpr,net_tpr)
net_roc = pd.read_csv("../data/mixMHC2pred_Auto_filter_ba_result.csv")
mix_result = pd.read_csv("../data/mixMHC2pred_Automated.csv")
mix_data = pd.merge(mix_result[["Peptide","BestAllele","%Rank_best"]],mix_data[["Description","new_allele","label","Allele Name1"]],left_on =["Peptide","BestAllele"],right_on=["Description","new_allele"]).drop_duplicates()
mix_result = mix_result[mix_result["Allele Name1"].isin(BA_allele["allele"])]
mix_result = roc_curve(mix_result["label"],1-mix_result["%Rank_best"],pos_label=1)
mix_fpr,mix_tpr,_ = auc(mix_fpr,mix_tpr)
mix_roc = plt.get_cmap('Set1')
cmap = [mpl.colors.rgb2hex(cmap(i)[:3]) for i in range(3)]
colors = plt.subplots(figsize = (8,8))
fig,ax = 2
lw =lw,color = colors[0],label="BA_model (AUC = %0.4f)" % BA_roc,)
plt.plot(BA_fpr,BA_tpr,lw#> [<matplotlib.lines.Line2D object at 0x28ee48910>]
=lw,color = colors[1],label="NetMHCIIpan_BA (AUC = %0.4f)" % net_roc,)
plt.plot(net_fpr,net_tpr,lw#> [<matplotlib.lines.Line2D object at 0x28ee48640>]
=lw,color = colors[2],label="mixMHC2pred (AUC = %0.4f)" % mix_roc,)
plt.plot(mix_fpr,mix_tpr,lw#> [<matplotlib.lines.Line2D object at 0x17e91c910>]
0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.plot([#> [<matplotlib.lines.Line2D object at 0x17e91cb80>]
0.0, 1.0])
plt.xlim([#> (0.0, 1.0)
0.0, 1.05])
plt.ylim([#> (0.0, 1.05)
"False Positive Rate",fontsize = 14,fontname='Arial')
plt.xlabel(#> Text(0.5, 0, 'False Positive Rate')
"True Positive Rate",fontsize = 14)
plt.ylabel(#> Text(0, 0.5, 'True Positive Rate')
"")
plt.title(#> Text(0.5, 1.0, '')
="lower right",fontsize = 12)
plt.legend(loc#> <matplotlib.legend.Legend object at 0x28d2be3a0>
"../figure/BA_model_benchmark.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
The BA model (AUC: 0.7742) shows better performance than MixMHC2pred (AUC: 0.6482) and has comparable performance with NetMHCIIpan (AUC: 0.7965).
Next we plot the confusion matrix:
from collections import Counter
= plt.subplots(figsize = (8,8))
fig,ax = Automate,x = "measurement_value")
sns.countplot(data #> <AxesSubplot:xlabel='measurement_value', ylabel='count'>
-0.1, 10950, "{:.3f}".format(Counter(Automate["measurement_value"])[0]/len(Automate)))
ax.text(#> Text(-0.1, 10950, '0.500')
0.9, 10950, "{:.3f}".format(Counter(Automate["measurement_value"])[1]/len(Automate)))
ax.text(#> Text(0.9, 10950, '0.500')
"Number")
plt.ylabel(#> Text(0, 0.5, 'Number')
"label")
plt.xlabel(#> Text(0.5, 0, 'label')
"The distribution of different label in Automate")
plt.title(#> Text(0.5, 1.0, 'The distribution of different label in Automate')
"../figure/Automate_label_dis.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
#python
"pre_label"] = BA_result["prediction"].map(lambda x : 1 if x>0.426 else 0)
BA_result[= confusion_matrix(BA_result["label"],BA_result["pre_label"])
BA_CM = BA_CM/len(BA_result)
BA_CM "pre_label"] = netMHC_result["percentile_rank"].map(lambda x : 1 if x <2 else 0)
netMHC_result[= confusion_matrix(netMHC_result["measurement_value"],netMHC_result["pre_label"])
netMHC_CM = netMHC_CM/len(netMHC_result)
netMHC_CM "pre_label"] = mix_result["%Rank_best"].map(lambda x : 1 if x <2 else 0)
mix_result[= confusion_matrix(mix_result["label"],mix_result["pre_label"])
mix_CM = mix_CM/len(mix_result) mix_CM
import math
def call_metrics(CM):
= CM.ravel()[0]
tn = CM.ravel()[1]
fp = CM.ravel()[2]
fn = CM.ravel()[3]
tp = tp/(tp+fn)
se = tn/(tn+fp)
sp = tp/(tp+fp)
ppv = tn/(tn+fn)
npv = 2*tp/(2*tp+fp+fn)
f_score= math.sqrt((se-1)**2+(sp-1)**2+(ppv-1)**2+(npv-1)**2)
DOP
return se,sp,ppv,npv,f_score,DOP
call_metrics(BA_CM)#> (0.6652916912197997, 0.7387634040611453, 0.747352162400706, 0.6552003237555645, 0.7039384807232673, 0.6024881389754401)
call_metrics(netMHC_CM)#> (0.0754679802955665, 0.993837023510614, 0.9341463414634146, 0.4813177094848552, 0.1396536007292616, 1.0621513360038004)
call_metrics(mix_CM)#> (0.05004926108374384, 0.9917826980141521, 0.8758620689655172, 0.47403447523456244, 0.09468779123951537, 1.0929427659227045)
= plt.subplots(figsize = (8,6))
fig,ax =True,cmap="YlGnBu",vmin=0, vmax=0.6)
sns.heatmap(BA_CM, annot#> <AxesSubplot:>
"True label")
plt.ylabel(#> Text(70.72222222222221, 0.5, 'True label')
"Prediction label")
plt.xlabel(#> Text(0.5, 36.72222222222221, 'Prediction label')
"BA model confusion matrix(threshold:0.426)")
plt.title(#> Text(0.5, 1.0, 'BA model confusion matrix(threshold:0.426)')
"../figure/BA_CM.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
#python
= plt.subplots(figsize = (8,6))
fig,ax =True,cmap="YlGnBu",vmin=0, vmax=0.6)
sns.heatmap(netMHC_CM, annot#> <AxesSubplot:>
"True label")
plt.ylabel(#> Text(70.72222222222221, 0.5, 'True label')
"Prediction label")
plt.xlabel(#> Text(0.5, 36.72222222222221, 'Prediction label')
"NetMHCIIpan_BA confusion matrix(Rank:10%)")
plt.title(#> Text(0.5, 1.0, 'NetMHCIIpan_BA confusion matrix(Rank:10%)')
"../figure/netMHCIIpan_CM.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
#python
= plt.subplots(figsize = (8,6))
fig,ax =True,cmap="YlGnBu",vmin=0, vmax=0.6)
sns.heatmap(mix_CM,annot #> <AxesSubplot:>
"True label")
plt.ylabel(#> Text(70.72222222222221, 0.5, 'True label')
"Prediction label")
plt.xlabel(#> Text(0.5, 36.72222222222221, 'Prediction label')
"MixMHC2pred confusion matrix(Rank:10%)")
plt.title(#> Text(0.5, 1.0, 'MixMHC2pred confusion matrix(Rank:10%)')
"../figure/mixMHC2pred_CM.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
By using IEDB recommend cutoff 2%rank and 0.426 (500nm, log50k transformed), we found NetMHCIIpan and MixMHC2pred tend to predict the result as negative while our BA model can correctly distinguish between negatives and positives
another dataset
TLimmno2: MHC II immunogenicity prediction network based on transfer learning
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
'pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42
mpl.rcParams['font.family':'Arial'})
plt.rcParams.update({import seaborn as sns
import numpy as np
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.metrics import roc_curve,auc,confusion_matrix,precision_score,recall_score,precision_recall_curve
import matplotlib as mpl
= "~/Project/Documenet/TLIMM2pred/data/"
path import random
This part focuses on TLimmuno2 model construction. it contains the following sections:
- Training data - Data download and process
- BA model - constructed BA model by LSTM
- BA model perforemance - Compared the performance of BA model with other models on independent dataset.
Data preprocessing
Immunogenicity molecular assays were downloaded from the Immune Epitope Database (IEDB) on May 15, 2022. The following keywords were used: Linear peptide; T cell; Include positive and negative; MHC II; Human and any disease. We next used strict criteria for data cleaning. First, data instances without explicit 4-digit MHC alleles were discarded. Second, the length of the peptides is limited in the range of 13-21 mer. Third, the data with low confidence (no information in number of subjects tested/responded) were removed. Finally, we chose five experiment types, “51 chromium”, “ELISA”, “ELISPOT”, “ICS” and “multimer/tetramer” to train model as the reason of we think these are more reliable. The rest of experiment types is used for model benchmark. In the final, we got 3930 positive instances and 2478 negative instances for training, and 105 positive instances and 4 negative instances for benchmark.
#R
<- read_csv("../data/IEDB_MHCII.csv")
IEDB_immuno_MHCII <- read_table("../data/pseudosequence.2016.all.X.dat",col_names = c("allele","sequence"))
pseduo_seq colnames(IEDB_immuno_MHCII) <- IEDB_immuno_MHCII[1,]
<- IEDB_immuno_MHCII[-1,]
IEDB_immuno_MHCII <- IEDB_immuno_MHCII[,c(12,85,88,102,91,92,93)] #4:pubmed ID
IEDB_immuno_MHCII_filter $`Qualitative Measure` <- str_replace(IEDB_immuno_MHCII_filter$`Qualitative Measure`,"Positive-High","Positive")
IEDB_immuno_MHCII_filter$`Qualitative Measure` <- str_replace(IEDB_immuno_MHCII_filter$`Qualitative Measure`,"Positive-Intermediate","Positive")
IEDB_immuno_MHCII_filter$`Qualitative Measure` <- str_replace(IEDB_immuno_MHCII_filter$`Qualitative Measure`,"Positive-Low","Positive")
IEDB_immuno_MHCII_filter<- IEDB_immuno_MHCII_filter[!(duplicated(IEDB_immuno_MHCII_filter)),]
IEDB_immuno_MHCII_filter $length <- sapply(IEDB_immuno_MHCII_filter$Description, nchar)
IEDB_immuno_MHCII_filter#add method ratio
<- IEDB_immuno_MHCII_filter %>% filter(nchar(`Allele Name`)>=14) %>% filter(length>=13&length<=21) %>% group_by(`Method/Technique`) %>% summarise(num = n())
IEDB_immuno_assay ggplot(data = IEDB_immuno_assay)+
geom_col(aes(x = reorder(`Method/Technique`,-num),y=num))+xlab("Expirement Type")+ylab("Number") + theme(axis.text.x = element_text(angle=90, hjust=1, vjust=.5))
ggsave("../figure/IEDB_assay_type.pdf")
<- IEDB_immuno_MHCII_filter %>% filter(nchar(`Allele Name`)>=14) %>% filter(length>=13&length<=21) %>%
IEDB_immuno_MHCII_filter1 filter(`Method/Technique`=="51 chromium"|`Method/Technique`=="ELISA"|`Method/Technique`=="ELISPOT"|`Method/Technique`=="ICS"|`Method/Technique`=="multimer/tetramer") %>% filter(str_detect(`Allele Name`,"HLA"))
<- function(x){
fun if (str_detect(x,"\\/")){
= str_remove_all(x,"\\*")
x = str_remove_all(x,"\\:")
x = str_replace(x,"\\/","-")
x
}else{
= str_replace(x,"HLA-","")
x = str_replace(x,"\\*","_")
x = str_replace(x,"\\:","")
x
}return(x)
}$`Allele Name1` <- sapply(IEDB_immuno_MHCII_filter1$`Allele Name`,fun)
IEDB_immuno_MHCII_filter1<- left_join(IEDB_immuno_MHCII_filter1,pseduo_seq, by = c("Allele Name1" = "allele"))
IEDB_immuno_MHCII_filter1 <- na.omit(IEDB_immuno_MHCII_filter1) #with response
IEDB_immuno_MHCII_filter_final <- IEDB_immuno_MHCII_filter_final[!(duplicated(IEDB_immuno_MHCII_filter_final[,c(1,4)])),]
IEDB_immuno_MHCII_filter_final
<- unique((IEDB_immuno_MHCII_filter_final$`Allele Name1`))
MHC_immuno_allele write_lines(MHC_immuno_allele,"../data/IEDB_MHCII_immuno_allele.csv")
write_csv(IEDB_immuno_MHCII_filter_final,"../data/IEDB_MHCII_immuno.csv")
# plt fig
ggplot(data = IEDB_immuno_MHCII_filter_final)+
geom_histogram(aes(x = `Method/Technique`),stat = "count")+xlab("Expirement Type")+ylab("Number") + labs(title = "IEDB")+ theme(plot.title = element_text(hjust = 0.5))
ggplot(data = IEDB_immuno_MHCII_filter_final)+
geom_histogram(aes(x = `Qualitative Measure`),stat = "count")+xlab("Label")+ylab("Number") + labs(title = "IEDB")+ theme(plot.title = element_text(hjust = 0.5))
Data for benchmark:
NEW!! These part is deleated in the paper.
#R
<- IEDB_immuno_MHCII_filter %>% filter(nchar(`Allele Name`)>=14) %>% filter(length>=13&length<=21) %>%
IEDB_immuno_MHCII_filter2 filter(!(`Method/Technique`=="51 chromium"|`Method/Technique`=="ELISA"|`Method/Technique`=="ELISPOT"|`Method/Technique`=="ICS"|`Method/Technique`=="multimer/tetramer")) %>%
filter(str_detect(`Allele Name`,"HLA"))
$`Allele Name1` <- sapply(IEDB_immuno_MHCII_filter2$`Allele Name`,fun)
IEDB_immuno_MHCII_filter2<- left_join(IEDB_immuno_MHCII_filter2,pseduo_seq, by = c("Allele Name1" = "allele"))
IEDB_immuno_MHCII_filter2 <- na.omit(IEDB_immuno_MHCII_filter2) # only response
IEDB_immuno_MHCII_benchmark <- IEDB_immuno_MHCII_benchmark[!(duplicated(IEDB_immuno_MHCII_benchmark[,c(1,4)])),]
IEDB_immuno_MHCII_benchmark <- IEDB_immuno_MHCII_benchmark$Description %in% IEDB_immuno_MHCII_filter_final$Description
bool1 <- IEDB_immuno_MHCII_benchmark$`Allele Name` %in% IEDB_immuno_MHCII_filter_final$`Allele Name`
bool2 <- IEDB_immuno_MHCII_benchmark[!(bool1&bool2),]
IEDB_immuno_MHCII_benchmark_final #write_csv(IEDB_immuno_MHCII_benchmark_final,"../data/IEDB_benchmark.csv")
We also analyzed the distribution of HLA types in the immunogenicity data and found that it has a similar distribution to the Binding affinity data, and the peptides with a length of 15mer accounted for the largest proportion.
= pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_train/IEDB_MHCII_immuno.csv")
IMM_data "Qualitative Measure"].value_counts().index,IMM_data["Qualitative Measure"].value_counts().values,palette =["green","red"])
sns.barplot(IMM_data[#> <AxesSubplot:>
#>
#> /Users/wangguangshuai/miniforge3/lib/python3.9/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
#> warnings.warn(
"Label")
plt.xlabel(#> Text(0.5, 0, 'Label')
"Number")
plt.ylabel(#> Text(0, 0.5, 'Number')
"../figure/IMM_label.pdf",dpi = 300,transparent=True)
plt.savefig(= IMM_data[IMM_data["Qualitative Measure"] == "Positive"]
allele_pos = pd.DataFrame(allele_pos["Allele Name1"].value_counts())
x "frac"] = x["Allele Name1"].map(lambda y: y/sum(x["Allele Name1"].values))
x[= list(x.index[0:15])
a = list(x["frac"].values[0:15])
b "Others")
a.append(sum(x["frac"].values[15:]))
b.append(= plt.subplots()
fig,ax = "green")
plt.bar(a,b,color #> <BarContainer object of 16 artists>
0,0.3])
plt.ylim([#> (0.0, 0.3)
0.00,0.05,0.10,0.15,0.20,0.25])
plt.yticks([#> ([<matplotlib.axis.YTick object at 0x178e20d90>, <matplotlib.axis.YTick object at 0x178e054c0>, <matplotlib.axis.YTick object at 0x28e92f940>, <matplotlib.axis.YTick object at 0x17e9b12b0>, <matplotlib.axis.YTick object at 0x17ea0c7f0>, <matplotlib.axis.YTick object at 0x28eea7430>], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
= 90)
plt.xticks(rotation #> ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
"../figure/allele_pos.pdf",dpi = 300,bbox_inches='tight',transparent=True)
plt.savefig( plt.show()
= IMM_data[IMM_data["Qualitative Measure"] == "Negative"]
allele_neg = pd.DataFrame(allele_neg["Allele Name1"].value_counts())
x "frac"] = x["Allele Name1"].map(lambda y: y/sum(x["Allele Name1"].values))
x[= list(x.index[0:15])
a = list(x["frac"].values[0:15])
b "Others")
a.append(sum(x["frac"].values[15:]))
b.append(= plt.subplots(figsize = (8,6))
fig,ax = plt.bar(a,b,color = "red")
ax 0,0.3])
plt.ylim([#> (0.0, 0.3)
0.00,0.05,0.10,0.15,0.20,0.25])
plt.yticks([#> ([<matplotlib.axis.YTick object at 0x17e817700>, <matplotlib.axis.YTick object at 0x17e8174c0>, <matplotlib.axis.YTick object at 0x28eea74f0>, <matplotlib.axis.YTick object at 0x28eece610>, <matplotlib.axis.YTick object at 0x28eeced00>, <matplotlib.axis.YTick object at 0x17e081610>], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
= 90)
plt.xticks(rotation #> ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
"../figure/allele_neg.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
= list(allele_pos["length"].value_counts().values)
pos_len_value = allele_pos["length"].value_counts().index
pos_len_index = allele_pos["length"].value_counts().values/sum(allele_pos["length"].value_counts())
norm_pos_len_value = list(allele_neg["length"].value_counts().values)
neg_len_value = allele_neg["length"].value_counts().index
neg_len_index = allele_neg["length"].value_counts().values/sum(allele_neg["length"].value_counts())
norm_neg_len_value = plt.subplots(figsize = (8,6))
fig,ax = 0.35
width = pos_len_index
x - width/2, norm_pos_len_value,width,label='immunogenic',color = "green")
ax.bar(pos_len_index #> <BarContainer object of 9 artists>
+ width/2, norm_neg_len_value, width, label='noimmunogenic',color = "red")
ax.bar(neg_len_index #> <BarContainer object of 9 artists>
'proportion')
ax.set_ylabel(#> Text(0, 0.5, 'proportion')
'length')
ax.set_xlabel(#> Text(0.5, 0, 'length')
'Distribution of Length')
ax.set_title(#> Text(0.5, 1.0, 'Distribution of Length')
ax.set_xticks(x)#> [<matplotlib.axis.XTick object at 0x28d2e1670>, <matplotlib.axis.XTick object at 0x28d2e1370>, <matplotlib.axis.XTick object at 0x17e80f280>, <matplotlib.axis.XTick object at 0x28d2d4940>, <matplotlib.axis.XTick object at 0x28d2d49d0>, <matplotlib.axis.XTick object at 0x28ed4d4c0>, <matplotlib.axis.XTick object at 0x28d2d46d0>, <matplotlib.axis.XTick object at 0x28e9890a0>, <matplotlib.axis.XTick object at 0x17e827580>]
ax.legend()#> <matplotlib.legend.Legend object at 0x17e802b50>
"../figure/length.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
Get random peptide as decoay
In our data, we found that the percentage of data with immunogenic peptides is more than the percentage of data without immunogenicity. The reason for this problem is that the data without response information (number of subjects tested/responded) in the IEDB dataset are also considered as negative peptides and we removed them during data cleaning. To avoid this anomaly, we used all human peptide sequence download from the universal protein knowledgebase (UniProt) database (2022.3) to get random artificial negative peptides. We generated 2 times and 5 times random peptides for each length of each MHC allele as negative instances to construct IEDB-training dataset and IEDB-benchmark dataset.
#python
=open("../data/uniport_human.fasta")
f=[]
lsfor line in f:
if not line.startswith('>'):
'\n',''))
ls.append(line.replace(
f.close()= pd.DataFrame(ls)
uniport "length"] = uniport[0].map(len)
uniport[bool = uniport["length"]>30
= uniport[bool]
uniport bool = uniport[0].apply(lambda x: set(list(x)).issubset(set(list("ARNDCQEGHILKMFPSTWYVX"))))
= uniport[bool]
uniport = uniport[0].values
uniport_random def random_select(x,length,num,allele):
= []
df all = []
for i in range(num):
= random.randint(0,len(x)-1)
protein_num = x[protein_num]
protein = random.randint(0,len(protein)-length)
peptide_num = protein[peptide_num:peptide_num+length]
peptide
df.append(peptide)all.append(allele)
return df,all
= pd.read_csv("../data/IEDB_MHCII_immuno.csv")
IMM_data = []
negative = []
allele = 2 # negative times, 2 for IEDB-training and 5 for IEDB-benchmark
time for i in range(time):
for i in IMM_data["Allele Name1"].unique():
= IMM_data[IMM_data["Allele Name1"] == i]["length"].value_counts()
smal_IMM for l in range(len(smal_IMM)):
= int(smal_IMM.index[l])
length = int(smal_IMM.values[l])
num all = random_select(uniport_random,length,num,i)
pep,
negative.extend(pep)all)
allele.extend(
= IMM_data[["Description","Qualitative Measure","Allele Name1"]]
IMM_data = pd.DataFrame({"Description":negative,"Allele Name1":allele})
random_data "Qualitative Measure"] = "Negative"
random_data[= pd.concat([IMM_data,random_data],axis=0)
final_data "label"] = final_data["Qualitative Measure"].map(lambda x : 1 if x == "Positive" else 0)
final_data[= final_data.drop_duplicates()
final_data = pd.read_table("../data/pseudosequence.2016.all.X.dat",header=None,names=["Allele Name1","sequence"])
pseudo_data = pseudo_data.drop_duplicates()
pseudo_data = pd.merge(final_data,pseudo_data).drop_duplicates()
Data #Data.to_csv("../data/IMM_data_random{}.csv".format(time))
Model train
We employed transfer learning to add binding affinity information by extract penultimate layer of BA model. We also used LSTM layers to get antigen and MHC information, because we found our model got a great significant improvement by using this method. We concatenated the three numerical vectors into a single layer, added a dense layer with 400 neurons activated by RELU, a dense layer with 200 neurons activated by RELU and the last layer with a single neuron with sigmoid activation.We named the final model of MHC II immunogenicity prediction network based on transfer learning as “TLimmuno2”.
#PBS
#python
import pandas as pd
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import *
from sklearn.model_selection import train_test_split,KFold
import numpy as np
from collections import Counter
from sklearn.metrics import auc,precision_recall_curve,roc_curve,confusion_matrix,average_precision_score
from collections import Counter
import matplotlib.pyplot as plt
import sys
"/public/slst/home/wanggsh/Project/MHCII/utils")
sys.path.append(from Blosum62 import blosum62
#Data process
= pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/IMM_data_random2.csv")
immuno_data "pep_blosum"] = immuno_data["Description"].apply(blosum62,args=(21,))
immuno_data["MHC_blosum"] = immuno_data["sequence"].apply(blosum62,args=(34,))
immuno_data[= np.empty((len(immuno_data),21,21))
peptide for i in range(len(immuno_data)):
= immuno_data["pep_blosum"][i].reshape((21,21))
peptide[i] = np.empty((len(immuno_data),34,21))
MHC for i in range(len(immuno_data)):
= immuno_data["MHC_blosum"][i].reshape((34,21))
MHC[i] = immuno_data["label"].values
labels = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
BAmodel = tf.keras.models.Model(inputs = BAmodel.input,outputs = BAmodel.layers[-2].output)
BAmodel = BAmodel.predict([peptide,MHC])
BA #class weight
= np.bincount(immuno_data["label"])
neg, pos = neg + pos
total print('Examples:\n Total: {}\n Positive: {} ({:.2f}% of total)\n'.format(
100 * pos / total))
total, pos, = (1 / neg) * (total / 2.0)
weight_for_0 = (1 / pos) * (total / 2.0)
weight_for_1 = {0: weight_for_0, 1: weight_for_1}
class_weight print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))
#
= train_test_split(peptide,MHC,labels,BA,test_size=0.1,random_state=202205,stratify=labels)
pep_Train,pep_Test,MHC_Train,MHC_Test,label_Train,label_Test,BA_Train,BA_Test print(Counter(label_Train))
print(Counter(label_Test))
print("Data progress finish!!")
def LSTMmodel():
= tf.keras.Input(shape = (21,21))
pep = tf.keras.Input(shape = (34,21))
MHC = tf.keras.Input(shape = (10))
BA
= tf.keras.layers.LSTM(256,return_sequences=True)(pep)
x = tf.keras.layers.LSTM(128,return_sequences=True)(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.Model(inputs = pep,outputs = x)
x
= tf.keras.layers.LSTM(256,return_sequences=True)(MHC)
y = tf.keras.layers.LSTM(128,return_sequences=True)(y)
y = tf.keras.layers.Flatten()(y)
y = tf.keras.Model(inputs = MHC,outputs = y)
y
= tf.keras.layers.concatenate([x.output,y.output,BA])
combined
= tf.keras.layers.Dense(400,activation="relu")(combined)
z #z = tf.keras.layers.Dropout(0.8)(z)
= tf.keras.layers.Dense(200,activation = "relu")(z) #,kernel_regularizer=tf.keras.regularizers.l2(0.0001)
z #z = tf.keras.layers.Dropout(0.8)(z)
= tf.keras.layers.Dense(1,activation = "sigmoid")(z)
z
= tf.keras.Model(inputs = [pep,MHC,BA],outputs = z)
model return model
#learning rate
def get_lr_metric(optimizer):
def lr(y_true, y_pred):
= optimizer._decayed_lr(tf.float32)
curLR return curLR
return lr
= tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)#change
callback = keras.losses.BinaryCrossentropy()
Loss = keras.optimizers.Adam(learning_rate = 0.00008) #
Optimizer = get_lr_metric(Optimizer)
lr_metric = [tf.keras.metrics.AUC(),tf.keras.metrics.BinaryAccuracy(),tf.keras.metrics.AUC(curve = "PR"),tf.keras.metrics.Precision()]
Metrics = 200
Batch_size= 150
Epochs= 2 #reduce information output
Verbose
#Cross validation
print("Cross Validation")
= KFold(n_splits=5,shuffle = True)
kf = {}
ROC = {}
PR = 1
i for train_index,val_index in kf.split(label_Train):
= pep_Train[train_index],MHC_Train[train_index],label_Train[train_index],BA_Train[train_index]
pep_train,MHC_train,label_train,BA_train = pep_Train[val_index],MHC_Train[val_index],label_Train[val_index],BA_Train[val_index]
pep_val,MHC_val,label_val,BA_val print(Counter(label_train))
print(Counter(label_val))
= LSTMmodel()
model compile(
model.= Loss,
loss = Optimizer,
optimizer = Metrics)
metrics #callback
= model.fit([pep_train,MHC_train,BA_train],label_train,
history =Batch_size,epochs=Epochs,
batch_size= ([pep_val,MHC_val,BA_val],label_val),
validation_data = Verbose,
verbose =class_weight,callbacks = callback)
class_weight= pd.DataFrame(history.history)
His_df "/public/slst/home/wanggsh/Project/Saved_model/History/immuno/corss_history{0}.csv".format(i))
His_df.to_csv(= model.predict([pep_val,MHC_val,BA_val])
prediction = roc_curve(label_val,prediction,pos_label=1)
fpr,tpr,thresholds = auc(fpr,tpr)
area_mine "flod{}".format(i)] = [fpr,tpr,thresholds,area_mine]
ROC[print("=============={} auc data finish==========".format(i))
= precision_recall_curve(label_val,prediction,pos_label=1)
precision,recall,thresholds = auc(recall,precision)
area_PR "flod{}".format(i)] = [recall,precision,thresholds,area_PR]
PR[print("=============={} pr data finish==========".format(i))
= i+1
i #----plot cross fig
= pd.DataFrame(ROC)
ROC_df = plt.figure()
fig = 2
lw = fig.add_subplot(111)
ax 0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot([for i in range(5):
0,i],ROC_df.iloc[1,i],label='flod{0} (area = {1:.4f})'.format(i,ROC_df.iloc[3,i]))
ax.plot(ROC_df.iloc[0,1.05])
ax.set_ylim([0,1])
ax.set_xlim(['False Positive Rate')
ax.set_xlabel('True Positive Rate')
ax.set_ylabel('Receiver operating characteristic example')
ax.set_title(="lower right")
ax.legend(loc"/public/slst/home/wanggsh/Project/Saved_model/Model_fig/Immuno_model/ROC_cross.pdf",dpi = 300)
plt.savefig(= pd.DataFrame(PR)
PR_df
plt.figure()= 2
lw for i in range(5):
0,i],PR_df.iloc[1,i],lw=lw, label = 'flod{0} (area = {1:.4f})'.format(i,PR_df.iloc[3,i]))
plt.plot(PR_df.iloc[0.0, 1.0])
plt.xlim([0.0, 1.05])
plt.ylim(['Recall')
plt.xlabel('Precision')
plt.ylabel('PR curve')
plt.title(="lower right")
plt.legend(loc
plt.show()"/public/slst/home/wanggsh/Project/Saved_model/Model_fig/Immuno_model/PR_cross.pdf",dpi = 300)
plt.savefig(print("-"*20)
print("start model training")
= LSTMmodel()
model
compile(
model.= Loss,
loss = Optimizer,
optimizer = Metrics)
metrics #callback
= model.fit([pep_Train,MHC_Train,BA_Train],label_Train,
history =Batch_size,epochs=Epochs,
batch_size= ([pep_Test,MHC_Test,BA_Test],label_Test),
validation_data = Verbose,
verbose = callback) #class_weight=class_weight
callbacks r"/public/slst/home/wanggsh/Project/Saved_model/MHCII_immuno/model")
model.save(
print("start plt fig")
= pd.DataFrame(history.history)
His_df "/public/slst/home/wanggsh/Project/Saved_model/History/immuno/history.csv")
His_df.to_csv(
= model.predict([pep_Test,MHC_Test,BA_Test])
prediction = model.predict([pep_Train,MHC_Train,BA_Train])
Train_pre = roc_curve(label_Test,prediction,pos_label=1)
fpr,tpr,thresholds = roc_curve(label_Train,Train_pre,pos_label=1)
train_fpr,train_tpr,_ = auc(fpr,tpr)
area_mine = auc(train_fpr,train_tpr)
area_mine_train print(area_mine)
print(area_mine_train)
= precision_recall_curve(label_Test,prediction,pos_label=1)
precision,recall,thresholds = precision_recall_curve(label_Train,Train_pre,pos_label=1)
precision_train,recall_train,_ = auc(recall,precision)
area_PR = auc(recall_train,precision_train)
area_PR_train
= plt.figure()
fig = 2
lw = fig.add_subplot(111)
ax 0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot([='Train AUC = (area = {0:.4f})'.format(area_mine_train))
ax.plot(train_fpr,train_tpr,label='Test AUC = (area = {0:.4f})'.format(area_mine))
ax.plot(fpr,tpr,label0,1.05])
ax.set_ylim([0,1])
ax.set_xlim(['False Positive Rate')
ax.set_xlabel('True Positive Rate')
ax.set_ylabel('Receiver operating characteristic example')
ax.set_title(="lower right")
ax.legend(loc"/public/slst/home/wanggsh/Project/Saved_model/Model_fig/Immuno_model/ROC.pdf",dpi = 300)
plt.savefig(
plt.figure()= 2
lw =lw, label = 'Train AUC = (area = {0:.4f})'.format(area_PR_train))
plt.plot(recall_train,precision_train,lw=lw, label = 'Test AUC = (area = {0:.4f})'.format(area_PR))
plt.plot(recall,precision,lw0.0, 1.0])
plt.xlim([0.0, 1.05])
plt.ylim(['Recall')
plt.xlabel('Precision')
plt.ylabel('PR curve example')
plt.title(="lower right")
plt.legend(loc
plt.show()"/public/slst/home/wanggsh/Project/Saved_model/Model_fig/Immuno_model/PR.pdf",dpi = 300)
plt.savefig(print("Finish")
TLimmuno2 outputs a continuous variable between 0 and 1 and also percentile rank. For each peptide-MHC II pair, percentile ranks were computed based on a dataset of 90,000 13- to 21-mer peptides (10,000 for each length) randomly selected from the human proteome, smaller rank indicates stronger immunogenicity
#PBS
#python
import pandas as pd
import numpy as np
import tensorflow as tf
import sys
"/public/slst/home/wanggsh/Project/MHCII/utils")
sys.path.append(from Blosum62 import blosum62
import os
"CUDA_VISIBLE_DEVICES"]="-1"
os.environ[#load_model
= tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
BA_model = tf.keras.models.Model(inputs = BA_model.input,outputs = BA_model.layers[-2].output)
BAmodel = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/MHCII_immuno/model")
IMM
= pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/kmer_data.csv")
Data "pep_blosum"] = Data["pep"].apply(blosum62,args=(21,))
Data["MHC_blosum"] = Data["sequence"].apply(blosum62,args=(34,))
Data[= np.empty((len(Data),21,21))
peptide for i in range(len(Data)):
= Data["pep_blosum"][i].reshape((21,21))
peptide[i] #mhc
= np.empty((len(Data),34,21))
MHC for i in range(len(Data)):
= Data["MHC_blosum"][i].reshape((34,21))
MHC[i]
= BAmodel.predict([peptide,MHC])
BA = IMM.predict([peptide,MHC,BA])
IMM_result "prediction"] = IMM_result
Data[
= pd.read_feather("/public/slst/home/wanggsh/Data/pseudo_blosum62.feather")
pseudo_seq
#background pep
= pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/IMM_bg_pep.csv")
IMM_bg_pep "pep_blosum"] = IMM_bg_pep["pep"].apply(blosum62,args=(21,))
IMM_bg_pep[= pd.DataFrame()
DF for i in Data["HLA"].unique():
"MHC"] = i
IMM_bg_pep[= pd.merge(IMM_bg_pep,pseudo_seq[["MHC","MHC_blosum"]])
x = np.empty((len(x),21,21))
peptide for z in range(len(x)):
= x["pep_blosum"][z].reshape((21,21))
peptide[z] = np.empty((len(x),34,21))
MHC for z in range(len(x)):
= x["MHC_blosum"][z].reshape((34,21))
MHC[z] = BAmodel.predict([peptide,MHC])
BA = IMM.predict([peptide,MHC,BA])
IMM_result = IMM_result.tolist()
IMM_result = Data[Data["HLA"]== i]
y = []
Rank for I in y["prediction"].values:
IMM_result.append(I)= 1-(sorted(IMM_result).index(IMM_result[-1])+1)/90001
rank
Rank.append(rank)
IMM_result.pop()"Rank"] = Rank
y[= pd.concat([DF,y])
DF "pep_blosum")
DF.pop("MHC_blosum")
DF.pop("/public/slst/home/wanggsh/Data/MHCII_immuno/kmer_rank_result.csv")
DF.to_csv(print("finish")
Model perfoemance
We first divided the data set into training and test sets by the ratio of 9:1 for model training and testing. To verify the stability of the model, 5-fold cross-validation was applied in the training process of TLimmuno2, and TLimmuno2 shows highly stable AUC (average 0.8727) across the validations . Overall, TLimmuno2 reaches AUC=0.9909 in training dataset and AUC=0.8648 in the test dataset .

cross ROC

cross PR

ROC

PR
model test dataset
= pd.read_csv("../data/model_test.csv")
data "len"] = data["peptide"].map(len)
data[= data[data["len"] >= 15]
data2 "peptide"].to_csv("../data/model_test_pep.csv",index = None,header = None)
data2[= pd.read_csv("../data/IEDB_MHCII_immuno.csv")
oridata = pd.merge(data,oridata[["Allele Name","Allele Name1"]],left_on = "MHC",right_on = "Allele Name1").drop_duplicates()
data1 "../data/model_test_netMHCIIpan.csv",index = None) data1.to_csv(
#pbs
#python
import pandas as pd
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import *
from sklearn.model_selection import train_test_split,KFold
import numpy as np
from collections import Counter
"CUDA_VISIBLE_DEVICES"] = "-1"
os.environ[from sklearn.metrics import auc,precision_recall_curve,roc_curve,confusion_matrix,average_precision_score
from collections import Counter
import matplotlib.pyplot as plt
import matplotlib as mpl
'pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42
mpl.rcParams['font.family':'Arial'})
plt.rcParams.update({import sys
"/public/slst/home/wanggsh/Project/MHCII/utils")
sys.path.append(from Blosum62 import blosum62
#Data process
= pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/IMM_data_random2.csv")
immuno_data
"pep_blosum"] = immuno_data["Description"].apply(blosum62,args=(21,))
immuno_data["MHC_blosum"] = immuno_data["sequence"].apply(blosum62,args=(34,))
immuno_data[= np.empty((len(immuno_data),21,21))
peptide for i in range(len(immuno_data)):
= immuno_data["pep_blosum"][i].reshape((21,21))
peptide[i] #mhc
= np.empty((len(immuno_data),34,21))
MHC for i in range(len(immuno_data)):
= immuno_data["MHC_blosum"][i].reshape((34,21))
MHC[i]
= immuno_data["label"].values
labels
= tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
BAmodel = tf.keras.models.Model(inputs = BAmodel.input,outputs = BAmodel.layers[-2].output)
BAmodel = BAmodel.predict([peptide,MHC])
BA
#class weight
= np.bincount(immuno_data["label"])
neg, pos = neg + pos
total print('Examples:\n Total: {}\n Positive: {} ({:.2f}% of total)\n'.format(
100 * pos / total))
total, pos,
= (1 / neg) * (total / 2.0)
weight_for_0 = (1 / pos) * (total / 2.0)
weight_for_1
= {0: weight_for_0, 1: weight_for_1}
class_weight
print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))
= train_test_split(peptide,MHC,labels,BA,test_size=0.1,random_state=202205,stratify=labels)
pep_Train,pep_Test,MHC_Train,MHC_Test,label_Train,label_Test,BA_Train,BA_Test print(Counter(label_Train))
print(Counter(label_Test))
print("Data progress finish!!")
# Cross validation
print("Cross Validation")
= tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/MHCII_immuno/model")
model
= model.predict([pep_Test,MHC_Test,BA_Test])
prediction = model.predict([pep_Train,MHC_Train,BA_Train])
Train_pre = roc_curve(label_Test,prediction,pos_label=1)
fpr,tpr,thresholds = roc_curve(label_Train,Train_pre,pos_label=1)
train_fpr,train_tpr,_ = auc(fpr,tpr)
area_mine = auc(train_fpr,train_tpr)
area_mine_train print(area_mine)
print(area_mine_train)
= precision_recall_curve(label_Test,prediction,pos_label=1)
precision,recall,thresholds = precision_recall_curve(label_Train,Train_pre,pos_label=1)
precision_train,recall_train,_ = auc(recall,precision)
area_PR = auc(recall_train,precision_train)
area_PR_train
= plt.subplots(figsize = (8,8))
fig,ax = 2
lw 0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot([='Train dataset (AUC = {0:.4f})'.format(area_mine_train))
ax.plot(train_fpr,train_tpr,label='Test dataset (AUC = {0:.4f})'.format(area_mine))
ax.plot(fpr,tpr,label0.0, 1.0])
plt.xlim([0.0, 1.05])
plt.ylim(["False Positive Rate",fontsize = 14)
plt.xlabel("True Positive Rate",fontsize = 14)
plt.ylabel("")
plt.title(="lower right",fontsize = 12)
plt.legend(loc"/public/slst/home/wanggsh/Project/Saved_model/Model_fig/Immuno_model/ROC_1121.pdf",dpi = 300,transparent=True)
plt.savefig(
= plt.subplots(figsize = (8,8))
fig,ax = 2
lw =lw, label = 'Train dataset (AUC = {0:.4f})'.format(area_PR_train))
ax.plot(recall_train,precision_train,lw=lw, label = 'Test dataset (AUC = {0:.4f})'.format(area_PR))
ax.plot(recall,precision,lw0.0, 1.0])
plt.xlim([0.0, 1.05])
plt.ylim(['Recall',fontsize = 14)
plt.xlabel('Precision',fontsize = 14)
plt.ylabel("")
plt.title(="lower right",fontsize = 12)
plt.legend(loc"/public/slst/home/wanggsh/Project/Saved_model/Model_fig/Immuno_model/PR_1121.pdf",dpi = 300,transparent=True)
plt.savefig(
def auc_pr(data,true_label,prediction,rank = False):
if rank:
= roc_curve(data[true_label],1-data[prediction])
fpr,tpr,_ = precision_recall_curve(data[true_label],1-data[prediction])
precision,recall,_ else:
= roc_curve(data[true_label],data[prediction])
fpr,tpr,_ = precision_recall_curve(data[true_label],data[prediction])
precision,recall,_
= auc(fpr,tpr)
AUC = auc(recall,precision)
PR
return fpr,tpr,recall,precision,AUC,PR
= pd.read_csv("/public/slst/home/wanggsh/Data/new/model_test.csv")
data = pd.read_csv("/public/slst/home/wanggsh/Data/new/model_test_iedb.csv")
test_iedb = pd.merge(data,test_iedb,left_on = "peptide",right_on = "Peptide")
test_iedb = auc_pr(test_iedb,"Label","Immunogenicity Score",rank = True)
test_iedb_result = pd.read_csv("/public/slst/home/wanggsh/Data/new/model_test_repitope.csv")
test_repitope = pd.merge(data,test_repitope,left_on = "peptide",right_on = "Peptide")
test_repitope = auc_pr(test_repitope,"Label","ImmunogenicityScore")
test_repitope_result = pd.read_csv("/public/slst/home/wanggsh/Data/new/model_test_ba.csv")
test_net_ba = pd.merge(data,test_net_ba,left_on = "peptide",right_on = "peptide")
test_net_ba = auc_pr(test_net_ba,"Label","percentile_rank",rank = True)
test_net_ba_result = pd.read_csv("/public/slst/home/wanggsh/Data/new/model_test_el.csv")
test_net_el = pd.merge(data,test_net_el,left_on = "peptide",right_on = "peptide")
test_net_el = auc_pr(test_net_el,"Label","percentile_rank",rank = True)
test_net_el_result
= plt.subplots(figsize = (8,8))
fig,ax = 2
lw 0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot([='TLimmuno2 : (AUC = {0:.4f})'.format(area_mine))
ax.plot(fpr,tpr,label0],test_iedb_result[1],label='IEDB : (AUC={0:.4f})'.format(test_iedb_result[4]),linewidth = 2)
ax.plot(test_iedb_result[0],test_repitope_result[1],label='Repitope : (AUC={0:.4f})'.format(test_repitope_result[4]),linewidth = 2)
ax.plot(test_repitope_result[0],test_net_ba_result[1],label='NetMHCIIpan_ba : (AUC={0:.4f})'.format(test_net_ba_result[4]),linewidth = 2)
ax.plot(test_net_ba_result[0],test_net_el_result[1],label='NetMHCIIpan_el : (AUC={0:.4f})'.format(test_net_el_result[4]),linewidth = 2)
ax.plot(test_net_el_result[0.0, 1.0])
plt.xlim([0.0, 1.05])
plt.ylim(["False Positive Rate",fontsize = 14)
plt.xlabel("True Positive Rate",fontsize = 14)
plt.ylabel("")
plt.title(="lower right",fontsize = 12)
plt.legend(loc"/public/slst/home/wanggsh/Data/new/test_com_ROC_1121.pdf",dpi = 300,transparent=True)
plt.savefig(
= plt.subplots(figsize = (8,8))
fig,ax = 2
lw =lw, label = 'Test dataset (AUC = {0:.4f})'.format(area_PR))
ax.plot(recall,precision,lw2],test_iedb_result[3],label='IEDB : (AUC={0:.4f})'.format(test_iedb_result[5]),linewidth = 2)
ax.plot(test_iedb_result[2],test_repitope_result[3],label='Repitope : (AUC={0:.4f})'.format(test_repitope_result[5]),linewidth = 2)
ax.plot(test_repitope_result[2],test_net_ba_result[3],label='NetMHCIIpan_ba : (AUC={0:.4f})'.format(test_net_ba_result[5]),linewidth = 2)
ax.plot(test_net_ba_result[2],test_net_el_result[3],label='NetMHCIIpan_el : (AUC={0:.4f})'.format(test_net_el_result[5]),linewidth = 2)
ax.plot(test_net_el_result[0.0, 1.0])
plt.xlim([0.0, 1.05])
plt.ylim(['Recall',fontsize = 14)
plt.xlabel('Precision',fontsize = 14)
plt.ylabel("")
plt.title(="lower right",fontsize = 12)
plt.legend(loc"/public/slst/home/wanggsh/Data/new/test_com_PR_1121.pdf",dpi = 300,transparent=True)
plt.savefig(print("Finish")
TLimmuno2 outperforms existing methods in peptide-MHC II immunogenicity prediction
This part focuses on another independent dataset to compare the performance of TLimmuno2. it contains the following sections:
- Benchmark in immunoigenicity
- Structure immportance
- Neoepitope and wildtype peptide
- Position importance
Benchmark in immunoigenicity
The IEDB-Benchmark dataset is from the same database as the training set, thus they may have similar distribution patterns, which may lead to systematic bias. To perform an unbiased performance evaluation in predicting immunogenic peptides, we retrieved published papers to collect neoepitopes that were tested experimentally for CD4+ T cell immunogenicity.
Data proprecessing
#python
= pd.read_csv("../data/nature_peptide.csv")
peptide = pd.read_csv("../data/nature_HLA.csv")
HLA = peptide.drop_duplicates()
peptide "Immunogenicity"].value_counts()
peptide[#> 0 849
#> 1 80
#> Name: Immunogenicity, dtype: int64
"Len"] = peptide["Peptide"].map(len)
peptide["Len"]>=15]["Peptide"].to_csv("../data/pep_only.txt",header = None,index = None,sep = " ")#chain indexing
peptide[peptide[= peptide[peptide["Len"]>=15].reset_index(drop = True)
IEDB_pep "num"] = [x+1 for x in range(len(IEDB_pep))]#used to predict by IEDB tools
IEDB_pep[= ["Pep{}".format(x+1) for x in range(len(peptide))]
pep_ID "pep_ID"] = pep_ID
peptide[= plt.subplots()
fig,ax = peptide,x = "Len")
sns.countplot(data #> <AxesSubplot:xlabel='Len', ylabel='count'>
plt.show()
#python
= sns.countplot(data=peptide, x="Immunogenicity")
ax #ax.bar_label(ax.containers[0])
-0.1, 860, "{:.3f}".format(851/931))
ax.text(#> Text(-0.1, 860, '0.914')
0.9, 90, "{:.3f}".format(80/931))
ax.text(#> Text(0.9, 90, '0.086')
"../figure/nature_immuno_rate.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
Neoepitopes tested experimentally were usually sequences of 20–25 amino acids, so we use 15-mer and k-mer methods (Methods) to assign a rank value to a given neoepitope.
For the 15mer method, we split one neoepitope sequence into 15 subsequences and score or rank each subsequence. In the kmer method, neoepitopes were cut into 13-21-mers subsequence. The prediction of the neoepitope is the maximum score or the minimum ranking of the predicted results of all subsequences by different methods. In case of a neoepitope with multiple MHC types, we used the maximum score or minimum rank of prediction for different MHC types as the prediction result for the neoepitope.
#python
= pd.merge(HLA,peptide)
Data = pd.read_csv("../data/IEDB_MHCII_immuno_allele.csv",header = None,names = ["HLA"])
IMM_train_data_allele = Data[Data["HLA"].isin(IMM_train_data_allele["HLA"])]
Data = pd.read_table("../data/pseudosequence.2016.all.X.dat",header = None,names = ["HLA","sequence"])
pseudo_seq = pd.merge(Data,pseudo_seq)
Data1 "length"] = Data1["Peptide"].map(len)
Data1[= Data1[Data1["length"]>=13]
Data1 = Data1.reset_index(drop = True)
Data1 #15mer
def mer15(Pep):
= []
P = 15
Length for i in range(len(Pep) - Length +1):
= Pep[i:i+Length]
pep
P.append(pep)return P
= pd.DataFrame()
DF15 for i in range(len(Data1)):
= mer15(Data1["Peptide"][i])
pep = pd.DataFrame(pep)
df "pep_ID"] = Data1["pep_ID"][i]
df["HLA"] = Data1["HLA"][i]
df[= pd.concat([DF15,df])
DF15 = ["pep","pep_ID","HLA"]
DF15.columns = pd.merge(DF15,pseudo_seq)
benchmark_data "../data/15mer_data.csv")
benchmark_data.to_csv("pep"].to_csv("../data/15mer_pep_only.txt",header = None,index = None,sep = " ")
benchmark_data[#kmer
def kmer(Pep,sa,end):
= []
P = range(sa,end+1)
R for i in R:
= i
Length for i in range(len(Pep) - Length +1):
= Pep[i:i+Length]
pep
P.append(pep)return P
= pd.DataFrame()
DF_k for i in range(len(Data1)):
= kmer(Data1["Peptide"][i],13,21)
pep = pd.DataFrame(pep)
df "pep_ID"] = Data1["pep_ID"][i]
df["HLA"] = Data1["HLA"][i]
df[= pd.concat([DF_k,df])
DF_k = ["pep","pep_ID","HLA"]
DF_k.columns = pd.merge(DF_k,pseudo_seq)
benchmark_data "../data/kmer_data.csv") benchmark_data.to_csv(
Different tools
We also compared the performance of TLimmuno2 with other immunogenicity prediction tools, including IEDB tool (CD4episcore) and Repitope. Note some published methods, such as Deepitope and FIONA do not have the tools publicly available, and are thus not included in the subsequent comparison analysis. we can’t link their website for FIONA and we can’t find the Github repositories for Deepitope.
IEDB online tool(CD4episcore)
We run IEDB in their website IEDB online tool, and this tool only support 15mer peptides.
Repitope
You can find the Repitope in their Github, This tool is very slow and potentially problematic on larger datasets, so we only used 15mer peptides.
#PBS
#R
library(tidyverse)
library(data.table)
library(Repitope)
options(java.parameters="-Xmx60G")
<- fst::read_fst("~/Project/MHCII/Repitope/FragmentLibrary_TCRSet_Public_RepitopeV3.fst", as.data.table=T)
fragLibDT <- fst::read_fst("~/Project/MHCII/Repitope/FeatureDF_MHCII_Weighted.10000_RepitopeV3.fst", as.data.table=T)
featureDF_MHCII
<- read_csv("~/Project/MHCII/Repitope/15mer_data.csv")
data <- data$pep
peptideSet_of_interest =peptideSet_of_interest
peptideSet= sapply(data$pep, nchar)
len table(len)
<- EpitopePrioritization(
res_MHCII featureDF=featureDF_MHCII[Peptide%in%MHCII_Human$Peptide,],
metadataDF=MHCII_Human[,.(Peptide,Immunogenicity)],
peptideSet=peptideSet_of_interest,
peptideLengthSet=11:30,
fragLib=fragLibDT,
aaIndexIDSet="all",
fragLenSet=3:11,
fragDepth=10000,
fragLibType="Weighted",
featureSet=MHCII_Human_MinimumFeatureSet,
seedSet=1:1,
coreN=1,
outDir="~/Project/MHCII/Repitope/pre_result" ## Intermediate and final output files will be stored under this directory
)
netMHCIIpan BA&EL
This part code is similar to the IEDB-benchmark dataset, only change the input and output file, so we don’t add code in this part.
TLimmuno2
This part code is similar to the IEDB-benchmark dataset, only change the input and output file, so we don’t add code in this part.
Result
#python
def IMM_process(Data,peptide):
= Data["pep_ID"].unique()
sam_pep_ID = []
res = []
p_ID "length"] = Data["pep"].map(len)
Data[for a in sam_pep_ID:
= []
Mean = Data[Data["pep_ID"] == a]
ID for i in ID["HLA"].unique():
= ID[ID["HLA"] == i]
Data_HLA = []
prediction for l in Data_HLA["length"].unique():
= Data_HLA[Data_HLA["length"] == l]
Data_len = Data_len["prediction"].max()
pre
prediction.append(pre)= np.max(prediction)
x
Mean.append(x)max(Mean))
res.append(
p_ID.append(a)= pd.DataFrame({"pep_ID":p_ID,"prediction":res})
result = result.merge(peptide,how = "inner",on = "pep_ID" )
result return result
def auc_pr(data,true_label,prediction,rank = False):
if rank:
= roc_curve(data[true_label],1-data[prediction])
fpr,tpr,_ = precision_recall_curve(data[true_label],1-data[prediction])
precision,recall,_ else:
= roc_curve(data[true_label],data[prediction])
fpr,tpr,_ = precision_recall_curve(data[true_label],data[prediction])
precision,recall,_
= auc(fpr,tpr)
AUC = auc(recall,precision)
PR
return fpr,tpr,recall,precision,AUC,PR
= pd.read_csv("../data/15mer_result.csv")
result = IMM_process(result,peptide)
result_15mer = auc_pr(result_15mer,"Immunogenicity","prediction")
mer15_result print("AUC:{}".format(mer15_result[4]))
#> AUC:0.7332489451476794
print("PR_AUC:{}".format(mer15_result[5]))
#> PR_AUC:0.198391863815141
#python
= pd.read_csv("../data/kmer_result.csv")
result = IMM_process(result,peptide)
result_kmer = result_kmer[result_kmer["pep_ID"].isin(result_15mer["pep_ID"])]
result_kmer = auc_pr(result_kmer,"Immunogenicity","prediction")
kmer_result print("AUC:{}".format(kmer_result[4]))
#> AUC:0.7392635212888378
print("PR_AUC:{}".format(kmer_result[5]))
#> PR_AUC:0.20210218456922854
#python
= pd.read_csv("../data/IEDB_CD4_tools_manually.csv")
IEDB_res = []
Imm_score = []
pro_num for i in IEDB_res["Protein Number"].unique():
= IEDB_res[IEDB_res["Protein Number"] == i]
y = y["Immunogenicity Score"].mean()
imm_score
pro_num.append(i)
Imm_score.append(imm_score)= pd.DataFrame({"num":pro_num,"score":Imm_score})
IEDB_result = pd.merge(IEDB_result,IEDB_pep)
IEDB_result = IEDB_result[IEDB_result["Peptide"].isin(result_15mer["Peptide"])]
IEDB_result = auc_pr(IEDB_result,"Immunogenicity","score",rank = True)
IEDB_score print("AUC:{}".format(IEDB_score[4]))
#> AUC:0.608331415420023
print("PR_AUC:{}".format(IEDB_score[5]))
#> PR_AUC:0.13176952789635926
#python
= pd.read_csv("../data/repitope_15mer_result.csv")
Repitope_result = pd.merge(Repitope_result,DF15,left_on = "Peptide",right_on = "pep")
Repitope = Repitope["pep_ID"].unique()
sam_pep_ID = []
res = []
p_ID for a in sam_pep_ID:
= []
Mean = Repitope[Repitope["pep_ID"] == a]
ID for i in ID["HLA"].unique():
= ID[ID["HLA"] == i]
x = x["ImmunogenicityScore"].mean()
x_mean
Mean.append(x_mean)max(Mean))
res.append(
p_ID.append(a)= {"pep_ID":p_ID,"ImmunogenicityScore":res}
dic = pd.DataFrame(dic)
Repitope_pre = Repitope_pre.merge(peptide,how = "inner",on = "pep_ID" )
Repitope_result = auc_pr(Repitope_result,"Immunogenicity","ImmunogenicityScore")
Repitope_score print("AUC:{}".format(Repitope_score[4]))
#> AUC:0.48418872266973534
print("PR_AUC:{}".format(Repitope_score[5]))
#> PR_AUC:0.08742710133794984
#python
def netMHCIIpan_process(result,allele,ori_sorted,peptide):
#combine two type allele
= pd.merge(result,allele,left_on="allele",right_on="Allele Name")
result = result.sort_values(by=["peptide","HLA"]).reset_index(drop = True)
result_sorted #combine with ori data
= pd.merge(result_sorted,ori_sorted[["pep","pep_ID"]],left_on = "peptide",right_on ="pep")
result_combined #result_combined["pep_length"] = result_combined["peptide"].map(len)
= result_combined["pep_ID"].unique()
sam_pep_ID = []
res = []
p_ID for a in sam_pep_ID:
= []
Mean = result_combined[result_combined["pep_ID"] == a]
ID #get Mean of all splited pep and Max of all HLA
for i in ID["Allele Name"].unique():
= ID[ID["Allele Name"] == i]
Data_HLA = []
prediction for l in Data_HLA["pep_length"].unique():
= Data_HLA[Data_HLA["pep_length"] == l]
Data_len = Data_len["percentile_rank"].min()
pre
prediction.append(pre)= np.min(prediction)
x min(x))
res.append(np.
p_ID.append(a) = pd.DataFrame({"pep_ID":p_ID,"prediction":res})
final_result = final_result.merge(peptide,how = "inner",on = "pep_ID" )
final_result return final_result
= DF15.sort_values(by=["pep","HLA"]).reset_index(drop = True)
DF15_sorted = pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_benchmark/allele.csv")
allele = pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_benchmark/netMHCIIpan_15_mer_ba.csv")
netMHCIIpan_ba "pep_length"] = netMHCIIpan_ba["peptide"].map(len)
netMHCIIpan_ba[= netMHCIIpan_process(netMHCIIpan_ba,allele,DF15_sorted,peptide)
net_15mer_ba = auc_pr(net_15mer_ba,"Immunogenicity","prediction",rank = True)
mer15_ba_result print("AUC:{}".format(mer15_ba_result[4]))
#> AUC:0.7059071729957805
print("PR_AUC:{}".format(mer15_ba_result[5]))
#> PR_AUC:0.19857401640806183
= pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_benchmark/netMHCIIpan_15_mer_el.csv")
netMHCIIpan_el "pep_length"] = netMHCIIpan_el["peptide"].map(len)
netMHCIIpan_el[= netMHCIIpan_process(netMHCIIpan_el,allele,DF15_sorted,peptide)
net_15mer_el = auc_pr(net_15mer_el,"Immunogenicity","prediction",rank = True)
mer15_el_result print("AUC:{}".format(mer15_el_result[4]))
#> AUC:0.6991791331031837
print("PR_AUC:{}".format(mer15_el_result[5]))
#> PR_AUC:0.16379884083064927
= DF_k.sort_values(by=["pep","HLA"]).reset_index(drop = True)
DF_k_sorted = pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_benchmark/netMHCIIpan_k_mer_ba.csv")
netMHCIIpan_ba_k "pep_length"] =netMHCIIpan_ba_k["peptide"].map(len)
netMHCIIpan_ba_k[= netMHCIIpan_process(netMHCIIpan_ba_k,allele,DF_k_sorted,peptide)
net_kmer_ba = auc_pr(net_kmer_ba,"Immunogenicity","prediction",rank = True)
kmer_ba_result = pd.read_csv("../data/netMHCIIpan_k_mer_el.csv")
netMHCIIpan_el_k "pep_length"] =netMHCIIpan_el_k["peptide"].map(len)
netMHCIIpan_el_k[= netMHCIIpan_process(netMHCIIpan_el_k,allele,DF_k_sorted,peptide)
net_kmer_el = auc_pr(net_kmer_el,"Immunogenicity","prediction",rank = True)
kmer_el_result print("BA AUC: {}".format(kmer_ba_result[4]))
#> BA AUC: 0.7056808778005946
print("BA PR_AUC: {}".format(kmer_ba_result[5]))
#> BA PR_AUC: 0.1910149930669585
print("EL AUC: {}".format(kmer_el_result[4]))
#> EL AUC: 0.6999570907530572
print("EL PR_AUC: {}".format(kmer_el_result[5]))
#> EL PR_AUC: 0.1594417550386672
= pd.read_csv("../data/kmer_data.csv")
kmer_data "pep"].to_csv("../data/kmer_pep_only.csv",header = None,index = None) kmer_data[
In these comparisons, TLimmuno2 shows the best performance both in the 15-mer method and the k-mer method. Repitope and IEDB tools evaluate epitope immunogenicity using population MHC information, not considering the personalized MHC allele information, and have a poor performance in these comparisons.
#python
= plt.subplots(figsize = (8,8))
fig,ax = 2
lw 0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot([#> [<matplotlib.lines.Line2D object at 0x17871a700>]
0],mer15_result[1],label='TLimmuno2 : (AUC={0:.4f})'.format(mer15_result[4]), lw=lw)
ax.plot(mer15_result[#> [<matplotlib.lines.Line2D object at 0x17871aa60>]
0],IEDB_score[1],label='IEDB : (AUC={0:.4f})'.format(IEDB_score[4]), lw=lw)
ax.plot(IEDB_score[#> [<matplotlib.lines.Line2D object at 0x17871abe0>]
0],Repitope_score[1],label='Repitope : (AUC={0:.4f})'.format(Repitope_score[4]), lw=lw)
ax.plot(Repitope_score[#> [<matplotlib.lines.Line2D object at 0x17871af70>]
0],mer15_ba_result[1],label='NetMHCIIpan_ba : (AUC={0:.4f})'.format(mer15_ba_result[4]), lw=lw)
ax.plot(mer15_ba_result[#> [<matplotlib.lines.Line2D object at 0x17871a550>]
0],mer15_el_result[1],label='NetMHCIIpan_el : (AUC={0:.4f})'.format(mer15_el_result[4]), lw=lw)
ax.plot(mer15_el_result[#> [<matplotlib.lines.Line2D object at 0x17871a1c0>]
0,1.05])
ax.set_ylim([#> (0.0, 1.05)
0,1])
ax.set_xlim([#> (0.0, 1.0)
'False Positive Rate',fontsize = 14)
ax.set_xlabel(#> Text(0.5, 0, 'False Positive Rate')
'True Positive Rate',fontsize = 14)
ax.set_ylabel(#> Text(0, 0.5, 'True Positive Rate')
="lower right",fontsize = 12)
ax.legend(loc#> <matplotlib.legend.Legend object at 0x178704ee0>
"../figure/benchmark_15mer.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
#python
= plt.subplots(figsize = (8,8))
fig,ax = 2
lw 0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot([#> [<matplotlib.lines.Line2D object at 0x178744700>]
0],kmer_result[1],label='TLimmuno2 : (AUC={0:.4f})'.format(kmer_result[4]))
ax.plot(kmer_result[#> [<matplotlib.lines.Line2D object at 0x178744820>]
0],kmer_ba_result[1],label='NetMHCIIpan_ba : (AUC={0:.4f})'.format(kmer_ba_result[4]))
ax.plot(kmer_ba_result[#> [<matplotlib.lines.Line2D object at 0x178744d60>]
0],kmer_el_result[1],label='NetMHCIIpan_el : (AUC={0:.4f})'.format(kmer_el_result[4]))
ax.plot(kmer_el_result[#> [<matplotlib.lines.Line2D object at 0x178ba7f10>]
0,1.05])
ax.set_ylim([#> (0.0, 1.05)
0,1])
ax.set_xlim([#> (0.0, 1.0)
'False Positive Rate',fontsize = 14)
ax.set_xlabel(#> Text(0.5, 0, 'False Positive Rate')
'True Positive Rate',fontsize = 14)
ax.set_ylabel(#> Text(0, 0.5, 'True Positive Rate')
="lower right",fontsize = 12)
ax.legend(loc#> <matplotlib.legend.Legend object at 0x17878cb50>
"../figure/benchmark_kmer.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
#python
= plt.subplots(figsize = (8,8))
fig,ax = 2
lw 2],mer15_result[3],label='TIMM2pred : (AUC={0:.4f})'.format(mer15_result[5]))
ax.plot(mer15_result[#> [<matplotlib.lines.Line2D object at 0x17e80f400>]
2],IEDB_score[3],label='IEDB : (AUC={0:.4f})'.format(IEDB_score[5]))
ax.plot(IEDB_score[#> [<matplotlib.lines.Line2D object at 0x28e9891c0>]
2],Repitope_score[3],label='Repitope : (AUC={0:.4f})'.format(Repitope_score[5]))
ax.plot(Repitope_score[#> [<matplotlib.lines.Line2D object at 0x28e9895b0>]
2],mer15_ba_result[3],label='NetMHCIIpan_ba : (AUC={0:.4f})'.format(mer15_ba_result[5]))
ax.plot(mer15_ba_result[#> [<matplotlib.lines.Line2D object at 0x28e989130>]
2],mer15_el_result[3],label='NetMHCIIpan_el : (AUC={0:.4f})'.format(mer15_el_result[5]))
ax.plot(mer15_el_result[#> [<matplotlib.lines.Line2D object at 0x28e989070>]
0.0, 1.0])
plt.xlim([#> (0.0, 1.0)
0.0, 1.05])
plt.ylim([#> (0.0, 1.05)
'Recall',fontsize = 14)
plt.xlabel(#> Text(0.5, 0, 'Recall')
'Precision',fontsize = 14)
plt.ylabel(#> Text(0, 0.5, 'Precision')
="upper right",fontsize = 12)
plt.legend(loc#> <matplotlib.legend.Legend object at 0x17ea0b640>
"../figure/15mer_PR.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
#python
= plt.subplots(figsize = (8,8))
fig,ax = 2
lw 2],kmer_result[3],label='TIMM2pred : (AUC={0:.4f})'.format(kmer_result[5]))
ax.plot(kmer_result[#> [<matplotlib.lines.Line2D object at 0x1787d52b0>]
2],kmer_ba_result[3],label='NetMHCIIpan_ba : (AUC={0:.4f})'.format(kmer_ba_result[5]))
ax.plot(kmer_ba_result[#> [<matplotlib.lines.Line2D object at 0x17e0f5fa0>]
2],kmer_el_result[3],label='NetMHCIIpan_el : (AUC={0:.4f})'.format(kmer_el_result[5]))
ax.plot(kmer_el_result[#> [<matplotlib.lines.Line2D object at 0x17e0f57f0>]
0.0, 1.0])
plt.xlim([#> (0.0, 1.0)
0.0, 1.05])
plt.ylim([#> (0.0, 1.05)
'Recall',fontsize = 14)
plt.xlabel(#> Text(0.5, 0, 'Recall')
'Precision',fontsize = 14)
plt.ylabel(#> Text(0, 0.5, 'Precision')
="upper right",fontsize = 12)
plt.legend(loc#> <matplotlib.legend.Legend object at 0x1787b7e80>
"../figure/kmer_pr.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
Confusion matrix
"pre_label"] = result_15mer["prediction"].map(lambda x : 1 if x>0.5 else 0)
result_15mer[= confusion_matrix(result_15mer["Immunogenicity"],result_15mer["pre_label"])
result_15mer_CM = result_15mer_CM/len(result_15mer)
result_15mer_CM "pre_label"] = result_kmer["prediction"].map(lambda x : 1 if x>0.5 else 0)
result_kmer[= confusion_matrix(result_kmer["Immunogenicity"],result_kmer["pre_label"])
result_kmer_CM = result_kmer_CM/len(result_kmer)
result_kmer_CM "pre_label"] = IEDB_result["score"].map(lambda x : 1 if x>50 else 0)
IEDB_result[= confusion_matrix(IEDB_result["Immunogenicity"],IEDB_result["pre_label"])
IEDB_result_CM = IEDB_result_CM/len(IEDB_result)
IEDB_result_CM "pre_label"] = Repitope_result["ImmunogenicityScore"].map(lambda x : 1 if x>0.5 else 0)
Repitope_result[= confusion_matrix(Repitope_result["Immunogenicity"],Repitope_result["pre_label"])
Repitope_result_CM = Repitope_result_CM/len(Repitope_result)
Repitope_result_CM
def net_CM(data):
"pre_label"] = data["prediction"].map(lambda x : 1 if x<10 else 0)
data[= confusion_matrix(data["Immunogenicity"],data["pre_label"])
data_CM = data_CM/len(data)
data_CM return data_CM
= net_CM(net_15mer_ba)
net_15mer_ba_CM = net_CM(net_15mer_el)
net_15mer_el_CM = net_CM(net_kmer_ba)
net_kmer_ba_CM = net_CM(net_kmer_el)
net_kmer_el_CM = [result_15mer_CM,result_kmer_CM,IEDB_result_CM,Repitope_result_CM,net_15mer_ba_CM,net_15mer_el_CM,net_kmer_ba_CM,net_kmer_el_CM] CM_all
= [result_15mer_CM,IEDB_result_CM,Repitope_result_CM,net_15mer_ba_CM,net_15mer_el_CM,result_kmer_CM,net_kmer_ba_CM,net_kmer_el_CM]
CM_all = ["TLimmuno2_15mer","IEDB","Repitope","NetMHCIIpanBA_15mer","NetMHCIIpanEL_15mer","TLimmuon2_kmer","NetMHCIIpanBA_kmer","NetMHCIIpanEL_kmer"]
CM_label def CM_plot(data,label):
= plt.subplots(figsize = (4,4))
fig,ax =True,cmap="YlGnBu",vmin=0, vmax=1,cbar = False)
sns.heatmap(data,annot "True label")
plt.ylabel("Prediction label")
plt.xlabel("{}".format(label))
plt.title(
"../figure/{}.pdf".format(label),dpi = 300,transparent=True)
plt.savefig(
for i,l in enumerate(CM_all):
CM_plot(l,CM_label[i])
#> <string>:2: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Structure immportance
Similarly, we want to demonstrate the impact of transfer learning on TLimmuno2, so we reconstructed two models, only BA and without BA. For the only BA model, its input layer contains only the transfer output of the BA model, while for the without BA, we masked the BA model and only the LSTM layer left. The detail code of this part is not inclued in the report. In independent validation datasets, the performance of these two models is significant decreased compared with TLimmuno2. These results demonstrated that the transfer learning approach improves the predictive power of TLimmuno2 in peptide-MHC II immunogenicity prediction.
= pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_benchmark/kmer_no_ba_result.csv")
kmer_no_ba = kmer_no_ba["pep_ID"].unique()
sam_pep_ID = kmer_no_ba["pep_ID"].unique()
sam_pep_ID = []
res = []
p_ID for a in sam_pep_ID:
= []
Mean = kmer_no_ba[kmer_no_ba["pep_ID"] == a]
ID for i in ID["HLA"].unique():
= ID[ID["HLA"] == i]
x #x_mean = x["Rank"].mean()
= x["prediction"].max()
x_mean
Mean.append(x_mean)max(Mean))
res.append(
p_ID.append(a)= {"pep_ID":p_ID,"prediction":res}
dic = pd.DataFrame(dic)
kmer_no_ba_result = kmer_no_ba_result.merge(peptide,how = "inner",on = "pep_ID" )
kmer_no_ba_result = auc_pr(kmer_no_ba_result,"Immunogenicity","prediction")
kmer_no_ba print("AUC:{}".format(kmer_no_ba[4]))
#> AUC:0.6920828538550057
print("PR_AUC:{}".format(kmer_no_ba[5]))
#> PR_AUC:0.14357445791350346
= pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_benchmark/kmer_onlyBA_result.csv")
kmer_only_ba = kmer_only_ba["pep_ID"].unique()
sam_pep_ID = []
res = []
p_ID for a in sam_pep_ID:
= []
Mean = kmer_only_ba[kmer_only_ba["pep_ID"] == a]
ID for i in ID["HLA"].unique():
= ID[ID["HLA"] == i]
x #x_mean = x["Rank"].mean()
= x["prediction"].max()
x_mean
Mean.append(x_mean)max(Mean))
res.append(
p_ID.append(a)= {"pep_ID":p_ID,"prediction":res}
dic = pd.DataFrame(dic)
kmer_only_ba_result = kmer_only_ba_result.merge(peptide,how = "inner",on = "pep_ID" )
kmer_only_ba_result = auc_pr(kmer_only_ba_result,"Immunogenicity","prediction")
kmer_only_ba print("AUC:{}".format(kmer_only_ba[4]))
#> AUC:0.6529267357115459
print("PR_AUC:{}".format(kmer_only_ba[5]))
#> PR_AUC:0.13744676908175116
= DelongTest(result_kmer["prediction"],kmer_no_ba_result["prediction"],result_kmer["Immunogenicity"]).show_result()
z_score_noba,p_val_noba = DelongTest(result_kmer["prediction"],kmer_only_ba_result["prediction"],result_kmer["Immunogenicity"]).show_result()
z_score_only_ba,p_val_only_ba = plt.subplots(figsize = (8,8))
fig,ax = 2
lw 0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot([#> [<matplotlib.lines.Line2D object at 0x17876ef70>]
0],kmer_result[1],label='TLimmuno2 : (AUC={0:.4f})'.format(kmer_result[4]))
ax.plot(kmer_result[#> [<matplotlib.lines.Line2D object at 0x295ec0bb0>]
0],kmer_no_ba[1],label='TLimmuno2_noba : (AUC={0:.4f})'.format(kmer_no_ba[4]))
ax.plot(kmer_no_ba[#> [<matplotlib.lines.Line2D object at 0x17ed31a30>]
0,1.05])
ax.set_ylim([#> (0.0, 1.05)
0,1])
ax.set_xlim([#> (0.0, 1.0)
'False Positive Rate',fontsize = 14)
ax.set_xlabel(#> Text(0.5, 0, 'False Positive Rate')
'True Positive Rate',fontsize = 14)
ax.set_ylabel(#> Text(0, 0.5, 'True Positive Rate')
0.78,0.2,"P value: {:.4f}".format(p_val_noba),weight="bold")
ax.text(#> Text(0.78, 0.2, 'P value: 0.0142')
="lower right",fontsize = 12)
ax.legend(loc#> <matplotlib.legend.Legend object at 0x28eddc4f0>
"../figure/kmer_noba.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
= plt.subplots(figsize = (8,8))
fig,ax = 2
lw 0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot([#> [<matplotlib.lines.Line2D object at 0x295dfb0d0>]
0],kmer_result[1],label='TLimmuno2 : (AUC={0:.4f})'.format(kmer_result[4]))
ax.plot(kmer_result[#> [<matplotlib.lines.Line2D object at 0x295dfb670>]
0],kmer_only_ba[1],label='TLimmuno2_onlyba : (AUC={0:.4f})'.format(kmer_only_ba[4]))
ax.plot(kmer_only_ba[#> [<matplotlib.lines.Line2D object at 0x295dfb6a0>]
0,1.05])
ax.set_ylim([#> (0.0, 1.05)
0,1])
ax.set_xlim([#> (0.0, 1.0)
'False Positive Rate',fontsize = 14)
ax.set_xlabel(#> Text(0.5, 0, 'False Positive Rate')
'True Positive Rate',fontsize = 14)
ax.set_ylabel(#> Text(0, 0.5, 'True Positive Rate')
0.78,0.2,"P value: {:.4f}".format(p_val_only_ba),weight="bold")
ax.text(#> Text(0.78, 0.2, 'P value: 0.0042')
="lower right",fontsize = 12)
ax.legend(loc#> <matplotlib.legend.Legend object at 0x295d566d0>
"../figure/kmer_onlyba.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
Neoepitope and wildtype peptide
To further demonstrate the discriminative power of TLimmuno2, we collected 52 MHC II neoepitopes and their corresponding wild-type peptides from published papers . These neoepitopes have been experimentally shown to elicit T cell responses. For each neoantigen and its corresponding wild-type peptide, we predicted the immunogenicity rank values by TLimmuno2. As shown in Figure below, the TLimmuno2 predicted immunogenicity rank values of mutant neopeptides are significantly lower than those of wild-type peptides, and this demonstrates the strong ability of TLimmuno2 in recognizing real neoepitope, even if the neoepitope differs from its wild-type peptide by only one amino acid.
#python
= pd.read_csv("../data/neoantigen.csv")
neoantigen = pd.read_table("../data/pseudosequence.2016.all.X.dat",header = None,names = ["allele","sequence"])
pseudo = pd.merge(neoantigen,pseudo,left_on = "HLA",right_on = "allele").dropna()
neoantigen_process "../data/neoantigen_pro.csv") neoantigen_process.to_csv(
#python
= pd.read_csv("../data/neoantigen_result.csv")
result = result[result["immunogenic"] == 1]
result = range(len(result))
result.index = pd.melt(result[["WT_Rank","Mut_Rank"]],var_name = "Type",value_name = "Rank")
Rank from scipy import stats
= stats.wilcoxon(result["WT_Rank"],result["Mut_Rank"])
_,p = plt.subplots(figsize = (4,6))
fig,ax = sns.boxplot(data = Rank,x = "Type",y = "Rank",width = 0.4,color = "skyblue",linewidth = 2,showfliers = False)
ax = sns.stripplot(data = Rank,x = "Type",y = "Rank",color = "black",jitter = False)
b
for i in range(len(result)):
0,1],[result["WT_Rank"][i],result["Mut_Rank"][i]],c ="gray",lw = 0.25)
plt.plot([#> [<matplotlib.lines.Line2D object at 0x1789714f0>]
#> [<matplotlib.lines.Line2D object at 0x1789719a0>]
#> [<matplotlib.lines.Line2D object at 0x295e83100>]
#> [<matplotlib.lines.Line2D object at 0x295e83880>]
#> [<matplotlib.lines.Line2D object at 0x295e839a0>]
#> [<matplotlib.lines.Line2D object at 0x295d46eb0>]
#> [<matplotlib.lines.Line2D object at 0x295d46e50>]
#> [<matplotlib.lines.Line2D object at 0x295d46220>]
#> [<matplotlib.lines.Line2D object at 0x295d466d0>]
#> [<matplotlib.lines.Line2D object at 0x295d46700>]
#> [<matplotlib.lines.Line2D object at 0x295e9fc70>]
#> [<matplotlib.lines.Line2D object at 0x295e9fa90>]
#> [<matplotlib.lines.Line2D object at 0x295e9f640>]
#> [<matplotlib.lines.Line2D object at 0x295e9f100>]
#> [<matplotlib.lines.Line2D object at 0x295e82fa0>]
#> [<matplotlib.lines.Line2D object at 0x295e82c10>]
#> [<matplotlib.lines.Line2D object at 0x295e82340>]
#> [<matplotlib.lines.Line2D object at 0x295e82790>]
#> [<matplotlib.lines.Line2D object at 0x296060dc0>]
#> [<matplotlib.lines.Line2D object at 0x296060a00>]
#> [<matplotlib.lines.Line2D object at 0x296060280>]
#> [<matplotlib.lines.Line2D object at 0x2960608b0>]
#> [<matplotlib.lines.Line2D object at 0x296060eb0>]
#> [<matplotlib.lines.Line2D object at 0x296060b50>]
#> [<matplotlib.lines.Line2D object at 0x295edba00>]
#> [<matplotlib.lines.Line2D object at 0x296114250>]
#> [<matplotlib.lines.Line2D object at 0x296114d30>]
#> [<matplotlib.lines.Line2D object at 0x296114790>]
#> [<matplotlib.lines.Line2D object at 0x296114460>]
#> [<matplotlib.lines.Line2D object at 0x295dfb790>]
#> [<matplotlib.lines.Line2D object at 0x295dfbf10>]
#> [<matplotlib.lines.Line2D object at 0x1787047f0>]
#> [<matplotlib.lines.Line2D object at 0x1787cb640>]
#> [<matplotlib.lines.Line2D object at 0x2d2225430>]
#> [<matplotlib.lines.Line2D object at 0x295d8f7f0>]
#> [<matplotlib.lines.Line2D object at 0x295d9a5e0>]
#> [<matplotlib.lines.Line2D object at 0x2960132b0>]
#> [<matplotlib.lines.Line2D object at 0x295d87b20>]
#> [<matplotlib.lines.Line2D object at 0x17890ed00>]
#> [<matplotlib.lines.Line2D object at 0x29602d700>]
#> [<matplotlib.lines.Line2D object at 0x29605be80>]
#> [<matplotlib.lines.Line2D object at 0x296122a90>]
#> [<matplotlib.lines.Line2D object at 0x296013c70>]
#> [<matplotlib.lines.Line2D object at 0x296013a00>]
#> [<matplotlib.lines.Line2D object at 0x17ea62f70>]
#> [<matplotlib.lines.Line2D object at 0x295eb0d90>]
#> [<matplotlib.lines.Line2D object at 0x295eb08e0>]
#> [<matplotlib.lines.Line2D object at 0x295eb03a0>]
#> [<matplotlib.lines.Line2D object at 0x295eb03d0>]
#> [<matplotlib.lines.Line2D object at 0x295eb0fa0>]
#> [<matplotlib.lines.Line2D object at 0x17890f1f0>]
#> [<matplotlib.lines.Line2D object at 0x17874ab20>]
= 0, 1
x1, x2 = Rank['Rank'].max() + 0.1, 0.1, 'k'
y, h, col +h, y+h, y], lw=1, c=col)
plt.plot([x1, x1, x2, x2], [y, y#> [<matplotlib.lines.Line2D object at 0x17e8baf10>]
+x2)*.5, y+h, "**", ha='center', va='bottom', color=col)
plt.text((x1#> Text(0.5, 1.0549238341796203, '**')
0.3,0.9,"P = {:.4f}".format(p),weight="bold")
ax.text(#> Text(0.3, 0.9, 'P = 0.0009')
0,1.2])
plt.ylim([#> (0.0, 1.2)
0,1.2,0.2))
plt.yticks(np.arange(#> ([<matplotlib.axis.YTick object at 0x178774760>, <matplotlib.axis.YTick object at 0x1787744c0>, <matplotlib.axis.YTick object at 0x1787231c0>, <matplotlib.axis.YTick object at 0x295ec09d0>, <matplotlib.axis.YTick object at 0x17876ea30>, <matplotlib.axis.YTick object at 0x17876e220>], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
0,1),labels = ["WT peptide","Neoantigen"])
plt.xticks((#> ([<matplotlib.axis.XTick object at 0x178723cd0>, <matplotlib.axis.XTick object at 0x1787232e0>], [Text(0, 0, 'WT peptide'), Text(1, 0, 'Neoantigen')])
plt.tight_layout()= 3, trim = True)
sns.despine(offset "../figure/neoantigen.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
#python
= pd.read_csv("../data/neoantigen_result.csv")
result = result[result["immunogenic"] == 1]
result = pd.melt(result[["WT_prediction","Mut_prediction"]],var_name = "Type",value_name = "Score")
Rank from scipy import stats
= stats.wilcoxon(result["WT_prediction"],result["Mut_prediction"])
_,p = plt.subplots(figsize = (4,6))
fig,ax = sns.boxplot(data = Rank,x = "Type",y = "Score",width = 0.4,color = "skyblue",linewidth = 2,showfliers = False)
ax = sns.stripplot(data = Rank,x = "Type",y = "Score",color = "red")
b = Rank,x = "Type",y = "Score",color = "red")
sns.lineplot(data #> <AxesSubplot:xlabel='Type', ylabel='Score'>
= 0, 1
x1, x2 = Rank['Score'].max() + 0.1, 0.1, 'k'
y, h, col +h, y+h, y], lw=1, c=col)
plt.plot([x1, x1, x2, x2], [y, y#> [<matplotlib.lines.Line2D object at 0x17ed88550>]
+x2)*.5, y+h, "**", ha='center', va='bottom', color=col)
plt.text((x1#> Text(0.5, 1.17707355, '**')
0.3,0.9,"P = {:.4f}".format(p),weight="bold")
ax.text(#> Text(0.3, 0.9, 'P = 0.0004')
0,1.2])
plt.ylim([#> (0.0, 1.2)
0,1.2,0.2))
plt.yticks(np.arange(#> ([<matplotlib.axis.YTick object at 0x29609e280>, <matplotlib.axis.YTick object at 0x29609eac0>, <matplotlib.axis.YTick object at 0x29605ba30>, <matplotlib.axis.YTick object at 0x17e0f8760>, <matplotlib.axis.YTick object at 0x17e84baf0>, <matplotlib.axis.YTick object at 0x296035f70>], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
0,1),labels = ["WT peptide","Neoantigen"])
plt.xticks((#> ([<matplotlib.axis.XTick object at 0x29605b370>, <matplotlib.axis.XTick object at 0x29605b1c0>], [Text(0, 0, 'WT peptide'), Text(1, 0, 'Neoantigen')])
plt.tight_layout()= 3, trim = True)
sns.despine(offset #plt.savefig("../figure/neoantigen_prediction.pdf",dpi = 300,transparent=True)
plt.show()
Position importance
TLimmuno2
We performed in silico mutational analyses to investigate whether TLimmuno2 was able to learn the molecular characteristics of the peptide. By changing the amino acid residues at each position, we compare the differences of the predicted values, and the largest decrease corresponds to the greater importance of the position. We do this in two methods: ala-scanning and zero-setting. For ala-scanning, we sequentially replaced each position as the alanine; for zero-setting, we sequentially mutated each position (the representation vector for the mutated position is all zeros). We simulated this process 100 times and an ascending ranking was performed each time to highlight the most salient positions.
#python
#PBS
import pandas as pd
import tensorflow as tf
import sys
"/public/slst/home/wanggsh/Project/MHCII/utils")
sys.path.append(from Blosum62 import blosum62
import numpy as np
import random
import matplotlib.pyplot as plt
import matplotlib as mpl
import os
#os.environ["CUDA_VISIBLE_DEVICES"]="-1"
= pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/IMM_data_random2.csv")
immuno_data = immuno_data[immuno_data["label"] == 1]
immuno_data "length"] = immuno_data["Description"].map(len)
immuno_data[#choose 15mer
= immuno_data[immuno_data["length"] == 15]
immuno_data = immuno_data.reset_index(drop=True)
immuno_data
"pep_blosum"] = immuno_data["Description"].apply(blosum62,args=(21,))
immuno_data["MHC_blosum"] = immuno_data["sequence"].apply(blosum62,args=(34,))
immuno_data[= np.empty((len(immuno_data),21,21))
peptide for i in range(len(immuno_data)):
= immuno_data["pep_blosum"][i].reshape((21,21))
peptide[i] #mhc
= np.empty((len(immuno_data),34,21))
MHC for i in range(len(immuno_data)):
= immuno_data["MHC_blosum"][i].reshape((34,21))
MHC[i] = list(blosum62("A",1))
Ala
= tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
BAmodel = tf.keras.models.Model(inputs = BAmodel.input,outputs = BAmodel.layers[-2].output)
BAmodel = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/MHCII_immuno/model")
IMM
= 100
n = np.empty([n,15])
position_rank for m in range(n):
= np.random.choice(np.arange(len(immuno_data)),2000)
index = peptide[index].copy()
peptide_sam = MHC[index]
MHC_sam = BAmodel.predict([peptide_sam,MHC_sam])
BA = IMM([peptide_sam,MHC_sam,BA]).numpy().mean()
pred_ori = []
array for i in range(15):
= 0
peptide_sam[:,i,:] = BAmodel.predict([peptide_sam,MHC_sam])
BA = pred_ori - IMM([peptide_sam,MHC_sam,BA]).numpy().mean()
importance
array.append(importance)= peptide[index].copy()
peptide_sam = np.argsort(array) + 1
ranking = []
tmp for i in range(15):
list(ranking).index(i+1))
tmp.append(= tmp
position_rank[m,:] "/public/slst/home/wanggsh/Data/MHCII_immuno/position_importance/position.npy",position_rank)
np.save(
= 100
n = np.empty([n,15])
position_rank_ala for m in range(n):
= np.random.choice(np.arange(len(immuno_data)),2000)
index = peptide[index].copy()
peptide_sam = MHC[index]
MHC_sam = BAmodel.predict([peptide_sam,MHC_sam])
BA = IMM([peptide_sam,MHC_sam,BA]).numpy().mean()
pred_ori = []
array for i in range(15):
= Ala
peptide_sam[:,i,:] = BAmodel.predict([peptide_sam,MHC_sam])
BA = pred_ori - IMM([peptide_sam,MHC_sam,BA]).numpy().mean()
importance
array.append(importance)= peptide[index].copy()
peptide_sam = np.argsort(array) + 1
ranking = []
tmp for i in range(15):
list(ranking).index(i+1))
tmp.append(= tmp
position_rank[m,:] "/public/slst/home/wanggsh/Data/MHCII_immuno/position_importance/position_ala.npy",position_rank) np.save(
plt figure
#python
from collections import Counter
= np.load("../data/position.npy")
position_rank = mpl.cm.get_cmap('tab20')
cmap = np.linspace(0,1,15)
delim = [mpl.colors.rgb2hex(cmap(i)[:4]) for i in delim]
colors = plt.subplots(figsize = (6,4))
fig,ax for i in np.arange(15):
= list(Counter(position_rank[:,i]+1).keys())
y = list(Counter(position_rank[:,i]+1).values())
s for n in range(len(y))],y, s=[m*3.5 for m in s],c=colors[i])
ax.scatter([i #> <matplotlib.collections.PathCollection object at 0x295ec06a0>
#> <matplotlib.collections.PathCollection object at 0x295ec05b0>
#> <matplotlib.collections.PathCollection object at 0x2960a8280>
#> <matplotlib.collections.PathCollection object at 0x2960a8310>
#> <matplotlib.collections.PathCollection object at 0x2960a8be0>
#> <matplotlib.collections.PathCollection object at 0x29603a760>
#> <matplotlib.collections.PathCollection object at 0x29603aac0>
#> <matplotlib.collections.PathCollection object at 0x295ec0d60>
#> <matplotlib.collections.PathCollection object at 0x295ec0430>
#> <matplotlib.collections.PathCollection object at 0x2d20e3790>
#> <matplotlib.collections.PathCollection object at 0x295edd3d0>
#> <matplotlib.collections.PathCollection object at 0x2d20e3340>
#> <matplotlib.collections.PathCollection object at 0x2964eb430>
#> <matplotlib.collections.PathCollection object at 0x295ec0730>
#> <matplotlib.collections.PathCollection object at 0x2964eb4f0>
0,16)
ax.set_ylim(#> (0.0, 16.0)
15)+1)
ax.set_yticks(np.arange(#> [<matplotlib.axis.YTick object at 0x29606faf0>, <matplotlib.axis.YTick object at 0x281fb6a30>, <matplotlib.axis.YTick object at 0x295e4d460>, <matplotlib.axis.YTick object at 0x29653bdf0>, <matplotlib.axis.YTick object at 0x2960a4b20>, <matplotlib.axis.YTick object at 0x2d20c75b0>, <matplotlib.axis.YTick object at 0x2964c3820>, <matplotlib.axis.YTick object at 0x2d20c72e0>, <matplotlib.axis.YTick object at 0x2964c3f70>, <matplotlib.axis.YTick object at 0x2964c3a60>, <matplotlib.axis.YTick object at 0x2960ad760>, <matplotlib.axis.YTick object at 0x2960ad190>, <matplotlib.axis.YTick object at 0x2d2181e50>, <matplotlib.axis.YTick object at 0x2960adb20>, <matplotlib.axis.YTick object at 0x2964c3730>]
15))
ax.set_xticks(np.arange(#> [<matplotlib.axis.XTick object at 0x281fb65b0>, <matplotlib.axis.XTick object at 0x281fb6af0>, <matplotlib.axis.XTick object at 0x29609ea00>, <matplotlib.axis.XTick object at 0x2d2181dc0>, <matplotlib.axis.XTick object at 0x2d21a2190>, <matplotlib.axis.XTick object at 0x2d21a28e0>, <matplotlib.axis.XTick object at 0x2d21a3070>, <matplotlib.axis.XTick object at 0x2d21a2b50>, <matplotlib.axis.XTick object at 0x2960ad640>, <matplotlib.axis.XTick object at 0x2964c35e0>, <matplotlib.axis.XTick object at 0x2d21a3670>, <matplotlib.axis.XTick object at 0x2d21a3cd0>, <matplotlib.axis.XTick object at 0x2d2199460>, <matplotlib.axis.XTick object at 0x2d2199bb0>, <matplotlib.axis.XTick object at 0x2d2199700>]
'1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'])
ax.set_xticklabels([#> [Text(0, 0, '1'), Text(1, 0, '2'), Text(2, 0, '3'), Text(3, 0, '4'), Text(4, 0, '5'), Text(5, 0, '6'), Text(6, 0, '7'), Text(7, 0, '8'), Text(8, 0, '9'), Text(9, 0, '10'), Text(10, 0, '11'), Text(11, 0, '12'), Text(12, 0, '13'), Text(13, 0, '14'), Text(14, 0, '15')]
'Zero setting')
ax.set_xlabel(#> Text(0.5, 0, 'Zero setting')
'Ranking(ascending)')
ax.set_ylabel(#> Text(0, 0.5, 'Ranking(ascending)')
True,alpha=0.2)
ax.grid(= [ax.plot([],[],color='grey',marker='o',markersize=i,ls='')[0] for i in range(8,15,2)]
h1 = ax.legend(handles=h1,labels=[10,40,70,100],title='Frequency',loc='lower left',bbox_to_anchor=(0.8,0.6),frameon=False)
leg1
ax.add_artist(leg1)#> <matplotlib.legend.Legend object at 0x2960b0040>
"../figure/position_zero.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
#python
= np.load("../data/position_ala.npy")
position_rank_ala = mpl.cm.get_cmap('tab20')
cmap = np.linspace(0,1,15)
delim = [mpl.colors.rgb2hex(cmap(i)[:4]) for i in delim]
colors = plt.subplots(figsize = (6,4))
fig,ax for i in np.arange(15):
= list(Counter(position_rank_ala[:,i]+1).keys())
y = list(Counter(position_rank_ala[:,i]+1).values())
s for n in range(len(y))],y, s=[m*3.5 for m in s],c=colors[i])
ax.scatter([i #> <matplotlib.collections.PathCollection object at 0x295ecfa60>
#> <matplotlib.collections.PathCollection object at 0x295ecf4f0>
#> <matplotlib.collections.PathCollection object at 0x295ecf670>
#> <matplotlib.collections.PathCollection object at 0x296231490>
#> <matplotlib.collections.PathCollection object at 0x296231ee0>
#> <matplotlib.collections.PathCollection object at 0x295ecdfa0>
#> <matplotlib.collections.PathCollection object at 0x295ecd430>
#> <matplotlib.collections.PathCollection object at 0x2d2185e50>
#> <matplotlib.collections.PathCollection object at 0x2d2185bb0>
#> <matplotlib.collections.PathCollection object at 0x2962315b0>
#> <matplotlib.collections.PathCollection object at 0x2d2185be0>
#> <matplotlib.collections.PathCollection object at 0x296085eb0>
#> <matplotlib.collections.PathCollection object at 0x296191460>
#> <matplotlib.collections.PathCollection object at 0x296191ac0>
#> <matplotlib.collections.PathCollection object at 0x296191ee0>
0,16)
ax.set_ylim(#> (0.0, 16.0)
15)+1)
ax.set_yticks(np.arange(#> [<matplotlib.axis.YTick object at 0x2d21a2430>, <matplotlib.axis.YTick object at 0x2960ad040>, <matplotlib.axis.YTick object at 0x2d20c7e50>, <matplotlib.axis.YTick object at 0x295eaa580>, <matplotlib.axis.YTick object at 0x295eaabe0>, <matplotlib.axis.YTick object at 0x29610f730>, <matplotlib.axis.YTick object at 0x296121f40>, <matplotlib.axis.YTick object at 0x296121df0>, <matplotlib.axis.YTick object at 0x296197460>, <matplotlib.axis.YTick object at 0x2961212b0>, <matplotlib.axis.YTick object at 0x29610f9d0>, <matplotlib.axis.YTick object at 0x2961972b0>, <matplotlib.axis.YTick object at 0x29613d820>, <matplotlib.axis.YTick object at 0x29613d7f0>, <matplotlib.axis.YTick object at 0x296150e50>]
15))
ax.set_xticks(np.arange(#> [<matplotlib.axis.XTick object at 0x2960adcd0>, <matplotlib.axis.XTick object at 0x2960add60>, <matplotlib.axis.XTick object at 0x2d21817f0>, <matplotlib.axis.XTick object at 0x29613d910>, <matplotlib.axis.XTick object at 0x2961979a0>, <matplotlib.axis.XTick object at 0x2d21a28b0>, <matplotlib.axis.XTick object at 0x296150ca0>, <matplotlib.axis.XTick object at 0x295edf3d0>, <matplotlib.axis.XTick object at 0x295edfb20>, <matplotlib.axis.XTick object at 0x295ec82b0>, <matplotlib.axis.XTick object at 0x295edfbe0>, <matplotlib.axis.XTick object at 0x296121d00>, <matplotlib.axis.XTick object at 0x295eb95b0>, <matplotlib.axis.XTick object at 0x295eaa2b0>, <matplotlib.axis.XTick object at 0x2961502b0>]
'1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'])
ax.set_xticklabels([#> [Text(0, 0, '1'), Text(1, 0, '2'), Text(2, 0, '3'), Text(3, 0, '4'), Text(4, 0, '5'), Text(5, 0, '6'), Text(6, 0, '7'), Text(7, 0, '8'), Text(8, 0, '9'), Text(9, 0, '10'), Text(10, 0, '11'), Text(11, 0, '12'), Text(12, 0, '13'), Text(13, 0, '14'), Text(14, 0, '15')]
'Ala scanning')
ax.set_xlabel(#> Text(0.5, 0, 'Ala scanning')
'Ranking(ascending)')
ax.set_ylabel(#> Text(0, 0.5, 'Ranking(ascending)')
True,alpha=0.2)
ax.grid(= [ax.plot([],[],color='grey',marker='o',markersize=i,ls='')[0] for i in range(8,15,2)]
h1 = ax.legend(handles=h1,labels=[10,40,70,100],title='Frequency',loc='lower left',bbox_to_anchor=(0.8,0.6),frameon=False)
leg1
ax.add_artist(leg1)#> <matplotlib.legend.Legend object at 0x17e081610>
"../figure/position_ala.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
Both results show that P4 (residue 4), P5, and P6 are very important for model prediction. This result is consistent with the results related to MHC class I molecules, and it is reported that these positions are essential for interacting with the TCR. But unlike MHC I molecules, MHC II molecules exhibit a tendency to have less effect on model predictions due to its position closer to the tail, possibly due to its open binding cleft. The results of Ala-scanning are more stable rather than zero-setting, because Ala also has biological functions.
BA model
We also do the same things by using BA model.
import pandas as pd
import tensorflow as tf
import sys
"/public/slst/home/wanggsh/Project/MHCII/utils")
sys.path.append(from Blosum62 import blosum62
import numpy as np
import random
import matplotlib.pyplot as plt
import matplotlib as mpl
import os
#os.environ["CUDA_VISIBLE_DEVICES"]="-1"
= pd.read_csv("/public/slst/home/wanggsh/Data/MHCII/netMHCpanII4.0_train_data/BA_data/BA_MHC_label.csv")
immuno_data = immuno_data[immuno_data["score"] == 1]
immuno_data "length"] = immuno_data["peptide"].map(len)
immuno_data[#choose 15mer
= immuno_data[immuno_data["length"] == 15]
immuno_data = immuno_data.reset_index(drop=True)
immuno_data
"pep_blosum"] = immuno_data["peptide"].apply(blosum62,args=(21,))
immuno_data["MHC_blosum"] = immuno_data["sequence"].apply(blosum62,args=(34,))
immuno_data[= np.empty((len(immuno_data),21,21))
peptide for i in range(len(immuno_data)):
= immuno_data["pep_blosum"][i].reshape((21,21))
peptide[i] #mhc
= np.empty((len(immuno_data),34,21))
MHC for i in range(len(immuno_data)):
= immuno_data["MHC_blosum"][i].reshape((34,21))
MHC[i] = list(blosum62("A",1))
Ala
= tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
BAmodel
= 100
n = np.empty([n,15])
position_rank for m in range(n):
= np.random.choice(np.arange(len(immuno_data)),2000)
index = peptide[index].copy()
peptide_sam = MHC[index]
MHC_sam = BAmodel([peptide_sam,MHC_sam]).numpy().mean()
pred_ori = []
array for i in range(15):
= 0
peptide_sam[:,i,:] = pred_ori - BAmodel([peptide_sam,MHC_sam]).numpy().mean()
importance
array.append(importance)= peptide[index].copy()
peptide_sam = np.argsort(array) + 1
ranking = []
tmp for i in range(15):
list(ranking).index(i+1))
tmp.append(= tmp
position_rank[m,:] "/public/slst/home/wanggsh/Data/MHCII_immuno/position_importance/position_zero_BA.npy",position_rank)
np.save(
= 100
n = np.empty([n,15])
position_rank_ala for m in range(n):
= np.random.choice(np.arange(len(immuno_data)),2000)
index = peptide[index].copy()
peptide_sam = MHC[index]
MHC_sam = BAmodel([peptide_sam,MHC_sam]).numpy().mean()
pred_ori = []
array for i in range(15):
= Ala
peptide_sam[:,i,:] = pred_ori - BAmodel([peptide_sam,MHC_sam]).numpy().mean()
importance
array.append(importance)= peptide[index].copy()
peptide_sam = np.argsort(array) + 1
ranking = []
tmp for i in range(15):
list(ranking).index(i+1))
tmp.append(= tmp
position_rank[m,:] "/public/slst/home/wanggsh/Data/MHCII_immuno/position_importance/position_ala_BA.npy",position_rank) np.save(
#python
from collections import Counter
= np.load("../data/position_zero_BA.npy")
position_rank = mpl.cm.get_cmap('tab20')
cmap = np.linspace(0,1,15)
delim = [mpl.colors.rgb2hex(cmap(i)[:4]) for i in delim]
colors = plt.subplots(figsize = (6,6))
fig,ax for i in np.arange(15):
= list(Counter(position_rank[:,i]+1).keys())
y = list(Counter(position_rank[:,i]+1).values())
s for n in range(len(y))],y, s=[m*3.5 for m in s],c=colors[i])
ax.scatter([i #> <matplotlib.collections.PathCollection object at 0x295dc8160>
#> <matplotlib.collections.PathCollection object at 0x295eb0730>
#> <matplotlib.collections.PathCollection object at 0x295eb08b0>
#> <matplotlib.collections.PathCollection object at 0x296121fd0>
#> <matplotlib.collections.PathCollection object at 0x295eb00a0>
#> <matplotlib.collections.PathCollection object at 0x295d0b970>
#> <matplotlib.collections.PathCollection object at 0x178971f10>
#> <matplotlib.collections.PathCollection object at 0x178971340>
#> <matplotlib.collections.PathCollection object at 0x17e91cdf0>
#> <matplotlib.collections.PathCollection object at 0x178971b50>
#> <matplotlib.collections.PathCollection object at 0x295dc8d30>
#> <matplotlib.collections.PathCollection object at 0x2960339d0>
#> <matplotlib.collections.PathCollection object at 0x296035250>
#> <matplotlib.collections.PathCollection object at 0x2960ad250>
#> <matplotlib.collections.PathCollection object at 0x29609e6d0>
0,16)
ax.set_ylim(#> (0.0, 16.0)
15)+1)
ax.set_yticks(np.arange(#> [<matplotlib.axis.YTick object at 0x295eb9280>, <matplotlib.axis.YTick object at 0x295eb97c0>, <matplotlib.axis.YTick object at 0x295eaa790>, <matplotlib.axis.YTick object at 0x295edb8e0>, <matplotlib.axis.YTick object at 0x295edb880>, <matplotlib.axis.YTick object at 0x1789382b0>, <matplotlib.axis.YTick object at 0x17ed313a0>, <matplotlib.axis.YTick object at 0x295e9f0a0>, <matplotlib.axis.YTick object at 0x295e9f580>, <matplotlib.axis.YTick object at 0x29602dc10>, <matplotlib.axis.YTick object at 0x295e9fcd0>, <matplotlib.axis.YTick object at 0x178938fa0>, <matplotlib.axis.YTick object at 0x29602dc40>, <matplotlib.axis.YTick object at 0x28d2af430>, <matplotlib.axis.YTick object at 0x17890e9d0>]
15))
ax.set_xticks(np.arange(#> [<matplotlib.axis.XTick object at 0x295ecda90>, <matplotlib.axis.XTick object at 0x295ecd970>, <matplotlib.axis.XTick object at 0x29606fe20>, <matplotlib.axis.XTick object at 0x296122f40>, <matplotlib.axis.XTick object at 0x296122df0>, <matplotlib.axis.XTick object at 0x29602ddf0>, <matplotlib.axis.XTick object at 0x295ec0130>, <matplotlib.axis.XTick object at 0x295cb2d90>, <matplotlib.axis.XTick object at 0x178907fa0>, <matplotlib.axis.XTick object at 0x1789076a0>, <matplotlib.axis.XTick object at 0x296070670>, <matplotlib.axis.XTick object at 0x1789076d0>, <matplotlib.axis.XTick object at 0x295ec0610>, <matplotlib.axis.XTick object at 0x296070a00>, <matplotlib.axis.XTick object at 0x178930370>]
'1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'])
ax.set_xticklabels([#> [Text(0, 0, '1'), Text(1, 0, '2'), Text(2, 0, '3'), Text(3, 0, '4'), Text(4, 0, '5'), Text(5, 0, '6'), Text(6, 0, '7'), Text(7, 0, '8'), Text(8, 0, '9'), Text(9, 0, '10'), Text(10, 0, '11'), Text(11, 0, '12'), Text(12, 0, '13'), Text(13, 0, '14'), Text(14, 0, '15')]
'Zero setting')
ax.set_xlabel(#> Text(0.5, 0, 'Zero setting')
'Ranking(ascending)')
ax.set_ylabel(#> Text(0, 0.5, 'Ranking(ascending)')
True,alpha=0.2)
ax.grid(= [ax.plot([],[],color='grey',marker='o',markersize=i,ls='')[0] for i in range(8,15,2)]
h1 = ax.legend(handles=h1,labels=[10,40,70,100],title='Frequency',loc='lower left',bbox_to_anchor=(0.98,0.2),frameon=False)
leg1
ax.add_artist(leg1)#> <matplotlib.legend.Legend object at 0x29609eee0>
"../figure/position_zero_BA.pdf",dpi = 300,bbox_inches='tight',transparent=True)
plt.savefig( plt.show()
#python
= np.load("../data/position_ala_BA.npy")
position_rank_ala = mpl.cm.get_cmap('tab20')
cmap = np.linspace(0,1,15)
delim = [mpl.colors.rgb2hex(cmap(i)[:4]) for i in delim]
colors = plt.subplots(figsize = (6,6))
fig,ax for i in np.arange(15):
= list(Counter(position_rank_ala[:,i]+1).keys())
y = list(Counter(position_rank_ala[:,i]+1).values())
s for n in range(len(y))],y, s=[m*3.5 for m in s],c=colors[i])
ax.scatter([i #> <matplotlib.collections.PathCollection object at 0x29602d370>
#> <matplotlib.collections.PathCollection object at 0x29602d280>
#> <matplotlib.collections.PathCollection object at 0x2af891a30>
#> <matplotlib.collections.PathCollection object at 0x29602d430>
#> <matplotlib.collections.PathCollection object at 0x2af8af5b0>
#> <matplotlib.collections.PathCollection object at 0x2af8afd60>
#> <matplotlib.collections.PathCollection object at 0x2af8afe80>
#> <matplotlib.collections.PathCollection object at 0x178938340>
#> <matplotlib.collections.PathCollection object at 0x178938f10>
#> <matplotlib.collections.PathCollection object at 0x2af8af7f0>
#> <matplotlib.collections.PathCollection object at 0x2af8afd90>
#> <matplotlib.collections.PathCollection object at 0x295d70670>
#> <matplotlib.collections.PathCollection object at 0x295d70a30>
#> <matplotlib.collections.PathCollection object at 0x29603a730>
#> <matplotlib.collections.PathCollection object at 0x295d70550>
0,16)
ax.set_ylim(#> (0.0, 16.0)
15)+1)
ax.set_yticks(np.arange(#> [<matplotlib.axis.YTick object at 0x295e4d5b0>, <matplotlib.axis.YTick object at 0x295d0b8e0>, <matplotlib.axis.YTick object at 0x296060fa0>, <matplotlib.axis.YTick object at 0x2961215b0>, <matplotlib.axis.YTick object at 0x296114460>, <matplotlib.axis.YTick object at 0x296060700>, <matplotlib.axis.YTick object at 0x17ecf2910>, <matplotlib.axis.YTick object at 0x17ed88970>, <matplotlib.axis.YTick object at 0x17ecf27f0>, <matplotlib.axis.YTick object at 0x296114250>, <matplotlib.axis.YTick object at 0x295df31f0>, <matplotlib.axis.YTick object at 0x2960a4610>, <matplotlib.axis.YTick object at 0x2960a4160>, <matplotlib.axis.YTick object at 0x29603a160>, <matplotlib.axis.YTick object at 0x295ec0640>]
15))
ax.set_xticks(np.arange(#> [<matplotlib.axis.XTick object at 0x296060fd0>, <matplotlib.axis.XTick object at 0x296060af0>, <matplotlib.axis.XTick object at 0x2960a4d00>, <matplotlib.axis.XTick object at 0x296121c10>, <matplotlib.axis.XTick object at 0x295ec04f0>, <matplotlib.axis.XTick object at 0x29653b280>, <matplotlib.axis.XTick object at 0x2960b0790>, <matplotlib.axis.XTick object at 0x17e9902b0>, <matplotlib.axis.XTick object at 0x2960b0eb0>, <matplotlib.axis.XTick object at 0x29653b160>, <matplotlib.axis.XTick object at 0x2d2191b50>, <matplotlib.axis.XTick object at 0x2d21917c0>, <matplotlib.axis.XTick object at 0x295eb9940>, <matplotlib.axis.XTick object at 0x295d0b0a0>, <matplotlib.axis.XTick object at 0x295ecdca0>]
'1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'])
ax.set_xticklabels([#> [Text(0, 0, '1'), Text(1, 0, '2'), Text(2, 0, '3'), Text(3, 0, '4'), Text(4, 0, '5'), Text(5, 0, '6'), Text(6, 0, '7'), Text(7, 0, '8'), Text(8, 0, '9'), Text(9, 0, '10'), Text(10, 0, '11'), Text(11, 0, '12'), Text(12, 0, '13'), Text(13, 0, '14'), Text(14, 0, '15')]
'Ala scaning')
ax.set_xlabel(#> Text(0.5, 0, 'Ala scaning')
'Ranking(ascending)')
ax.set_ylabel(#> Text(0, 0.5, 'Ranking(ascending)')
True,alpha=0.2)
ax.grid(= [ax.plot([],[],color='grey',marker='o',markersize=i,ls='')[0] for i in range(8,15,2)]
h1 = ax.legend(handles=h1,labels=[10,40,70,100],title='Frequency',loc='lower left',bbox_to_anchor=(0.98,0.2),frameon=False)
leg1
ax.add_artist(leg1)#> <matplotlib.legend.Legend object at 0x1787df040>
"../figure/position_ala_BA.pdf",dpi = 300,bbox_inches='tight',transparent=True)
plt.savefig( plt.show()
Furthermore, the BA model does not show the same trend, it is possible that the features learned by the affinity prediction model are the interaction between MHC and peptide compared to the features learned by the immunogenicity prediction model.
Compare with another meachine learning and other deep learning methods
To further investigate the importance of model structure to TLimmuno2 performance, we used other model architectures and compared their performance. We constructed four traditional machine learning classification models (AdaBoost, KNN, Random Forest, and SVM) and tuned their hyperparameters by cross-validation. In addition, we constructed deep learning models of deep neural network (DNN) and convolutional neural network (CNN). All models were trained using the same data. We validated these models in our curated neoepitope dataset.
from sklearn.metrics import roc_auc_score,make_scorer
= make_scorer(roc_auc_score)
scorer = pd.read_csv("../data/BLOSUM62.csv",comment="#")
Blosum62_matrix = list("ARNDCQEGHILKMFPSTWYVX")
Protein_alphabet = Blosum62_matrix[Protein_alphabet]
Blosum62_matrix = Blosum62_matrix.loc[Protein_alphabet]
Blosum62_matrix
def blosum62(peptide,maxlen):
= np.empty((maxlen,21))
encoder if len(peptide) <=maxlen:
= peptide + "X"*(maxlen-len(peptide))
peptide for i in range(len(peptide)):
= list(peptide)[i]
pep = Blosum62_matrix[pep]
coder = coder
encoder[i] return encoder.flatten()
def auc_pr(data,true_label,prediction,rank = False):
if rank:
= roc_curve(data[true_label],1-data[prediction])
fpr,tpr,_ = precision_recall_curve(data[true_label],1-data[prediction])
precision,recall,_ else:
= roc_curve(data[true_label],data[prediction])
fpr,tpr,_ = precision_recall_curve(data[true_label],data[prediction])
precision,recall,_
= auc(fpr,tpr)
AUC = auc(recall,precision)
PR
return fpr,tpr,recall,precision,AUC,PR
def IMM_process(Data,peptide):
= Data["pep_ID"].unique()
sam_pep_ID = []
res = []
p_ID "length"] = Data["pep"].map(len)
Data[for a in sam_pep_ID:
= []
Mean = Data[Data["pep_ID"] == a]
ID for i in ID["HLA"].unique():
= ID[ID["HLA"] == i]
Data_HLA = []
prediction for l in Data_HLA["length"].unique():
= Data_HLA[Data_HLA["length"] == l]
Data_len = Data_len["prediction"].max()
pre
prediction.append(pre)#x_mean = x["Rank"].mean()
= np.mean(prediction)
x_mean
Mean.append(x_mean)max(Mean))
res.append(
p_ID.append(a)= pd.DataFrame({"pep_ID":p_ID,"prediction":res})
result = result.merge(peptide,how = "inner",on = "pep_ID" )
result
return result
def evaluate(estimator,test_X,test_Y):
= estimator.predict(test_X)
pred = roc_auc_score(test_Y,pred)
result return result
def evaluate2(estimator,data,peptide,test_X):
"prediction"] = estimator.predict(test_X)
data[= IMM_process(data,peptide)
result = auc_pr(result,"Immunogenicity","prediction")
result return result[4]
def method(estimator):
= KFold(n_splits=5)
kf = list(kf.split(np.arange(X.shape[0])))
fold_indices = {'validation': [], 'mer15': [],'kmer': []}
holding for fold in fold_indices:
# split
= X[fold[0]], np.array(Y)[fold[0]], X[fold[1]], np.array(Y)[fold[1]]
train_X, train_Y, test_X, test_Y # train
estimator.fit(train_X, train_Y)# test in validation set
= evaluate(estimator,test_X,test_Y)
result_validation 'validation'].append(result_validation)
holding[# test in mer15
= evaluate2(estimator,mer15,peptide,mer15_X)
AUC 'mer15'].append(AUC)
holding[# test in kmer
= evaluate2(estimator,kmer,peptide,kmer_X)
AUC 'kmer'].append(AUC)
holding[return holding
Data process
= pd.read_csv("../data/peptide.csv")
peptide = pd.read_csv("../data/IMM_data_random2.csv")
immuno_data "pep_blosum"] = immuno_data["Description"].apply(blosum62,args=(21,))
immuno_data["MHC_blosum"] = immuno_data["sequence"].apply(blosum62,args=(34,))
immuno_data[= np.empty((len(immuno_data),1155))
X for i in range(len(immuno_data)):
= immuno_data.iloc[i,6]
x = immuno_data.iloc[i,7]
y = np.concatenate((x,y))
X[i, :] = immuno_data["label"].values
Y = pd.read_csv("../data/15mer_data.csv")
mer15 "pep_blosum"] = mer15["pep"].apply(blosum62,args=(21,))
mer15["MHC_blosum"] = mer15["sequence"].apply(blosum62,args=(34,))
mer15[= np.empty((len(mer15),1155))
mer15_X for i in range(len(mer15)):
= mer15.iloc[i,5]
x = mer15.iloc[i,6]
y = np.concatenate((x,y))
mer15_X[i,:] = pd.read_csv("../data/kmer_data.csv")
kmer "pep_blosum"] = kmer["pep"].apply(blosum62,args=(21,))
kmer["MHC_blosum"] = kmer["sequence"].apply(blosum62,args=(34,))
kmer[= np.empty((len(kmer),1155))
kmer_X for i in range(len(kmer)):
= kmer.iloc[i,5]
x = kmer.iloc[i,6]
y = np.concatenate((x,y))
kmer_X[i,:] = {} holder
Machine learning
Adaboost Classifier
from sklearn.model_selection import cross_validate
from sklearn.ensemble import AdaBoostClassifier
= []
cv_results = np.linspace(20, 220, 6)
space for i in space:
= cross_validate(AdaBoostClassifier(n_estimators=int(i)), X, Y, cv=3, scoring=scorer, n_jobs=-1,
cv_result =5)
verbose
cv_results.append(cv_result)
= [item['test_score'].mean() for item in cv_results]
y1 = [item['test_score'].std() for item in cv_results]
y1_e
= plt.subplot(1, 1, 1)
ax1 len(space)), y1, marker='o', markersize=5)
ax1.plot(np.arange(len(space)), [y1[i] - y1_e[i] for i in range(len(space))],
ax1.fill_between(np.arange(+ y1_e[i] for i in range(len(space))], alpha=0.2) #
[y1[i] len(space)))
ax1.set_xticks(np.arange('{0:.2f}'.format(i) for i in space])
ax1.set_xticklabels([
plot.show()= AdaBoostClassifier(n_estimators=180)
estimator 'Adaboost'] = method(estimator = estimator) holder[
KNN Classifier
# KNN Classifier
= []
cv_results from sklearn.neighbors import KNeighborsClassifier
= np.linspace(1,100,10)
space for i in space:
= cross_validate(KNeighborsClassifier(n_neighbors=int(i)),X,Y,cv=5,scoring=rmse,n_jobs=-1,verbose=5)
cv_result
cv_results.append(cv_result)= [item['test_score'].mean() for item in cv_results]
y1 = [item['test_score'].std() for item in cv_results]
y1_e = plt.subplot(1,1,1)
ax1 len(space)),y1,marker='o',markersize=5)
ax1.plot(np.arange(len(space)),[y1[i]-y1_e[i] for i in range(len(space))],[y1[i]+y1_e[i] for i in range(len(space))],alpha=0.2)
ax1.fill_between(np.arange(len(space)))
ax1.set_xticks(np.arange('{0:.2f}'.format(i) for i in space])
ax1.set_xticklabels([= KNeighborsClassifier(n_neighbors=1)
estimator 'KNN'] = method(estimator = estimator) holder[
Random forest Classifier
= []
cv_results from sklearn.ensemble import RandomForestClassifier
= np.linspace(1, 100, 20)
space for i in space:
= cross_validate(RandomForestClassifier(n_estimators=200,min_samples_leaf=int(i)), X, Y, cv=3, scoring=rmse, n_jobs=-1,
cv_result =5)
verbose
cv_results.append(cv_result)= [item['test_score'].mean() for item in cv_results]
y1 = [item['test_score'].std() for item in cv_results]
y1_e = plt.subplot(1, 1, 1)
ax1 len(space)), y1, marker='o', markersize=5)
ax1.plot(np.arange(len(space)), [y1[i] - y1_e[i] for i in range(len(space))],
ax1.fill_between(np.arange(+ y1_e[i] for i in range(len(space))], alpha=0.2)
[y1[i] len(space)))
ax1.set_xticks(np.arange('{0:.2f}'.format(i) for i in space])
ax1.set_xticklabels([= RandomForestClassifier(n_estimators=200,min_samples_leaf=1)
estimator 'RF'] = method(estimator = estimator) holder[
Support Vector Machine (SVC)
= []
cv_results from sklearn.svm import LinearSVC
= np.logspace(-3, 3, 7)
space for i in space:
= cross_validate(LinearSVC(C=i), X, Y, cv=3, scoring=rmse, n_jobs=-1,
cv_result =5)
verbose
cv_results.append(cv_result)= [item['test_score'].mean() for item in cv_results]
y1 = [item['test_score'].std() for item in cv_results]
y1_e = plt.subplot(1, 1, 1)
ax1 len(space)), y1, marker='o', markersize=5)
ax1.plot(np.arange(len(space)), [y1[i] - y1_e[i] for i in range(len(space))],
ax1.fill_between(np.arange(+ y1_e[i] for i in range(len(space))], alpha=0.2)
[y1[i] len(space)))
ax1.set_xticks(np.arange('{0:.2f}'.format(i) for i in space])
ax1.set_xticklabels([= LinearSVC(C=0.01)
estimator 'SVM'] = method(estimator = estimator)
holder["../data/another_machine_learning.npy",holder) np.save(
= np.load("../data/another_machine_learning.npy",allow_pickle=True)
holder = []
mean_15mer = []
mean_kmer = []
method_type "TLimmuno2")
method_type.append(0.7332)
mean_15mer.append(0.7372)
mean_kmer.append(for i in holder.tolist().keys():
= pd.DataFrame(holder.tolist()[i])
DF "mer15"].mean())
mean_15mer.append(DF["kmer"].mean())
mean_kmer.append(DF[
method_type.append(i)1] = "Adaboost" method_type[
Deep learning
#PBS
#python
import pandas as pd
import os
import tensorflow as tf
from sklearn.model_selection import train_test_split,KFold
import numpy as np
"CUDA_VISIBLE_DEVICES"] = "-1"
os.environ[from sklearn.metrics import auc,precision_recall_curve,roc_curve,confusion_matrix,average_precision_score
import matplotlib.pyplot as plt
import sys
"/public/slst/home/wanggsh/Project/MHCII/utils")
sys.path.append(from Blosum62 import blosum62
#Data process
= pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/IMM_data_random2.csv")
immuno_data "pep_blosum"] = immuno_data["Description"].apply(blosum62,args=(21,))
immuno_data["MHC_blosum"] = immuno_data["sequence"].apply(blosum62,args=(34,))
immuno_data[= np.empty((len(immuno_data),21,21))
peptide for i in range(len(immuno_data)):
= immuno_data["pep_blosum"][i].reshape((21,21))
peptide[i] #mhc
= np.empty((len(immuno_data),34,21))
MHC for i in range(len(immuno_data)):
= immuno_data["MHC_blosum"][i].reshape((34,21))
MHC[i] = immuno_data["label"].values
labels
= np.empty((len(immuno_data),21,21,1))
peptide_CNN for i in range(len(immuno_data)):
= immuno_data["pep_blosum"][i].reshape((21,21,1))
peptide_CNN[i,:,:,]
= np.empty((len(immuno_data),34,21,1))
MHC_CNN for i in range(len(immuno_data)):
= immuno_data["MHC_blosum"][i].reshape((34,21,1))
MHC_CNN[i] = immuno_data["label"].values
labels_CNN
def DNN():
= tf.keras.Input(shape = (21,21))
pep = tf.keras.Input(shape = (34,21))
MHC = tf.keras.layers.Flatten()(pep)
x = tf.keras.layers.Flatten()(MHC)
y = tf.keras.layers.concatenate([x,y])
combined
= tf.keras.layers.Dense(400,activation = "relu")(combined)
z = tf.keras.layers.Dense(200,activation = "relu")(z)
z = tf.keras.layers.Dense(1,activation = "sigmoid")(z)
z
= tf.keras.Model(inputs = [pep,MHC],outputs = z)
model
return model
= DNN()
DNN
def CNN():
= tf.keras.Input(shape = (21,21,1))
pep = tf.keras.Input(shape = (34,21,1))
MHC
= tf.keras.layers.Conv2D(filters=16, kernel_size=(2, 12))(pep) # 9
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.activations.relu(x)
x = tf.keras.layers.Conv2D(filters=32, kernel_size=(2, 1))(x) # 8
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.activations.relu(x)
x = tf.keras.layers.MaxPool2D(pool_size=(2, 1), strides=(2, 1))(x) # 4
x = tf.keras.layers.Flatten()(x)
x = tf.keras.Model(inputs=pep, outputs=x)
x
= tf.keras.layers.Conv2D(filters=16, kernel_size=(15, 12))(MHC) # 32
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.activations.relu(y)
y = tf.keras.layers.MaxPool2D(pool_size=(2, 1), strides=(2, 1))(y) # 16
y = tf.keras.layers.Conv2D(filters=32,kernel_size=(9,1))(y) # 8
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.activations.relu(y)
y = tf.keras.layers.MaxPool2D(pool_size=(2, 1),strides=(2,1))(y) # 4
y = tf.keras.layers.Flatten()(y)
y = tf.keras.Model(inputs=MHC,outputs=y)
y
= tf.keras.layers.concatenate([x.output,y.output])
combined = tf.keras.layers.Dense(128,activation='relu')(combined)
z = tf.keras.layers.Dense(1,activation='sigmoid')(z)
z
= tf.keras.Model(inputs=[pep,MHC],outputs=z)
model return model
= CNN()
CNN = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)#change
callback = tf.keras.losses.BinaryCrossentropy()
Loss = tf.keras.optimizers.Adam(learning_rate = 0.00008)
Optimizer = [tf.keras.metrics.AUC(),tf.keras.metrics.BinaryAccuracy(),tf.keras.metrics.AUC(curve = "PR"),tf.keras.metrics.Precision()] #add ROC&PR auc &PPV in metrics
Metrics = 200
Batch_size= 150
Epochs= 2
Verbose
compile(
DNN.= Loss,
loss = Optimizer,
optimizer = Metrics)
metrics = train_test_split(peptide,MHC,labels,test_size=0.1,random_state=202209,stratify=labels)
pep_Train,pep_Test,MHC_Train,MHC_Test,label_Train,label_Test = DNN.fit([pep_Train,MHC_Train],label_Train,
history =Batch_size,epochs=Epochs,
batch_size= ([pep_Test,MHC_Test],label_Test),
validation_data = Verbose,
verbose = callback)
callbacks
= pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/kmer_data.csv")
benchmark_Data "pep_blosum"] = benchmark_Data["pep"].apply(blosum62,args=(21,))
benchmark_Data["MHC_blosum"] = benchmark_Data["sequence"].apply(blosum62,args=(34,))
benchmark_Data[= np.empty((len(benchmark_Data),21,21))
bench_pep for i in range(len(benchmark_Data)):
= benchmark_Data["pep_blosum"][i].reshape((21,21))
bench_pep[i] #mhc
= np.empty((len(benchmark_Data),34,21))
bench_MHC for i in range(len(benchmark_Data)):
= benchmark_Data["MHC_blosum"][i].reshape((34,21))
bench_MHC[i] = DNN.predict([bench_pep,bench_MHC])
prediction
"prediction"] = prediction
benchmark_Data["pep_blosum")
benchmark_Data.pop("MHC_blosum")
benchmark_Data.pop("/public/slst/home/wanggsh/Data/MHCII_immuno/kmer_DNN_result.csv")
benchmark_Data.to_csv(compile(
CNN.= Loss,
loss = Optimizer,
optimizer = Metrics)
metrics = train_test_split(peptide_CNN,MHC_CNN,labels_CNN,test_size=0.1,random_state=202209,stratify=labels)
pep_Train,pep_Test,MHC_Train,MHC_Test,label_Train,label_Test = CNN.fit([pep_Train,MHC_Train],label_Train,
history =Batch_size,epochs=Epochs,
batch_size= ([pep_Test,MHC_Test],label_Test),
validation_data = Verbose,
verbose = callback)
callbacks
= np.empty((len(benchmark_Data),21,21,1))
bench_pep_CNN for i in range(len(benchmark_Data)):
= benchmark_Data["pep_blosum"][i].reshape((21,21,1))
bench_pep_CNN[i] #mhc
= np.empty((len(benchmark_Data),34,21,1))
bench_MHC_CNN for i in range(len(benchmark_Data)):
= benchmark_Data["MHC_blosum"][i].reshape((34,21,1))
bench_MHC_CNN[i] = CNN.predict([bench_pep_CNN,bench_MHC_CNN])
prediction "prediction"] = prediction
benchmark_Data["pep_blosum")
benchmark_Data.pop("MHC_blosum")
benchmark_Data.pop("/public/slst/home/wanggsh/Data/MHCII_immuno/kmer_CNN_result.csv") benchmark_Data.to_csv(
#CNN result
= pd.read_csv("../data/peptide.csv")
peptide = pd.read_csv("../data/15mer_CNN_result.csv")
result = IMM_process(result,peptide)
result_15mer_CNN = auc_pr(result_15mer_CNN,"Immunogenicity","prediction")
result "CNN")
method_type.append(4])
mean_15mer.append(result[= pd.read_csv("../data/kmer_CNN_result.csv")
result = IMM_process(result,peptide)
result_15mer = auc_pr(result_15mer,"Immunogenicity","prediction")
result 4]) mean_kmer.append(result[
#DNN result
= pd.read_csv("../data/15mer_DNN_result.csv")
result = IMM_process(result,peptide)
result_15mer_DNN = auc_pr(result_15mer_DNN,"Immunogenicity","prediction")
result "DNN")
method_type.append(4])
mean_15mer.append(result[= pd.read_csv("../data/kmer_DNN_result.csv")
result = IMM_process(result,peptide)
result_15mer = auc_pr(result_15mer,"Immunogenicity","prediction")
result 4]) mean_kmer.append(result[
= pd.DataFrame({"method":method_type,"mer15":mean_15mer,"kmer":mean_kmer}) DF
Result
import seaborn as sns
from matplotlib import cm
= 10
n = np.arange(0, n, step=.5)
x = x ** 2
y = plt.Normalize(y.min(), y.max())
norm = norm(y)
norm_y = cm.get_cmap(name='viridis')
map_vir = map_vir(norm_y)
color = plt.subplots(figsize = (4,6))
fig,ax = DF["method"],height = DF["mer15"],width =0.95,color = color[10:])
plt.bar(x #> <BarContainer object of 7 artists>
"")
ax.set_xlabel(#> Text(0.5, 0, '')
"AUC",fontsize = 14)
ax.set_ylabel(#> Text(0, 0.5, 'AUC')
0.5,0.75])
ax.set_ylim([#> (0.5, 0.75)
= 90)
plt.xticks(rotation #> ([0, 1, 2, 3, 4, 5, 6], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
plt.tight_layout()"top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["../figure/different_methods_15mer.pdf",dpi = 300,bbox_inches='tight',transparent=True)
plt.savefig( plt.show()
= plt.subplots(figsize = (4,6))
fig,ax = DF["method"],height = DF["kmer"],width =0.95,color = color[10:])
plt.bar(x #> <BarContainer object of 7 artists>
"")
ax.set_xlabel(#> Text(0.5, 0, '')
"AUC",fontsize = 14)
ax.set_ylabel(#> Text(0, 0.5, 'AUC')
0.5,0.75])
ax.set_ylim([#> (0.5, 0.75)
= 90)
plt.xticks(rotation #> ([0, 1, 2, 3, 4, 5, 6], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
plt.tight_layout()"top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["../figure/different_methods_kmer.pdf",dpi = 300,bbox_inches='tight',transparent=True)
plt.savefig( plt.show()
TLimmuno2 outperforms DNN, CNN and traditional machine learning methods, proving that our model structure can better fit the problem of immunogenicity prediction.
TLimmuno2 can learn important features that determine immunogenicity
Immunotherapy
To further validate TLimmuno2 and demonstrate the value of our model as a knowledge discovery tool, we evaluated the physiological relevance of TLimmuno2’s prediction. Most current cancer vaccine platforms preferentially screen candidate neoepitopes for vaccine production by selecting highly expressed candidate neoepitopes with high binding affinity to HLA alleles. However, despite rigorous candidate selection, many vaccine peptides do not elicit T-cell responses after vaccination. Therefore, we compare the performance of TLimmuno2 and NetMHCIIpan by using personalized melanoma neoepitopes with corresponding immune response data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
'pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42
mpl.rcParams['font.family':'Arial'})
plt.rcParams.update({from sklearn.metrics import accuracy_score, recall_score, f1_score, precision_score
def IMM_process_rank(Data,peptide):
= Data["pep_ID"].unique()
sam_pep_ID = []
res = []
p_ID "length"] = Data["pep"].map(len)
Data[for a in sam_pep_ID:
= []
Mean = Data[Data["pep_ID"] == a]
ID for i in ID["HLA"].unique():
= ID[ID["HLA"] == i]
Data_HLA = []
prediction for l in Data_HLA["length"].unique():
= Data_HLA[Data_HLA["length"] == l]
Data_len = Data_len["Rank"].min()
pre
prediction.append(pre)= np.min(prediction)
x_mean
Mean.append(x_mean)min(Mean))
res.append(
p_ID.append(a)= pd.DataFrame({"pep_ID":p_ID,"prediction":res})
result = result.merge(peptide,how = "inner",on = "pep_ID" )
result return result
def netMHCIIpan_process(result,allele,ori_sorted,peptide):
#combine two type allele
= pd.merge(result,allele,left_on="allele",right_on="Allele Name")
result = result.sort_values(by=["peptide","HLA"]).reset_index(drop = True)
result_sorted #combine with ori data
= pd.merge(result_sorted,ori_sorted[["pep","pep_ID"]],left_on = "peptide",right_on ="pep")
result_combined #result_combined["pep_length"] = result_combined["peptide"].map(len)
= result_combined["pep_ID"].unique()
sam_pep_ID = []
res = []
p_ID for a in sam_pep_ID:
= []
Mean = result_combined[result_combined["pep_ID"] == a]
ID #get Mean of all splited pep and Max of all HLA
for i in ID["Allele Name"].unique():
= ID[ID["Allele Name"] == i]
Data_HLA = []
prediction for l in Data_HLA["pep_length"].unique():
= Data_HLA[Data_HLA["pep_length"] == l]
Data_len = Data_len["percentile_rank"].min()
pre
prediction.append(pre)= np.min(prediction)
x min(x))
res.append(np.
p_ID.append(a) = pd.DataFrame({"pep_ID":p_ID,"prediction":res})
final_result = final_result.merge(peptide,how = "inner",on = "pep_ID" )
final_result return final_result
= ["O.1","O.2","O.3","O.4","O.5","O.6"]
patient_id = pd.read_csv("../data/DF15.csv") DF15
= pd.read_csv("../data/15mer_rank_result.csv")
result_15 = pd.read_csv("../data/peptide.csv")
peptide= IMM_process_rank(result_15,peptide)
Rank_15mer = DF15.sort_values(by=["pep","HLA"]).reset_index(drop = True)
DF15_sorted = pd.read_csv("../data/allele.csv")
allele = pd.read_csv("../data/netMHCIIpan_15_mer_ba.csv")
netMHCIIpan_ba "pep_length"] = netMHCIIpan_ba["peptide"].map(len)
netMHCIIpan_ba[= netMHCIIpan_process(netMHCIIpan_ba,allele,DF15_sorted,peptide)
net_15mer_ba = pd.read_csv("../data/netMHCIIpan_15_mer_el.csv")
netMHCIIpan_el "pep_length"] = netMHCIIpan_el["peptide"].map(len)
netMHCIIpan_el[= netMHCIIpan_process(netMHCIIpan_el,allele,DF15_sorted,peptide)
net_15mer_el = net_15mer_ba[net_15mer_ba["Patient_ID"].map(lambda x : x in patient_id)]
ott_net_ba = net_15mer_el[net_15mer_el["Patient_ID"].map(lambda x : x in patient_id)]
ott_net_el = Rank_15mer[Rank_15mer["Patient_ID"].map(lambda x : x in patient_id)] ott_imm
def call_value(data,net = True):
if net:
= data["prediction"].map(lambda x : 1 if x <=2 else 0)
preds else:
= data["prediction"].map(lambda x : 1 if x <=0.02 else 0)
preds = data["Immunogenicity"]
trues = accuracy_score(trues, preds)
acc = precision_score(trues, preds)
p = recall_score(trues, preds)
r = f1_score(trues, preds)
f1
return acc,p,r,f1
#top20&top50
from collections import Counter
= ott_imm.sort_values(by="prediction")
ott_imm_sorted = ott_net_el.sort_values(by="prediction")
ott_net_el_sorted = ott_net_ba.sort_values(by="prediction")
ott_net_ba_sorted "pre_label"] = ott_imm_sorted["prediction"].map(lambda x : 1 if x <=0.02 else 0)
ott_imm_sorted["pre_label"] = ott_net_el_sorted["prediction"].map(lambda x : 1 if x <=2 else 0)
ott_net_el_sorted["pre_label"] = ott_net_ba_sorted["prediction"].map(lambda x : 1 if x <=2 else 0)
ott_net_ba_sorted[def call_top(x):
= x[0:20]
x20 = Counter(x20["Immunogenicity"] == x20["pre_label"])[True]
top20 = x[0:50]
x50 = Counter(x50["Immunogenicity"] == x50["pre_label"])[True]
top50
return top20,top50
= call_top(ott_imm_sorted)
imm20,imm50 = call_top(ott_net_ba_sorted)
net_ba20,net_ba50 = call_top(ott_net_el_sorted)
net_el20,net_el50 = [imm20,net_ba20,net_el20]
top20 = [imm50,net_ba50,net_el50] top50
= call_value(ott_imm,False)
imm = call_value(ott_net_ba)
net_ba = call_value(ott_net_el)
net_el = pd.DataFrame([imm,net_ba,net_el], index = ["TLimmuno2","NetMHCIIpan_BA","NetMHCIIpan_EL"],columns = ["Accuracy","Precision","Recall","F1-score"])
data "Method"] = data.index
data[= data.melt(id_vars = "Method")
data = plt.subplots(figsize = (6,4))
fig,ax = "variable",y = "value", hue = "Method",data = data,palette = "Set2")
sns.barplot(x #> <AxesSubplot:xlabel='variable', ylabel='value'>
"")
ax.set_xlabel(#> Text(0.5, 0, '')
"Score",fontsize = 14)
ax.set_ylabel(#> Text(0, 0.5, 'Score')
"top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["../figure/neoantigen_vaccine.pdf",dpi = 300,transparent=True)
plt.savefig( plt.show()
= pd.DataFrame([top20,top50], columns = ["TLimmuno2","netMHCIIpan_ba","netMHCIIpan_el"],index = ["Top20","Top50"])
data "Method"] = data.index
data[= data.melt(id_vars = "Method")
data = plt.subplots(figsize = (3,4))
fig,ax = "Method",y = "value", hue = "variable",data = data,palette = "Set2")
sns.barplot(x #> <AxesSubplot:xlabel='Method', ylabel='value'>
"")
ax.set_xlabel(#> Text(0.5, 0, '')
ax.get_legend().remove()"Immunogenic peptides",fontsize = 14)
ax.set_ylabel(#> Text(0, 0.5, 'Immunogenic peptides')
"top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["../figure/neoantigen_top20&50.pdf",dpi = 300,bbox_inches='tight',transparent=True)
plt.savefig( plt.show()
As shown in figure, TLimmuno2 performs better than the binding affinity model in accuracy, recall and F1 score by using 2% rank as a cut-off. The TopK metric (K=20 or 50, the number of positive samples in the top K samples with the lowest predicted ranks) is also used to evaluate the performance of different models, and TLimmuno2 shows improved performance compared with netMHCIIpan BA in these TopK metrics.
Immunoediting
The interactions between immune cells and tumor cells are reflected as immunoediting, which could mediate the negative selection of DNA alterations encoding high immunogenicity. The status of MHC class II neoantigen mediated negative selection in cancer evolution is still unknown. We next applied the TLimmuno2 to predict the immunogenicity of neopeptides arising from missense mutation in TCGA samples.
We then detect the immunoediting signals in TCGA samples by using the previously published CCF enrichment score (ESccf) method
we download 33 cancer type somatic mutation file form ucsx xena, we focus on “MuTect2 Variant Aggregation and Masking”. here the detail of dataprocess:
#R
= read_csv("../data/cancer_type.csv")
TCGA_data #> Rows: 33 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): Cancer
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
::datatable(TCGA_data,rownames = FALSE) DT
- from
maf
tomut_pep
:
#python
#PBS_batch
import pandas as pd
import numpy as np
import os
import re
import sys
file = sys.argv[1]
= "/public/slst/home/wanggsh/TCGA-process/Maf"
maf_file_path def maf2vcf(MAF_file,tumor_file_path):
for i in MAF_file["Tumor_Sample_Barcode"].unique():
= MAF_file[MAF_file["Tumor_Sample_Barcode"] == i]
split_maf_file "{0}/Tmp/{1}.maf".format(tumor_file_path,i),sep="\t",index=False)
split_maf_file.to_csv(# maf2vcf comd
="""
comd1 perl /public/home/liuxs/anaconda3/envs/neo/bin/maf2vcf.pl \
--input-maf "{0}/Tmp/{1}.maf" \
--output-dir {0}/VCF \
--output-vcf {0}/VCF/{1}.vcf \
--ref-fasta /public/home/liuxs/biodata/reference/genome/hg38/hg38.fa
""".format(tumor_file_path,i)
os.system(comd1)def vcf2pep(tumor_file_path):
for i in os.listdir("{0}/VCF".format(tumor_file_path)):
print("Get pep from vcf in sample {0}".format(i))
"Rscript /public/slst/home/wanggsh/TCGA-process/vcf2pep.R {0}/VCF/{1} {0}/Pep/{1}.csv".format(tumor_file_path,i))
os.system(= pd.read_table("{0}/{1}".format(maf_file_path,file),comment="#")
MAF_file = re.split("\.",file)[0]
tumor_type = "/public/slst/home/wanggsh/TCGA-process/Res/{0}".format(tumor_type)
file_path "mkdir {0}".format(file_path))
os.system("mkdir {0}/Tmp".format(file_path))
os.system("mkdir {0}/VCF".format(file_path))
os.system(print("maf to vcf in sample {0}".format(file))
maf2vcf(MAF_file,file_path)"rm {0}/VCF/*.tsv".format(file_path))
os.system("mkdir {0}/Pep".format(file_path))
os.system( vcf2pep(file_path)
here are vcf2pep.R
script, we only choose missense
mutation to get peptide sequence.
#vcf2pep.R
#code from WuTao
rary(tidyverse)
library(stringr)
= commandArgs(trailingOnly = TRUE)
args
print(args)
= "/public/slst/home/wanggsh/biosoft/annovar"
annovar_path
<- function(annovar_path,vcf_path,out_file,
vcf2annova need_allsamples=TRUE,need_samples){
<- match.arg(as.character(need_allsamples),choices = c("TRUE","FALSE"))
need_allsamples
if (need_allsamples){
<- paste0(annovar_path,"/convert2annovar.pl -format vcf4 ",vcf_path," -outfile ",out_file," -allsample -withfreq")
commond system(command = commond)
else{
}<- paste0(annovar_path,"/convert2annovar.pl -format vcf4 ",vcf_path," -outfile ",out_file," -allsample")
commond system(command = commond)
<- paste0("cat ",out_file,".",need_samples,".avinput >> ",out_file)
need_samples_file system(need_samples_file)
}
}<- function(annovar_path,vcf_path,
vcf2pep genome_version=c("hg19","hg38"),need_allsamples=TRUE,need_samples,num_thread){
<- match.arg(as.character(need_allsamples),choices = c("TRUE","FALSE"))
need_allsamples <- match.arg(genome_version)
genome_version <- tempfile()
temp_file <- tempdir()
temp_dir file.create(temp_file)
vcf2annova(annovar_path = annovar_path,vcf_path = vcf_path,out_file = temp_file,need_allsamples = need_allsamples,
need_samples = need_samples)
<- read.table(temp_file)
mut <- mut[,1:5]
mut colnames(mut) <- c("chr","start","end","ref","alt")
<- paste0(annovar_path,"/table_annovar.pl ",temp_file," ",annovar_path,"/humandb/",
comm_annotate " -build ",genome_version,
" --outfile ",temp_dir,"/myanno",
' -protocol refGene -operation g --codingarg "-includesnp -onlyAltering"',
" --thread ",num_thread)
system(comm_annotate)
<- seqinr::read.fasta(file = paste0(temp_dir,"/myanno.refGene.fa"),
dt seqtype = "AA",as.string = TRUE,whole.header = T)
<- names(dt)[seq(2,length(dt),by=2)]
variants <- variants[grepl("protein-altering",variants)]
variants <- stringr::str_extract_all(variants,"p[.].+ protein-altering") %>% gsub(" protein-altering","",.)
pos_alter <- stringr::str_extract_all(variants,"c[.].+ p[.]") %>% gsub(" p[.]","",.)
cdna <- stringr::str_extract_all(variants,"line.+ NM_") %>% gsub(" NM_","",.) %>%
lines gsub("line","",.) %>% as.numeric()
<- stringr::str_extract_all(variants,"line.+ NM_[0-9]+") %>% gsub("line.+ ","",.)##extract transcripts
ENST
<- mut[lines,]
mut_need <- dt[sapply(variants,function(x){which(names(dt)==x)}) %>% unname()] %>% as.character()
pep_seq_mt <- dt[(sapply(variants,function(x){which(names(dt)==x)}) %>% unname())-1] %>% as.character()
pep_seq_wt $seq_wt <- pep_seq_wt
mut_need$seq_mt <- pep_seq_mt
mut_need$pos_alter <- pos_alter
mut_need$cdna <- cdna
mut_need$transcript <- ENST
mut_need<- c(list.files(temp_dir,pattern = gsub(paste0(temp_dir,"/"),"",temp_file),full.names = T),
files_exist list.files(temp_dir,pattern = "myanno",full.names = T))##remove all temp files
file.remove(files_exist)
return(mut_need)
}
<- function(seq,pos,len,indel=FALSE){
extractSeq <- match.arg(as.character(indel),choices = c("TRUE","FALSE"))
indel if (pos < len){
<- substr(seq,1,(pos+len-1))
ext_seq else if ((nchar(seq)-pos)<(len-1)){
}<- substr(seq,(pos-len+1),nchar(seq))
ext_seq else{
}<- substr(seq,(pos-len+1),(pos+len-1))
ext_seq
}
if (indel){
if (pos < len){
<- substr(seq,1,nchar(seq))
ext_seq else{
}<- substr(seq,(pos-len+1),nchar(seq))
ext_seq
}
}return(ext_seq)
}<- function(annovar_path,vcf_path,
vcf2seq genome_version=c("hg19","hg38"),need_allsamples=TRUE,need_samples,len,num_thread){
<- vcf2pep(annovar_path = annovar_path,vcf_path = vcf_path,
pep genome_version = genome_version,need_allsamples = need_allsamples,
need_samples = need_samples,num_thread=num_thread)
<- pep %>%
pep ::mutate(pos = stringr::str_extract(pos_alter,"[0-9]+") %>% as.numeric()) %>%
dplyr::rowwise() %>%
dplyr::mutate(indel=ifelse(ref != "-" & alt != "-" & nchar(ref) == nchar(alt),"FALSE","TRUE")) %>%
dplyras.data.frame()
<- mapply(extractSeq,seq=pep$seq_mt,pos=as.numeric(pep$pos),
ext_seqs_mt len=as.numeric(len),indel=pep$indel) %>% unname()
<- mapply(extractSeq,seq=pep$seq_wt,pos=as.numeric(pep$pos),
ext_seqs_wt len=as.numeric(len),indel=pep$indel) %>% unname()
##remove the last star
<- substr(ext_seqs_mt,nchar(ext_seqs_mt),nchar(ext_seqs_mt))=="*"
a <- substr(ext_seqs_wt,nchar(ext_seqs_wt),nchar(ext_seqs_wt))=="*"
b <- gsub("\\*","",ext_seqs_mt[a])
ext_seqs_mt[a] <- gsub("\\*","",ext_seqs_wt[b])
ext_seqs_wt[b]
$ext_seqs_wt <- ext_seqs_wt
pep$ext_seqs_mt <- ext_seqs_mt
pepreturn(pep)
}#vcf_path = "/public/slst/home/wanggsh/TCGA-process/Res/TCGA_ACC/VCF/TCGA-OR-A5J1-01A-11D-A29I-10.vcf"
#pep_file = "/public/slst/home/wanggsh/TCGA-process/Res/TCGA_ACC/Pep/result.csv"
= args[1]
vcf_path = args[2]
pep_file = unlist(strsplit(vcf_path,split ="/"))
tumor_sample_barcode = tumor_sample_barcode[length(tumor_sample_barcode)]
tumor_sample_barcode = unlist(strsplit(tumor_sample_barcode,split ="\\."))[1]
tumor_sample_barcode <- vcf2seq(annovar_path = annovar_path,vcf_path=vcf_path,genome_version = "hg38",need_allsamples =TRUE,len = 15,num_thread=1)
pep
= pep[pep$indel== FALSE,]
pep = select(pep,-c("seq_wt","seq_mt"))
pep $tumor_sample_barcode = tumor_sample_barcode
pepwrite_csv(pep,pep_file,col_names = FALSE)
- split mutated peptides into 15mer subsequence:
#python
#PBS
import pandas as pd
import os
import re
= "/public/slst/home/wanggsh/TCGA-process/Res"
file_path = os.listdir(file_path)
tumor_type = ["chr","start","end","ref","alt","pos_alter","cdna","transcript","pos","indel","ext_seqs_wt","ext_seqs_mt","tumor_sample_barcode"]
colnames def mer15(Pep):
= []
P = 15
Length for i in range(len(Pep) - Length +1):
= Pep[i:i+Length]
pep
P.append(pep)return P
for i in tumor_type:
= pd.DataFrame()
DF = "/public/slst/home/wanggsh/TCGA-process/Res/{0}/Pep".format(i)
pep_file_path = os.listdir(pep_file_path)
pep_file for x in pep_file:
= re.split("\.",x)[0]
tumor_barcode = pd.read_csv("{0}/{1}".format(pep_file_path,x),names = colnames)
pep = pep[["chr","start","end","ref","alt","ext_seqs_mt","tumor_sample_barcode"]]
pep = pep.drop_duplicates()
pep "length"] = pep["ext_seqs_mt"].map(len)
pep[= pep.query("length >=15")
pep = []
Pep15 = []
Pep_num = []
Chrom = []
Start = []
End = []
Refer = []
Alt = pep.reset_index(drop=True)
pep for y in range(len(pep)):
= mer15(pep["ext_seqs_mt"][y])
res = ["Pep{0}".format(y+1)] *len(res)
pep_num = [pep["chr"][y]] *len(res)
chrom = [pep["start"][y]] *len(res)
start = [pep["end"][y]] *len(res)
end = [pep["ref"][y]] *len(res)
refer = [pep["alt"][y]] *len(res)
alt
Pep15.extend(res)
Pep_num.extend(pep_num)
Chrom.extend(chrom)
Start.extend(start)
End.extend(end)
Refer.extend(refer)
Alt.extend(alt)= pd.DataFrame({"chr":Chrom,"start":Start,"end":End,"ref":Refer,"alt":Alt,"ext_seqs_mt":Pep15,"Pep_num":Pep_num})
df "tumor_sample_barcode"] = tumor_barcode
df[= pd.concat([DF,df])
DF print("{0} Finish".format(i))
= DF.drop_duplicates()
DF "/public/slst/home/wanggsh/TCGA-process/mut_pep2/{0}_pep.csv".format(i)) DF.to_csv(
- predicte peptide score by using TLimmuno2. HLA typing data was downloaded from Benchmarking HLA genotyping and clarifying HLA impact on survival in tumor immunotherapy study and transformed to the format required by the neoantigen prediction tools. we predicted HLA-DR immunogenicity score of mutated peptides’ subsequence.
#python
#PBS
import pandas as pd
import numpy as np
import tensorflow as tf
import sys
"/public/slst/home/wanggsh/Project/MHCII/utils")
sys.path.append(from Blosum62 import blosum62
import os
import re
"CUDA_VISIBLE_DEVICES"]="-1"
os.environ[#load_model
#strategy = tf.distribute.MirroredStrategy() #use mutiple GPU
= tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
BA_model = tf.keras.models.Model(inputs = BA_model.input,outputs = BA_model.layers[-2].output)
BAmodel = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/MHCII_immuno/model")
IMM #mutant pep file
="/public/slst/home/wanggsh/TCGA-process/mut_pep2"
mut_pep_file_path
= pd.read_csv("/public/slst/home/wanggsh/TCGA-process/TCGA-MHCII-allele.csv")
MHC_II = pd.read_table("/public/slst/home/wanggsh/Data/MHCII/pseudosequence.2016.all.X.dat",names=["hla","sequence"])
pseudo_seq = sys.argv[1]
i = re.sub("_pep.csv","",i)
tumor print("{0} Start".format(tumor))
= pd.read_csv("{0}/{1}".format(mut_pep_file_path,i))
Data = MHC_II[MHC_II["project_id"] == tumor]
MHC_II_smal "Sample_Barcode"] = Data["tumor_sample_barcode"].map(lambda x : x[0:12])
Data[= pd.merge(Data,MHC_II_smal[["Sample_Barcode","hla"]])
Data = pd.merge(Data,pseudo_seq)
Data "pep_blosum"] = Data["ext_seqs_mt"].apply(blosum62,args=(21,))
Data["MHC_blosum"] = Data["sequence"].apply(blosum62,args=(34,))
Data[= np.empty((len(Data),21,21))
peptide for i in range(len(Data)):
= Data["pep_blosum"][i].reshape((21,21))
peptide[i]
= np.empty((len(Data),34,21))
MHC for i in range(len(Data)):
= Data["MHC_blosum"][i].reshape((34,21))
MHC[i]
= BAmodel.predict([peptide,MHC])
BA = IMM.predict([peptide,MHC,BA])
IMM_result "prediction"] = IMM_result
Data["Unnamed: 0")
Data.pop("sequence")
Data.pop("pep_blosum")
Data.pop("MHC_blosum")
Data.pop("/public/slst/home/wanggsh/TCGA-process/Prediction/IMM2/{0}.csv".format(tumor))
Data.to_csv(print("{0} Finish".format(tumor))
4.Background peptide predict
#PBS
#python
import pandas as pd
import sys
import numpy as np
"/public/slst/home/wanggsh/Project/MHCII/utils")
sys.path.append(from Blosum62 import blosum62
import os
#os.environ["CUDA_VISIBLE_DEVICES"]="-1"
import tensorflow as tf
= tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
BA_model = tf.keras.models.Model(inputs = BA_model.input,outputs = BA_model.layers[-2].output)
BAmodel = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/MHCII_immuno/model")
IMM = pd.read_csv("/public/slst/home/wanggsh/TCGA-process/TCGA-MHCII-allele.csv")
TCGA_allele = pd.read_feather("/public/slst/home/wanggsh/Data/pseudo_blosum62.feather")
pseudo_seq2 = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/IMM_bg_pep.csv")
IMM_bg_pep "pep_blosum"] = IMM_bg_pep["pep"].apply(blosum62,args=(21,))
IMM_bg_pep[= np.intersect1d(TCGA_allele["hla"].values,pseudo_seq2["MHC"].values)
inteset #np.setdiff1d(TCGA_allele["hla"].values,pseudo_seq2["MHC"].values)
for i in inteset:
= pd.DataFrame()
DF "MHC"] = i
IMM_bg_pep[= pd.merge(IMM_bg_pep,pseudo_seq2[["MHC","MHC_blosum"]])
x = pd.concat([DF,x])
DF = np.empty((len(DF),21,21))
peptide for z in range(len(DF)):
= DF["pep_blosum"][z].reshape((21,21))
peptide[z] = np.empty((len(DF),34,21))
MHC for z in range(len(DF)):
= x["MHC_blosum"][z].reshape((34,21))
MHC[z] = BAmodel.predict([peptide,MHC])
BA = IMM.predict([peptide,MHC,BA])
IMM_result "Prediction"] = IMM_result
DF[= DF.drop(["Unnamed: 0","pep_blosum","MHC_blosum"],axis=1)
DF "/public/slst/home/wanggsh/TCGA-process/back_ground_pep/{0}.csv".format(i)) DF.to_csv(
5.Ranking: use background peptides to do rank
#PBS_batch
#python
import pandas as pd
import numpy as np
import tensorflow as tf
import sys
"/public/slst/home/wanggsh/Project/MHCII/utils")
sys.path.append(from Blosum62 import blosum62
import os
import re
"CUDA_VISIBLE_DEVICES"]="-1"
os.environ[
= "/public/slst/home/wanggsh/TCGA-process/Prediction/IMM"
IMM_resul_path = "/public/slst/home/wanggsh/TCGA-process/back_ground_pep"
background_path file = os.listdir(IMM_resul_path)
= sys.argv[1]
arg = pd.read_csv("{}/{}".format(IMM_resul_path,arg))
res = pd.DataFrame()
DF for y in res["hla"].unique():
= res[res["hla"]==y]
res_all = pd.read_csv("{}/{}.csv".format(background_path,y))
back_ground = back_ground["Prediction"].values.tolist()
back_ground = []
Rank for I in res_all["prediction"].values:
back_ground.append(I)= 1-(sorted(back_ground).index(back_ground[-1])+1)/90001
rank
Rank.append(rank)
back_ground.pop()"Rank"] = Rank
res_all[= pd.concat([DF,res_all])
DF "/public/slst/home/wanggsh/TCGA-process/Prediction/IMM_rank/{0}_rank.csv".format(arg)) DF.to_csv(
#python
#PBS
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import numpy as np
= "/public/slst/home/wanggsh/TCGA-process/Prediction/IMM_rank2"
file_path file = os.listdir(file_path)
= []
Barcode = []
Rank for i in file:
= pd.read_csv("{}/{}".format(file_path,i))
Data = re.split(".csv_rank.csv",i)[0]
tumor
Barcode.append(tumor)"Rank"])
Rank.append(Data[= np.array(Rank)
Rank "/public/slst/home/wanggsh/TCGA-process/Prediction/TCGA_rank.npy",Rank)
np.save(= []
barcode for i in Barcode:
= re.split("TCGA_",i)[1]
x
barcode.append(x)= plt.subplots(figsize = (12,6))
fig,axs =barcode,showcaps=False)
plt.boxplot(Rank,labels= 270)
plt.xticks(rotation 0.02,colors="red",xmin=0.9,xmax=33.1)
plt.hlines("Prediction rank")
plt.ylabel("/public/slst/home/wanggsh/TCGA-process/TCGA-rank.png",dpi = 300,bbox_inches='tight',transparent=True)
plt.savefig(
= []
Barcode = []
Prediction for i in file:
= pd.read_csv("{}/{}".format(file_path,i))
Data = re.split(".csv_rank.csv",i)[0]
tumor
Barcode.append(tumor)"prediction"])
Prediction.append(Data[= np.array(Prediction)
Prediction "/public/slst/home/wanggsh/TCGA-process/Prediction/TCGA_prediction.npy",Prediction)
np.save(
= plt.subplots(figsize = (12,6))
fig,axs =barcode,flierprops = dict(marker='o', markersize=2,linestyle='none'),showcaps=False,showfliers = False)
plt.boxplot(Prediction,labels= 270)
plt.xticks(rotation "Prediction score")
plt.ylabel("/public/slst/home/wanggsh/TCGA-process/TCGA-prediction.png",dpi = 300,bbox_inches='tight',transparent=True)
plt.savefig(
= plt.subplots(figsize = (12,6))
fig,axs =barcode,flierprops = dict(marker='o', markersize=2,linestyle='none'),showcaps=False)
plt.boxplot(Prediction,labels= 270)
plt.xticks(rotation "Prediction score")
plt.ylabel("/public/slst/home/wanggsh/TCGA-process/TCGA-prediction-with-Discretevalues.png",dpi = 300,bbox_inches='tight',transparent=True) plt.savefig(

TCGA-Rank

TCGA-prediction-score-withoutdiscretevalue

TCGA-prediction-score
By comparing prediction score and rank value in each TCGA samples, we found that most scores are less than 0.1 and the median rank value is around 0.5. It suggest that most mutant peptides are non-immunogenic.
Compare the different of MHCI BA ,MHCII BA and Immunogenicity
To evaluate the extent to which the TLimmuno2 learned a signal that is also learned by the binding affinity predictor, we measured the Pearson correlation between immunogenicity rank and binding affinity rank.
- combine MHC I binding affinity rank value and MHC II immunogenicity rank value
setwd("~/Project/MHCII/Data/TCGA-MHC-I/")
<- list.files("/home/data/NeoPredPipe_results/",full.names = T,pattern = "batch")
files = tibble()
final_result for (i in 1:100) {
<- data.table::fread(files[i],sep = "\t",fill = TRUE)
tt <- tt %>% dplyr::select(c(1,4:7,11,12,22,24))
tt colnames(tt) <- c("sample","chr","position","ref","alt","hla","pep","rank_el","rank_ba")
#tt$hla <-str_replace(tt$hla,"\\:","")
= c()
sample = c()
chr = c()
pos = c()
Rank = c()
HLA for (x in unique(tt$sample)){
= tt[tt$sample == x, ]
tt_sample for (y in unique(tt_sample$chr)){
= tt_sample[tt_sample$chr == y,]
tt_chr for (z in unique(tt_chr$position)){
= tt_chr[tt_chr$position == z,]
tt_pos = c()
prediction for (h in unique(tt_chr$hla)){
= tt_pos[tt_pos$hla == h,]
tt_hla = min(tt_hla$rank_ba)
pre = c(prediction,pre)
prediction
}= min(prediction)
rank = c(sample,x)
sample = c(chr,y)
chr = c(pos,z)
pos = c(HLA,h)
HLA = c(Rank,rank)
Rank
}
}
}= tibble(sample,chr,pos,HLA,Rank)
res = rbind(final_result,res)
final_result message("complete ", i)
}saveRDS(final_result,"MHCI_BA.rds")
#MHCII_BA
<- list.files("~/Project/MHCII/Data/TCGA/neoantigen_es",full.names = T)
files <- lapply(files,
res function(x){
read_csv(x)
})<- bind_rows(res)
MHC_II_imm <- list.files("~/Project/MHCII/Data/TCGA/net_neoantigen_es",full.names = T)
files
<- lapply(files,
res function(x){
read_csv(x)
})<- bind_rows(res)
MHC_II_BA <- readRDS("MHCI_BA.rds")
MHC_I_BA $sample <- str_sub(MHC_I_BA$sample,end = 12)
MHC_I_BA<- MHC_I_BA %>% select(sample,chr,pos,Rank)%>% rename("MHCI_BA_Rank" = "Rank") %>% rename("sample_barcode" = "sample")
MHC_I_BA <- MHC_II_imm %>% select(Sample_Barcode,chr,start,Result)
MHC_II_imm colnames(MHC_II_imm) <- c("sample_barcode","chr","pos","MHCII_IMM_Rank")
<- MHC_II_BA %>% select(sample,chr,start,Result)
MHC_II_BA colnames(MHC_II_BA) <- c("sample_barcode","chr","pos","MHCII_BA_Rank")
<- left_join(MHC_I_BA,MHC_II_imm)
MHCIBA_IMM <- left_join(MHC_II_BA,MHC_II_imm)
MHCIIBA_IMM $MHCII_BA_Rank <- MHCIIBA_IMM$MHCII_BA_Rank/100
MHCIIBA_IMMsaveRDS(MHCIBA_IMM,"MHCIBA_IMM.rds")
saveRDS(MHCIIBA_IMM,"MHCIIBA_IMM.rds")
- Calculating the Pearson correlation coefficient between Binding affinity and immunogenicity.
<- readRDS("../data/MHCIBA_IMM.rds")
MHCIBA_IMM <- readRDS("../data/MHCIIBA_IMM.rds")
MHCIIBA_IMM <- na.omit(MHCIBA_IMM)
MHCIBA_IMM <- na.omit(MHCIIBA_IMM)
MHCIIBA_IMM <- rename(MHCIBA_IMM,BA_Rank = MHCI_BA_Rank)
MHCIBA_IMM <- rename(MHCIIBA_IMM,BA_Rank = MHCII_BA_Rank)
MHCIIBA_IMM <- function(data) {
split_sample = c()
r = c()
p for (i in unique(data$sample_barcode)) {
<- data[data$sample_barcode == i,]
sam if (nrow(sam)<=2) {
next
}
= cor.test(sam$BA_Rank/100,sam$IMM_Rank)
cor_test <- c(r,cor_test$estimate)
r <- c(p,cor_test$p.value)
p
}<- r*r
r2
<- p.adjust(p, method = "BH")
padj <- tibble(r,p,r2,padj)
x return(x)
}
<- split_sample(MHCIBA_IMM)
MHCI $Type <- "MHCI"
MHCImedian(MHCI$r2)
#> [1] 0.09680865
<- split_sample(MHCIIBA_IMM)
MHCII $Type <- "MHCII"
MHCIImedian(MHCII$r2)
#> [1] 0.2057099
<- rbind(MHCI,MHCII)
result ggplot(result,aes(r,fill = Type)) +
geom_density(alpha = 0.6) + theme_bw() + theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
ggsave("../figure/BA_IMM_r.pdf",dpi = 300,width = 6,height = 6)
ggplot(result,aes(r2,fill = Type)) +
geom_density(alpha = 0.6) + theme_bw()+ theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())+xlab(bquote(R^2))
ggsave("../figure/BA_IMM_r2.pdf",dpi = 300,width = 6,height = 6)
ggplot(result,aes(padj,fill = Type)) +
geom_density(alpha = 0.6) + theme_bw()+ theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
ggsave("../figure/BA_IMM_padj.pdf",dpi = 300,width = 6,height = 6)
The correlations were positive, significant but low in magnitude, median Pearson r2 = 0.0968 in the compare between MHC I binding affinity rank and MHC II immunogenicity rank; and median Pearson r2 = 0.2057 in the compare between MHC II binding affinity rank and MHC II immunogenicity rank . Overall, this analysis suggested that the TLimmuno2 is at least partially non-redundant with the binding affinity predictor.
immunoediting signal detect
We analyzed all neoepitopes in TCGA using TLimmuno2 and set cutoff to 2% to obtain neoantigens.
#PBS_batch
#python
import os
import pandas as pd
import re
= "/public/slst/home/wanggsh/TCGA-process/Prediction/IMM_rank2"
file_path3 def get_neoantigen(Data):
= pd.DataFrame()
DF for i in Data["Sample_Barcode"].unique():
= Data[Data["Sample_Barcode"] == i]
Data_sample for x in Data_sample["chr"].unique():
= Data_sample[Data_sample["chr"] == x]
Data_chr for start in Data_chr["start"].unique():
= Data_chr[Data_chr["start"] == start]
Data_str = []
prediction for y in Data_str["hla"].unique():
= Data_str[Data_str["hla"] == y]
Data_hla = Data_hla["Rank"].min()
pre
prediction.append(pre)"Result"] = min(prediction)
Data_str[= Data_str[["tumor_sample_barcode","Sample_Barcode","chr","start","end","ref","alt","Result"]]
Data_final = Data_final.drop_duplicates()
Data_final = pd.concat([DF,Data_final])
DF "label"] = DF["Result"].map(lambda x : 1 if x <0.02 else 0)
DF[return DF
file = os.listdir(file_path3)
for i in file:
= re.split(".csv_rank.csv",i)[0]
tumor = pd.read_csv("{}/{}".format(file_path3,i))
Data = get_neoantigen(Data)
result "".format(tumor)) result.to_csv(
Then we can calculate and compare the median actual CCF-ES with median simulation CCF-ES in pan-cancer and cancer-type:
#R
<- readRDS("all_mut_mis_ccf.rds")
ccf <- list.files("~/Project/MHCII/Data/TCGA/neoantigen_es/",full.names = TRUE)
files <- lapply(files,
res function(x){
read.csv(x) %>% select(-X)
})<- bind_rows(res)
mut_all saveRDS(mut_all,file = "mut_all.rds")
<- mut_all %>%
mut_all mutate(index=paste(substr(tumor_sample_barcode,1,16),
sep = ":")) %>%
chr,start,ref,alt,select(index,Result,label)
<- inner_join(
all_mut_ccf
mut_all,%>% select(-sample),
ccf by="index"
)<- all_mut_ccf[!duplicated(all_mut_ccf$index),]
all_mut_ccf $sample <- substr(all_mut_ccf$index,1,16)
all_mut_ccf<- all_mut_ccf %>% rename(gene = Hugo_Symbol)
all_mut_ccf <- data.table::fread("tcga_RSEM_gene_tpm.gz") #too large,not include
TPM_log2 <- data.table::fread("mapping_probe")
mapping <- mapping[,1:2]
mapping <- left_join(TPM_log2 %>% rename(id=sample),mapping) %>%
TPM_log2_mapping select(id,gene,everything())###log2(tpm+0.001)
<- as.data.frame(TPM_log2_mapping)
TPM_log2_mapping <- TPM_log2_mapping
tpm_trans 3:ncol(tpm_trans)] <- apply(TPM_log2_mapping[,3:ncol(TPM_log2_mapping)],2,
tpm_trans[,function(x){(2^x) - 0.001})
saveRDS(tpm_trans,file = "tpm_trans.rds")
<- readRDS("tpm_trans.rds")
tpm <- tpm[!duplicated(tpm$gene),]
tpm
$tpm_exp <- mapply(function(sample,gene){
all_mut_ccf$gene==gene,substr(sample,1,15)]
tpm[tpm$sample,all_mut_ccf$gene)
},all_mut_ccf
<- all_mut_ccf %>%
all_mut_ccf_tpm filter(lengths(tpm_exp)!=0)
$tpm_exp <- as.numeric(all_mut_ccf_tpm$tpm_exp)
all_mut_ccf_tpm
<- all_mut_ccf_tpm %>%
all_mut_ccf_tpm mutate(neo=ifelse(label==1 & tpm_exp>1,"neo","not_neo")) #neo antigen& log expression >1
saveRDS(all_mut_ccf_tpm,file = "mut_ccf_tpm_neo.rds")
<- function(dt){
cal_nes_warp <- vector("list",length = length(unique(dt$sample)))
results_ccf names(results_ccf) <- unique(dt$sample)
<- makeCluster(getOption("cl.cores", 15),type="FORK")
cl <- parSapply(cl=cl,names(results_ccf),
results_ccf function(x){
<- dt %>% filter(sample == x)
data <- NeoEnrichment::cal_nes_new_test(dt = data,
a sample_counts = 1000,
need_p = FALSE)
return(a)
simplify = FALSE)
},stopCluster(cl)
<- Filter(function(x){length(x)>1},results_ccf)
results_ccf <- bind_rows(results_ccf)
pancancer_nes_ccf return(pancancer_nes_ccf)
#ES score
}
<- readRDS("../data/30_mut_ccf_tpm_neo.rds")
all_mut_ccf <- all_mut_ccf %>%
all_mut_ccf rename(ccf=ccf_hat) %>%
mutate(neo=ifelse(neo=="neo","yes","no"))
<- all_mut_ccf %>%
samples_has_subclonal filter(ccf<0.6) %>%
select(sample) %>%
distinct(sample)
%>%
all_mut_ccf filter(sample %in% samples_has_subclonal$sample) %>%
group_by(sample) %>%
summarise(c_n=sum(neo=="yes"),c_m=sum(neo=="no")) %>% filter(c_n>=1 & c_m >=1) -> summ
<- readRDS("../data/driver_mutations.rds")
driver_mutations #reduce driver mutation
<- all_mut_ccf %>%
all_mut_ccf mutate(gene_protein_change=paste(gene,Protein_Change,sep = "-"))
<- driver_mutations %>%
driver_mutations mutate(gene_protein_change=paste(gene,protein_change,sep = "-"))
<- all_mut_ccf %>%
all_mut_ccf mutate(is_driver=ifelse(gene_protein_change %in% driver_mutations$gene_protein_change,"yes","no"))
sum(all_mut_ccf$is_driver=="yes" & all_mut_ccf$neo=="yes") / sum(all_mut_ccf$neo=="yes")##0.01618271
%>% group_by(sample) %>%
all_mut_ccf summarise(inter_gene=intersect(gene[neo=="yes"],
=="yes"])) -> aaa##701 samples
gene[is_driver<- all_mut_ccf %>%
all_mut_ccf mutate(sample_neo_index=paste(sample,neo,gene,sep = ","))
<- aaa %>% mutate(sample_neo_index=paste(sample,"yes",inter_gene,sep = ","))
aaa
%>%
all_mut_ccf mutate(in_aaa = ifelse(sample_neo_index %in% aaa$sample_neo_index,"yes","no")) %>%
group_by(sample) %>%
summarise(need_sample=ifelse(any(in_aaa=="yes"),"no","yes")) %>%
filter(need_sample=="yes") -> summ2
<- intersect(samples_has_subclonal$sample,summ2$sample)
need_samples %>% filter(!is.na(ccf)) %>%
all_mut_ccf filter(sample %in% need_samples) %>%
group_by(sample) %>%
summarise(c_n=sum(neo=="yes"),c_m=sum(neo=="no")) %>% filter(c_n>=1 & c_m >=1) -> summ
<- all_mut_ccf %>% filter(sample %in% summ$sample)
neo_missense <- neo_missense %>% select(sample,neo,ccf) %>% filter(!is.na(ccf))
neo_missense
library(parallel)
<- cal_nes_warp(neo_missense)
neo_nes saveRDS(neo_nes,file = "../data/30_neo_es.rds")
#R
<- readRDS("../data/30_neo_es.rds")
neo_nes <- readRDS("../data/30_mut_ccf_tpm_neo.rds")
all_mut_ccf <- all_mut_ccf %>%
all_mut_ccf rename(ccf=ccf_hat) %>%
mutate(neo=ifelse(neo=="neo","yes","no"))
<- all_mut_ccf %>%
neo_missense filter(sample %in% neo_nes$sample)
<- neo_missense %>%
neo_missense select(sample,neo,ccf) %>%
filter(!is.na(ccf))
<- function(dt){
cal_nes_warp <- vector("list",length = length(unique(dt$sample)))
results_ccf names(results_ccf) <- unique(dt$sample)
<- makeCluster(getOption("cl.cores", 15),type="FORK")
cl <- parSapply(cl=cl,names(results_ccf),
results_ccf function(x){
<- dt %>% filter(sample == x)
data <- NeoEnrichment::cal_nes_new_test(dt = data,
a sample_counts = 1000,
need_p = FALSE)
return(a)
simplify = FALSE)
},stopCluster(cl)
<- Filter(function(x){length(x)>1},results_ccf)
results_ccf <- bind_rows(results_ccf)
pancancer_nes_ccf return(pancancer_nes_ccf)
}<- vector("list",2000)
res for (i in 1:2000){
<- neo_missense %>%
neo_missense_sim group_by(sample) %>%
mutate(neo_sim=sample(neo,length(neo))) %>%
ungroup()
<- neo_missense_sim %>%
neo_missense_sim select(-neo) %>%
rename(neo=neo_sim)
<- cal_nes_warp(neo_missense_sim)
neo_nes_sim $sim_num <- i
neo_nes_sim<- neo_nes_sim
res[[i]] print(paste0("Complete ",i," sim. "))
}saveRDS(res,file = "30_sim_2000_not_filter_driver.rds")
<- readRDS("30_neo_es.rds")
neo_nes <- readRDS("30_sim_2000_not_filter_driver.rds")
sim_all <- bind_rows(sim_all)
sim_all $cancer <- get_cancer_type(sim_all$sample,parallel = TRUE,cores = 20)
sim_allsaveRDS(sim_all,file = "30_sim_add_cancer.rds")
#R
<- readRDS("../data/neo_es.rds")
neo_nes <- readRDS("../data/sim_add_cancer.rds")
sim_all %>%
sim_all group_by(cancer,sim_num) %>%
summarise(median_es=median(es)) -> summ2
#> `summarise()` has grouped output by 'cancer'. You can override using the
#> `.groups` argument.
$cancer <- get_cancer_type(neo_nes$sample)
neo_nes<- neo_nes %>%
neo_nes_summ group_by(cancer) %>% summarise(median_es=median(es))
%>%
sim_all group_by(sim_num) %>%
summarise(median_es=median(es)) -> summ
<- WVPlots::ShadedDensity(frame = summ,
p xvar = "median_es",
threshold = median(neo_nes$es),
title = "",
tail = "left")
$layers[[1]]$aes_params$colour <- "red"
p$layers[[1]]$aes_params$size <- 1
p$layers[[2]]$aes_params$fill <- "blue" #geom_ribbon
p$layers[[3]]$aes_params$colour <- "black"
p$layers[[3]]$aes_params$size <- 1
p#p$layers[[4]]$aes_params$label <- "Actual median ES" #geom_text
<- p + labs(x="Simulation median es")+
p1 theme_prism()
p1
ggsave("../figure/null_dis.pdf")
#> Saving 8 x 5 in image
##cancer type
<- neo_nes_summ %>%
neo_nes_summ rowwise() %>%
mutate(p=mean(summ2$median_es[summ2$cancer==cancer] <= median_es))
<- function(dt,pancancer_p,dt2,median_es){
get_f1 <- ggplot(data=dt,aes(x=1,y=es))+
p1 geom_violin(alpha=0.7,width=0.5)+
geom_boxplot(width=0.2)+
theme_prism()+
labs(x=paste0("median es = ",round(median_es,digits = 3),"\n n = ",nrow(dt)),y = "ESccf",size=5)+
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank())+
theme(text = element_text(size = 10))+
annotate(geom="text", x=1, y=1.1, label=paste0("p = ",pancancer_p),
color="red",size=4)
$cancer <- get_cancer_type(dt$sample)
dt%>% group_by(cancer) %>%
dt summarise(median_est=median(es),c=n()) %>%
arrange(median_est) %>%
mutate(label=paste0(cancer,"\n(n=",c,")"))-> summ1
<- left_join(summ1,dt2 %>% select(cancer,p))
summ1 $p <- signif(summ1$p,digits = 1)
summ1<- summ1 %>%
summ1 mutate(sig=case_when(
<= 0.05 & p > 0.01 ~ "*",
p < 0.01 ~ "**",
p TRUE ~ "ns"
))<- left_join(dt,summ1)
dt $label <- factor(dt$label,levels = summ1$label)
dt<- data.frame(x = 1:nrow(summ1), y = 1.1, family = summ1$sig)
df2 <- ggplot(data=dt,aes(x=label,y=es))+
p2 geom_boxplot()+
theme_prism()+
theme(axis.title.x = element_blank())+
theme(axis.text.x = element_text(angle = 45,vjust = 1, hjust = 1))+
theme(text = element_text(size = 7))+
geom_text(data=df2,aes(x=x,y=y,label = family))+
geom_hline(yintercept=0,
color = "red", size=1)+labs(y = "")
<- p1 + p2 + plot_layout(widths = c(1, 8))
p3
}
<- get_f1(neo_nes,pancancer_p = "< 0.0005",
p2 dt2 = neo_nes_summ,median_es = median(neo_nes$es))
#> Joining, by = "cancer"
#> Joining, by = "cancer"
p2
ggsave("../figure/immuno_edit.pdf",width = 12,height = 5,dpi = 300)
we also use 30% cutoff to define neoantigen and calculate ESccf score:
#R
<- readRDS("../data/30_neo_es.rds")
neo_nes <- readRDS("../data/30_sim_add_cancer.rds")
sim_all %>%
sim_all group_by(cancer,sim_num) %>%
summarise(median_es=median(es)) -> summ2
#> `summarise()` has grouped output by 'cancer'. You can override using the
#> `.groups` argument.
$cancer <- get_cancer_type(neo_nes$sample)
neo_nes<- neo_nes %>%
neo_nes_summ group_by(cancer) %>% summarise(median_es=median(es))
%>%
sim_all group_by(sim_num) %>%
summarise(median_es=median(es)) -> summ
<- WVPlots::ShadedDensity(frame = summ,
p xvar = "median_es",
threshold = median(neo_nes$es),
title = "",
tail = "left")
$layers[[1]]$aes_params$colour <- "red"
p$layers[[1]]$aes_params$size <- 1
p$layers[[2]]$aes_params$fill <- "blue" #geom_ribbon
p$layers[[3]]$aes_params$colour <- "black"
p$layers[[3]]$aes_params$size <- 1
p#p$layers[[4]]$aes_params$label <- "Actual median ES" #geom_text
<- p + labs(x="Simulation median es")+
p1 theme_prism()
p1
ggsave("../figure/null_dis_30.pdf")
#> Saving 8 x 5 in image
##cancer type
<- neo_nes_summ %>%
neo_nes_summ rowwise() %>%
mutate(p=mean(summ2$median_es[summ2$cancer==cancer] <= median_es))
<- function(dt,pancancer_p,dt2,median_es){
get_f1 <- ggplot(data=dt,aes(x=1,y=es))+
p1 geom_violin(alpha=0.7,width=0.5)+
geom_boxplot(width=0.2)+
theme_prism()+
labs(x=paste0("median es = ",round(median_es,digits = 3),"\n n = ",nrow(dt)),y = "ESccf",size=5)+
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank())+
theme(text = element_text(size = 10))+
annotate(geom="text", x=1, y=1.1, label=paste0("p = ",pancancer_p),
color="red",size=4)
$cancer <- get_cancer_type(dt$sample)
dt%>% group_by(cancer) %>%
dt summarise(median_est=median(es),c=n()) %>%
arrange(median_est) %>%
mutate(label=paste0(cancer,"\n(n=",c,")"))-> summ1
<- left_join(summ1,dt2 %>% select(cancer,p))
summ1 $p <- signif(summ1$p,digits = 1)
summ1<- summ1 %>%
summ1 mutate(sig=case_when(
<= 0.05 & p > 0.01 ~ "*",
p < 0.01 ~ "**",
p TRUE ~ "ns"
))<- left_join(dt,summ1)
dt $label <- factor(dt$label,levels = summ1$label)
dt<- data.frame(x = 1:nrow(summ1), y = 1.1, family = summ1$sig)
df2 <- ggplot(data=dt,aes(x=label,y=es))+
p2 geom_boxplot()+
theme_prism()+
theme(axis.title.x = element_blank())+
theme(axis.text.x = element_text(angle = 45,vjust = 1, hjust = 1))+
theme(text = element_text(size = 7))+
geom_text(data=df2,aes(x=x,y=y,label = family))+
geom_hline(yintercept=0,
color = "red", size=1)+labs(y = "")
<- p1 + p2 + plot_layout(widths = c(1, 8))
p3
}
<- get_f1(neo_nes,pancancer_p = "0.14",
p2 dt2 = neo_nes_summ,median_es = median(neo_nes$es))
#> Joining, by = "cancer"
#> Joining, by = "cancer"
p2
ggsave("../figure/immuno_edit_30.pdf",width = 12,height = 5,dpi = 300)
In TCGA pan-cancer cohort, when samples with antigenic and driver mutations lying on the same gene are removed, the observed median ESCCF is -0.024 (n = 5,524, P-value < 0.0005). By comparing simulated median ES distribution, several cancer types including adrenocortical carcinoma (ACC), Thymoma (THYM), Esophageal carcinoma (ESCA), Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), Cervical squamous cell carcinoma and endocervical adenocarcinoma (LUAD) show significant low ESCCF values
<- readRDS("../data/net_sim_add_cancer.rds")
sim_all <- readRDS("../data/net_neo_es.rds")
neo_nes %>%
sim_all group_by(cancer,sim_num) %>%
summarise(median_es=median(es)) -> summ2
#> `summarise()` has grouped output by 'cancer'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'cancer'. You can override using the
#> `.groups` argument.
$cancer <- get_cancer_type(neo_nes$sample)
neo_nes<- neo_nes %>%
neo_nes_summ group_by(cancer) %>% summarise(median_es=median(es))
%>%
sim_all group_by(sim_num) %>%
summarise(median_es=median(es)) -> summ
<- WVPlots::ShadedDensity(frame = summ,
p xvar = "median_es",
threshold = median(neo_nes$es),
title = "",
tail = "left")
$layers[[1]]$aes_params$colour <- "red"
p$layers[[1]]$aes_params$size <- 1
p$layers[[2]]$aes_params$fill <- "blue" #geom_ribbon
p$layers[[3]]$aes_params$colour <- "black"
p$layers[[3]]$aes_params$size <- 1
p#p$layers[[4]]$aes_params$label <- "Actual median ES" #geom_text
<- p + labs(x="Simulation median es")+
p1 theme_prism()
p1
ggsave("../figure/net_null_dis.pdf")
#> Saving 8 x 5 in image
##cancer type
<- neo_nes_summ %>%
neo_nes_summ rowwise() %>%
mutate(p=mean(summ2$median_es[summ2$cancer==cancer] <= median_es))
<- function(dt,pancancer_p,dt2,median_es){
get_f1 <- ggplot(data=dt,aes(x=1,y=es))+
p1 geom_violin(alpha=0.7,width=0.5)+
geom_boxplot(width=0.2)+
theme_prism()+
labs(x=paste0("median es = ",round(median_es,digits = 3),"\n n = ",nrow(dt)),size=7)+
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank())+
theme(text = element_text(size = 10))+
annotate(geom="text", x=1, y=1.1, label=paste0("p = ",pancancer_p),
color="red",size=4)+labs(y = "ESccf",size = 8)
$cancer <- get_cancer_type(dt$sample)
dt%>% group_by(cancer) %>%
dt summarise(median_est=median(es),c=n()) %>%
arrange(median_est) %>%
mutate(label=paste0(cancer,"\n(n=",c,")"))-> summ1
<- left_join(summ1,dt2 %>% select(cancer,p))
summ1 $p <- signif(summ1$p,digits = 1)
summ1<- summ1 %>%
summ1 mutate(sig=case_when(
<= 0.05 & p > 0.01 ~ "*",
p < 0.01 ~ "**",
p TRUE ~ "ns"
))<- left_join(dt,summ1)
dt $label <- factor(dt$label,levels = summ1$label)
dt<- data.frame(x = 1:nrow(summ1), y = 1.1, family = summ1$sig)
df2 <- ggplot(data=dt,aes(x=label,y=es))+
p2 geom_boxplot()+
theme_prism()+
theme(axis.title.x = element_blank())+
theme(axis.text.x = element_text(angle = 45,vjust = 1, hjust = 1))+
theme(text = element_text(size = 7))+
geom_text(data=df2,aes(x=x,y=y,label = family))+
geom_hline(yintercept=0,
color = "red", size=1)+labs(y = "")
<- p1 + p2 + plot_layout(widths = c(1, 8))
p3
}
<- get_f1(neo_nes,pancancer_p = 0.0005,
p2 dt2 = neo_nes_summ,median_es = median(neo_nes$es))
#> Joining, by = "cancer"
#> Joining, by = "cancer"
p2
ggsave("../figure/net_immunoediting.pdf",width = 12,height = 5,dpi = 300)