TLimmuno2: predicting MHC class II antigen immunogenicity through transfer learning

This report is written to help readers understand what we did in this project and how we did it. Please read the formal manuscript "TLimmuno2: predicting MHC class II antigen immunogenicity through transfer learning" for more details.

This document is compiled from an Rmarkdown file which contains all the code and descriptions necessary to (auto-)reproduce the analysis for the accompanying project. Each section below describes a different component of the analysis, and all numbers and figures are generated directly from the underlying data on compilation.

Dependencies

This project depends on Python software, Python packages, R software, R packages and other tools:

Python:

  • pandas - a flexible and easy-to-use data analysis and manipulation tool
  • numpy - the fundamental package for scientific computing with Python
  • tensorflow - an open source platform for machine learning
  • sklearn - machine learning in Python
  • matplotlib - plotting figures in Python
  • seaborn - a visualization library based on matplotlib
  • sys, os, re - base Python packages

R:

  • reticulate - R interface to Python
  • NeoEnrichment - neoantigen enrichment analysis for this project
  • tidyverse - tidy data manipulation
  • data.table - fast data reading and wrangling
  • survival - survival analysis
  • survminer - plotting survival fits
  • DT - render data tables in HTML
  • patchwork - arrange figures into complex compound figures
  • knitr, rmdformats - used to compile this file

Shell:

  • netMHCIIpan (IEDB) - used to predict peptide binding affinity for MHC II molecules (downloaded from IEDB; contains the BA and EL methods)
  • IEDB tools - predict allele-independent CD4 T cell immunogenicity at the population level
  • MixMHC2pred - a pan-allele predictor of MHC class II ligands and epitopes
  • Repitope - a structured framework for quantitative prediction of immunogenicity
  • annovar - annotate genetic variants detected from diverse genomes

Summary figure

summary figure

Basic information

Here are some common scripts used in our project:

Of note, some steps were run on a Linux server provided by the High Performance Computing (HPC) Service of ShanghaiTech University to reduce run time, with jobs submitted through the Portable Batch System (PBS); we mark such scripts in the header with #PBS. The analysis also mixes R and Python, so readers who want to reproduce this work should focus on the key code lines.

1. PBS script format: we change the job name, output file paths and the Python file for each job before submitting it to the HPC. Files submitted via PBS are not compiled by this report.

#PBS -N <name>
#PBS -o <out file path>
#PBS -e <out file path>
#PBS -l nodes=1:ppn=1
#PBS -l walltime=20000:00:00  
#PBS -S /bin/bash   
#PBS -q pub_gpu 

if [ -f "/public/slst/home/wanggsh/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/public/slst/home/wanggsh/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/public/slst/home/wanggsh/miniconda3/bin:$PATH"
    fi

conda activate tensorflow
python <Python file>
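
A script like this is submitted to the scheduler with qsub (e.g. qsub <name>.pbs); the batch helper in item 3 below automates this for many jobs.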

2. BLOSUM62 encoder: we use the BLOSUM62 matrix to encode peptides and MHC pseudo-sequences into numeric vectors. The BLOSUM62 matrix was downloaded from the NCBI website. The amino acid alphabet is ACDEFGHIKLMNPQRSTVWYX, where X is the general wildcard character used for padding. You can find more about this encoding at UCSC and MathWorks. The following is the script:

#python
import pandas as pd
import numpy as np

Blosum62_matrix = pd.read_csv(r"~/Data/MHCII/BLOSUM62.csv",comment="#")
Protein_alphabet = list("ARNDCQEGHILKMFPSTWYVX")
Blosum62_matrix = Blosum62_matrix[Protein_alphabet]
Blosum62_matrix = Blosum62_matrix.loc[Protein_alphabet]
Blosum62_matrix

def blosum62(peptide,maxlen):
    encoder = np.empty((maxlen,21))
    if len(peptide) <=maxlen:
        peptide = peptide + "X"*(maxlen-len(peptide))
    for i in range(len(peptide)):
        pep = list(peptide)[i]
        coder = Blosum62_matrix[pep]
        encoder[i] = coder
    return encoder.flatten()

if __name__ == "__main__":
    # example call: encode a 13-mer peptide, padded to 21 positions
    blosum62("PKYVKQNTLKLAT", 21)
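
For example, blosum62("PKYVKQNTLKLAT", 21) (a hypothetical 13-mer) returns a flattened vector of length 441 (21 positions × 21 alphabet characters), with the 8 padding positions encoded as the wildcard X.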

3. Creating a batch of PBS jobs: in some steps the data are too large, and even the HPC needs a long time to finish. We therefore create a batch of PBS jobs to speed up the analysis. These scripts are marked with PBS_batch in the header. Files submitted via PBS are not compiled by this report.

#python
import os
file_path = <file path>
file = os.listdir(file_path)
pbs_path = <pbs path>
for i in file:
    pbs_text = r"""#PBS -N {0}
#PBS -o <out file path>
#PBS -e <out file path>
#PBS -l nodes=1:ppn=20
#PBS -l walltime=20000:00:00  
#PBS -S /bin/bash   
#PBS -q pub_gpu 

if [ -f "/public/slst/home/wanggsh/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/public/slst/home/wanggsh/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/public/slst/home/wanggsh/miniconda3/bin:$PATH"
    fi

conda activate tensorflow
python <Python file>
""".format(i)
    os.system("echo \"{0}\" > {1}/{2}.pbs".format(pbs_text,pbs_path,i))
    os.system("qsub {0}/{1}.pbs".format(pbs_path,i))
    print("{} submitted".format(i))

4. Background sequences: TLimmuno2 outputs a continuous variable between 0 and 1 and also a percentile rank. We downloaded all human protein sequences from the UniProt database (2022.3) and randomly sampled peptides from them, generating a set of 90,000 13-21-mer random natural peptides (10,000 of each length). For each peptide-MHC II pair, the percentile rank is computed against this background set; a smaller rank indicates stronger immunogenicity. Results may differ slightly between runs because of the random sampling.

#python
import pandas as pd
import numpy as np
import random
#uniport data
f=open("../data/uniport_human.fasta")
ls=[]
for line in f:
        if not line.startswith('>'):
                ls.append(line.replace('\n',''))
f.close()
uniport = pd.DataFrame(ls)
uniport["length"] = uniport[0].map(len)
bool = uniport["length"]>30
uniport = uniport[bool]
bool = uniport[0].apply(lambda x: set(list(x)).issubset(set(list("ARNDCQEGHILKMFPSTWYVX"))))
uniport = uniport[bool]
uniport_random = uniport[0].values
def random_select(x,length,num,allele):
    df = []
    all = []
    for i in range(num):
        protein_num = random.randint(0,len(x)-1)
        protein = x[protein_num]
        peptide_num = random.randint(0,len(protein)-length)
        peptide = protein[peptide_num:peptide_num+length]
        df.append(peptide)
        all.append(allele)
    return df,all
Pep = []
for i in range(13,22): # lengths 13-21 inclusive (9 lengths x 10,000 = 90,000 peptides)
    pep,all = random_select(uniport_random,i,10000,"ALL")
    Pep.extend(pep)
IMM_bg_pep = pd.DataFrame({"pep":Pep})
#IMM_bg_pep.to_csv("../data/IMM_bg_pep.csv")
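
The percentile-rank computation itself is not shown in this chunk. Below is a minimal sketch of the idea, assuming bg_scores holds the model's predictions for the 90,000 background peptides on the allele of interest and score is the prediction for the query peptide-MHC pair:

#python
import numpy as np

def percentile_rank(score, bg_scores):
    # fraction of background peptides scoring at least as high as the query,
    # expressed as a percentage; a smaller rank means stronger predicted immunogenicity
    bg_scores = np.asarray(bg_scores)
    return 100.0 * np.mean(bg_scores >= score)

# hypothetical usage:
# rank = percentile_rank(0.93, bg_scores)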

5. ROC significance: we calculate the p-value for the difference between two ROC curves using DeLong's test.

import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as st
from sklearn import metrics

class DelongTest():
    def __init__(self,preds1,preds2,label,threshold=0.05):
        '''
        preds1:the output of model1
        preds2:the output of model2
        label :the actual label
        '''
        self._preds1=preds1
        self._preds2=preds2
        self._label=label
        self.threshold=threshold
        self.show_result()

    def _auc(self,X, Y)->float:
        return 1/(len(X)*len(Y)) * sum([self._kernel(x, y) for x in X for y in Y])

    def _kernel(self,X, Y)->float:
        '''
        Mann-Whitney statistic
        '''
        return .5 if Y==X else int(Y < X)

    def _structural_components(self,X, Y)->list:
        V10 = [1/len(Y) * sum([self._kernel(x, y) for y in Y]) for x in X]
        V01 = [1/len(X) * sum([self._kernel(x, y) for x in X]) for y in Y]
        return V10, V01

    def _get_S_entry(self,V_A, V_B, auc_A, auc_B)->float:
        return 1/(len(V_A)-1) * sum([(a-auc_A)*(b-auc_B) for a,b in zip(V_A, V_B)])
    
    def _z_score(self,var_A, var_B, covar_AB, auc_A, auc_B):
        return (auc_A - auc_B)/((var_A + var_B - 2*covar_AB )**(.5)+ 1e-8)

    def _group_preds_by_label(self,preds, actual)->list:
        X = [p for (p, a) in zip(preds, actual) if a]
        Y = [p for (p, a) in zip(preds, actual) if not a]
        return X, Y

    def compute_z_p(self):
        X_A, Y_A = self._group_preds_by_label(self._preds1, self._label)
        X_B, Y_B = self._group_preds_by_label(self._preds2, self._label)

        V_A10, V_A01 = self._structural_components(X_A, Y_A)
        V_B10, V_B01 = self._structural_components(X_B, Y_B)

        auc_A = self._auc(X_A, Y_A)
        auc_B = self._auc(X_B, Y_B)

        # Compute entries of covariance matrix S (covar_AB = covar_BA)
        var_A = (self._get_S_entry(V_A10, V_A10, auc_A, auc_A) * 1/len(V_A10)+ self._get_S_entry(V_A01, V_A01, auc_A, auc_A) * 1/len(V_A01))
        var_B = (self._get_S_entry(V_B10, V_B10, auc_B, auc_B) * 1/len(V_B10)+ self._get_S_entry(V_B01, V_B01, auc_B, auc_B) * 1/len(V_B01))
        covar_AB = (self._get_S_entry(V_A10, V_B10, auc_A, auc_B) * 1/len(V_A10)+ self._get_S_entry(V_A01, V_B01, auc_A, auc_B) * 1/len(V_A01))

        # Two tailed test
        z = self._z_score(var_A, var_B, covar_AB, auc_A, auc_B)
        p = st.norm.sf(abs(z))*2

        return z,p

    def show_result(self):
        z,p=self.compute_z_p()
        return z,p
        #print(f"z score = {z:.5f};\np value = {p:.5f};")
        #if p < self.threshold :print("There is a significant difference")
        #else:        print("There is NO significant difference")
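
A minimal usage sketch with made-up scores (the two prediction vectors must come from the same labeled samples):

#python
preds_model1 = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
preds_model2 = [0.6, 0.4, 0.7, 0.5, 0.8, 0.3]
labels = [1, 1, 1, 0, 0, 0]
z, p = DelongTest(preds_model1, preds_model2, labels).compute_z_p()
print("z = {:.3f}, p = {:.3f}".format(z, p))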

The construction process of the BA model

This part focuses on binding affinity (BA) model construction. It contains the following sections:

Training data

Data download

The same binding affinity data used for training NetMHCIIpan were used to train our model. You can find the data in IEDB; the details are described below:

1. Data processing: combine & drop duplicates

#R
path <- "../data/NetMHCIIpan_train/"
read_file <- function(name,path) {
  df = tibble()
  for (i in 1:5) {
    x <-  data.table::fread(str_c(path,name,i,".txt"),data.table = F,header = F)
    df <- rbind(df,x)
  }
  return(df)
}
train_BA <- read_file("train_BA",path)
test_BA <- read_file("test_BA",path)

pseudo_sequence <- data.table::fread("../data/pseudosequence.2016.all.X.dat",data.table = F,header = F)#pseudo seq for MHC
BA <- rbind(train_BA,test_BA)
BA_filtered <- BA %>% filter(str_detect(V3,"D")) %>% select(V1,V2,V3)
colnames(pseudo_sequence) <- c("MHC","sequence")
colnames(BA_filtered) <- c("peptide","IC50","MHC")
BA_filtered_MHC <- left_join(BA_filtered,pseudo_sequence)
#> Joining, by = "MHC"
#na.omit(BA_filter_MHC)
BA_filtered_MHC <- unique(BA_filtered_MHC)
dim(BA_filtered_MHC)
#> [1] 107008      4
length(table(unique(BA_filtered_MHC$MHC)))
#> [1] 71

These data comprise 107,008 measurements of peptide–MHC binding affinity covering 71 types of class II MHCs from humans.

2. Convert normalized IC50 to binary labels: IC50 values are normalized by the log50k transform (i.e. 1 - log(aff nM)/log(50000)), and a threshold of 500 nM is used. Since 1 - log(500)/log(50000) ≈ 0.426, peptides with a transformed binding affinity greater than 0.426 are classified as binders (label: 1).
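
This cutoff can be verified directly from the transform:

#python
import math
threshold = 1 - math.log(500)/math.log(50000) # 1 - log50k(500 nM)
print(round(threshold,3))
#> 0.426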

#R
BA_filtered_MHC["label"] = ifelse(BA_filtered_MHC$IC50>0.426,1,0) #set positive and negative
write_csv(BA_filtered_MHC,"../data/BA_MHC_label.csv")
BA_allele <- unique(BA_filtered_MHC$MHC)
write_lines(BA_allele,"../data/BA_allele.csv")

Show data features

#python
BA = pd.read_csv("../data/BA_MHC_label.csv")
BA["length"] = BA["peptide"].map(len)
BA_positive = BA[BA["label"] == 1]
BA_negative = BA[BA["label"] == 0]

def MHC_dis_plot(Data,color,colname):
    value = Data[colname].value_counts().values
    index = Data[colname].value_counts().index
    norm_value = value/sum(value)
    a = list(index[0:15])
    b = list(norm_value[0:15])
    a.append("other ({})".format(len(value)-15))
    b.append(sum(norm_value[15:]))
    fig,ax = plt.subplots(figsize = (8,7))
    plt.bar(a,b,color = color)
    plt.ylim([0,0.45])
    plt.yticks([0.00,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40])
    plt.xticks(rotation = 90)

We show the distribution of alleles for the positive and negative labels:

#python
MHC_dis_plot(BA_positive,"green","MHC")
plt.tight_layout()
plt.savefig("../figure/BA_MHC_positive.pdf",dpi = 300,transparent=True)
plt.show()

#python
MHC_dis_plot(BA_negative,"red","MHC")
plt.tight_layout()
plt.savefig("../figure/BA_MHC_negative.pdf",dpi = 300,transparent=True)
plt.show()

The converted affinity values of negative samples center around 0.2, while those of positive samples center around 0.6:

#python
sns.boxplot(data = BA,x="label",y="IC50")
plt.ylim([-0.1,1.1])
#> (-0.1, 1.1)
plt.xlabel("Label")
plt.ylabel("Binding affinity")
plt.yticks(np.arange(0,1.2,0.2))
plt.xticks((0,1),labels = ["Negative","Positive"])
plt.savefig("../figure/BA_label_boxplot.pdf",dpi = 300,transparent=True)
plt.show()

15-mer peptides account for the largest proportion of the binding affinity data.

#python
pos_len_value = list(BA_positive["length"].value_counts().values)
pos_len_index = BA_positive["length"].value_counts().index
norm_pos_len_value = BA_positive["length"].value_counts().values/sum(BA_positive["length"].value_counts())
neg_len_value = list(BA_negative["length"].value_counts().values)
neg_len_index = BA_negative["length"].value_counts().index
norm_neg_len_value = BA_negative["length"].value_counts().values/sum(BA_negative["length"].value_counts())
fig,ax = plt.subplots()
width = 0.35
x = pos_len_index
ax.bar(pos_len_index - width/2, norm_pos_len_value,width,label='binder',color = "green")
#> <BarContainer object of 9 artists>
ax.bar(neg_len_index + width/2, norm_neg_len_value, width, label='non-binder',color = "red")
#> <BarContainer object of 9 artists>
ax.set_ylabel('proportion')
ax.set_xlabel('length')
ax.set_title('Distribution of Length')
ax.set_xticks(x)
ax.legend()
plt.savefig("../figure/length_dis_label.pdf",dpi = 300,transparent=True)
plt.show()

BA model training

Unlike NetMHCIIpan 4.0, we use LSTM layers to construct our model. Antigen sequences and MHC pseudo-sequences are used as inputs. We stack two LSTM layers with output sizes of 128 and 64 on top of each of the antigen and MHC inputs. The flattened LSTM outputs of antigen and MHC are concatenated into a single vector. This is followed by three dense layers with 100, 50, and 10 neurons activated by ReLU, and the output layer is a single-neuron dense layer with sigmoid activation.

The inputs of this model are the peptide sequence and the MHC pseudo-sequence; the output is whether the peptide binds to the MHC molecule. We used 5-fold cross-validation, and the training results are shown in the training history.

Structure

Here is the script used to train the BA model; this job was run on the HPC:

#PBS
#python
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split,KFold
from sklearn.metrics import auc,precision_recall_curve,roc_curve,confusion_matrix
import sys
sys.path.append("/public/slst/home/wanggsh/Project/MHCII/utils")
from NN_data import NN_Data
from Blosum62 import blosum62
import matplotlib.pyplot as plt
from collections import Counter
import os

version = 2

BA = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII/netMHCpanII4.0_train_data/BA_data/BA_MHC_label.csv")
BA = BA.drop_duplicates()
BA = BA.reset_index(drop=True)
#pep
BA["pep_blosum"] = BA["peptide"].apply(blosum62,args=(21,))
BA["MHC_blosum"] = BA["sequence"].apply(blosum62,args=(34,))
peptide = np.empty((len(BA),21,21))
for i in range(len(BA)):
    peptide[i] = BA["pep_blosum"][i].reshape((21,21))
#mhc 
MHC = np.empty((len(BA),34,21))
for i in range(len(BA)):
    MHC[i] = BA["MHC_blosum"][i].reshape((34,21))
#label
labels = BA["label"].values

#class weight
neg, pos = np.bincount(BA["label"])
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))
weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)
class_weight = {0: weight_for_0, 1: weight_for_1}
print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))

#split test
pep_Train,pep_test,MHC_Train,MHC_test,label_Train,label_test = train_test_split(peptide,MHC,labels,test_size=0.1,random_state=202205,stratify=labels)#random_state=202205
print(Counter(label_Train))
print(Counter(label_test))

def ELmodel():
    inputA = tf.keras.Input(shape = (21,21))
    inputB = tf.keras.Input(shape = (34,21))
    
    x = tf.keras.layers.LSTM(128,return_sequences=True)(inputA)
    x = tf.keras.layers.LSTM(64,return_sequences=True)(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.Model(inputs = inputA,outputs = x)

    y = tf.keras.layers.LSTM(128,return_sequences=True)(inputB)
    y = tf.keras.layers.LSTM(64,return_sequences=True)(y)
    y = tf.keras.layers.Flatten()(y)
    y = tf.keras.Model(inputs = inputB,outputs = y)

    combined = tf.keras.layers.concatenate([x.output,y.output])

    z = tf.keras.layers.Dense(100,activation="relu")(combined)
    z = tf.keras.layers.Dense(50,activation = "relu")(z)
    z = tf.keras.layers.Dense(10,activation = "relu")(z)
    z = tf.keras.layers.Dense(1,activation = "sigmoid")(z)

    model = tf.keras.Model(inputs = [inputA,inputB],outputs = z)

    return model


callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
Loss = keras.losses.BinaryCrossentropy()
Optimizer = keras.optimizers.Adam(learning_rate = 0.00008)
Metrics = [tf.keras.metrics.AUC(),tf.keras.metrics.BinaryAccuracy(),tf.keras.metrics.AUC(curve = "PR"),tf.keras.metrics.Precision()] 
Batch_size= 512
Epochs= 100
Verbose = 2 #reduce information output

print("Cross Validation")
kf = KFold(n_splits=5,shuffle = True)
ROC = {}
PR = {}
i = 1
for train_index,val_index in kf.split(label_Train):
    pep_train,MHC_train,label_train = pep_Train[train_index],MHC_Train[train_index],label_Train[train_index]
    pep_val,MHC_val,label_val = pep_Train[val_index],MHC_Train[val_index],label_Train[val_index]
    print(Counter(label_train))
    print(Counter(label_val))
    model = ELmodel()
    model.compile(
        loss = Loss,
        optimizer = Optimizer,
        metrics = Metrics)
    #callback
    history = model.fit([pep_train,MHC_train],label_train, # fit on this fold's training split
        batch_size=Batch_size,epochs=Epochs,
        validation_data = ([pep_val,MHC_val],label_val),
        callbacks = callback,verbose = Verbose,
        class_weight=class_weight)
    His_df = pd.DataFrame(history.history)
    His_df.to_csv("/public/slst/home/wanggsh/Project/Saved_model/BA_model_history/corss_history{0}.csv".format(i))
    prediction = model.predict([pep_val,MHC_val])
    fpr,tpr,thresholds = roc_curve(label_val,prediction,pos_label=1)
    area_mine = auc(fpr,tpr)
    ROC["flod{}".format(i)] = [fpr,tpr,thresholds,area_mine]
    print("=============={} auc data finish==========".format(i))
    precision,recall,thresholds = precision_recall_curve(label_val,prediction,pos_label=1)
    area_PR = auc(recall,precision)
    PR["flod{}".format(i)] = [recall,precision,thresholds,area_PR]
    print("=============={} pr data finish==========".format(i))
     i = i+1

#----plot cross fig
ROC_df = pd.DataFrame(ROC)
fig = plt.figure()
lw = 2
ax = fig.add_subplot(111)
ax.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
for i in range(5):
    ax.plot(ROC_df.iloc[0,i],ROC_df.iloc[1,i],label='fold{0} (area = {1:.4f})'.format(i,ROC_df.iloc[3,i]))
ax.set_ylim([0,1.05])
ax.set_xlim([0,1])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver operating characteristic example')
ax.legend(loc="lower right")
plt.savefig("/public/slst/home/wanggsh/Project/Saved_model/Model_fig/BA_model/BA_ROC_cross.pdf",dpi = 300)
PR_df = pd.DataFrame(PR)
plt.figure()
lw = 2
for i in range(5):
    plt.plot(PR_df.iloc[0,i],PR_df.iloc[1,i],lw=lw, label = 'fold{0} (area = {1:.4f})'.format(i,PR_df.iloc[3,i]))
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR curve')
plt.legend(loc="lower right")
plt.savefig("/public/slst/home/wanggsh/Project/Saved_model/Model_fig/BA_model/BA_PR_cross.pdf",dpi = 300)
print("start model training")

model = ELmodel()

model.compile(
        loss = Loss,
        optimizer = Optimizer,
        metrics = Metrics)
#callback
history = model.fit([pep_Train,MHC_Train],label_Train,
        batch_size=Batch_size,epochs=Epochs,
        validation_data = ([pep_test,MHC_test],label_test),
        callbacks = callback,verbose = Verbose,
        class_weight=class_weight)
model.save(r"/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")

print("start plt fig")
His_df = pd.DataFrame(history.history)
His_df.to_csv("/public/slst/home/wanggsh/Project/Saved_model/BA_model_history/history.csv")

prediction = model.predict([pep_test,MHC_test])
Train_pre = model.predict([pep_Train,MHC_Train])
fpr,tpr,thresholds = roc_curve(label_test,prediction,pos_label=1)
train_fpr,train_tpr,_ = roc_curve(label_Train,Train_pre,pos_label=1)
area_mine = auc(fpr,tpr)
area_mine_train = auc(train_fpr,train_tpr)
print(area_mine)
print(area_mine_train)

precision,recall,thresholds = precision_recall_curve(label_test,prediction,pos_label=1)
precision_train,recall_train,_ = precision_recall_curve(label_Train,Train_pre,pos_label=1)
area_PR = auc(recall,precision)
area_PR_train = auc(recall_train,precision_train)

fig = plt.figure()
lw = 2
ax = fig.add_subplot(111)
ax.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot(train_fpr,train_tpr,label='Train (AUC = {0:.4f})'.format(area_mine_train))
ax.plot(fpr,tpr,label='Test (AUC = {0:.4f})'.format(area_mine))
ax.set_ylim([0,1.05])
ax.set_xlim([0,1])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver operating characteristic example')
ax.legend(loc="lower right")
plt.savefig("/public/slst/home/wanggsh/Project/Saved_model/Model_fig/BA_model/ROC.pdf",dpi = 300)

plt.figure()
lw = 2
plt.plot(recall_train,precision_train,lw=lw, label = 'Train (AUC = {0:.4f})'.format(area_PR_train))
plt.plot(recall,precision,lw=lw, label = 'Test (AUC = {0:.4f})'.format(area_PR))
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR curve example')
plt.legend(loc="lower right")

plt.savefig("/public/slst/home/wanggsh/Project/Saved_model/Model_fig/BA_model/PR.pdf",dpi = 300)

print("finish")

Training history figures

We use 5-fold cross-validation to validate our model. The training results are very stable: the average ROC AUC is 0.97 and the average PR AUC is 0.96.

BA cross ROC

BA cross PR

Held-out dataset:

BA ROC

BA PR

BA model performance

We compared the performance of our BA model with NetMHCIIpan 4.0 (BA mode) and MixMHC2pred on an independent dataset.

Benchmark data preprocessing

We collected the independent "Automated" dataset from Andreatta et al. and dropped entries that overlap with the training dataset. We also converted binding affinities into binary labels using the 500 nM threshold.

#R
Dataset <- read_table("../data/mhcii_initial_benchmark_datasets.tsv")
Dataset_binary <- Dataset %>% filter(measurement_type == "binary")
Dataset_ic50 <- Dataset %>% filter(measurement_type == "ic50")
fun <- function(x){
  if(x<=500){y=1}
  else{y=0}
  return(y)
}
quativate <- sapply(Dataset_ic50$measurement_value, fun)
Dataset_ic50$measurement_value <- quativate
Dataset_final <- rbind(Dataset_binary,Dataset_ic50)
pseduo_seq <- read_table("../data/pseudosequence.2016.all.X.dat",col_names = c("allele","sequence"))
fun <- function(x){
  if (str_detect(x,"\\/")){
    x = str_remove_all(x,"\\*")
    x = str_remove_all(x,"\\:")
    x = str_replace(x,"\\/","-")
  }
  else{
    x = str_replace(x,"HLA-","")
    x = str_replace(x,"\\*","_")
    x = str_replace(x,"\\:","")
  }
  return(x)
}
Dataset_final$allele1 = sapply(Dataset_final$allele,fun)
Dataset_final <- left_join(Dataset_final,pseduo_seq, by = c("allele1" = "allele"))
Dataset_final <-  na.omit(Dataset_final)
Dataset_final$length <- sapply(Dataset_final$peptide_sequence,nchar)
Dataset_final <- Dataset_final %>% filter(length>=13&length<=21)
BA <- read_csv("../data/BA_MHC_label.csv")
Automated_filter <- anti_join(Dataset_final,BA,by = c("peptide_sequence" = "peptide","allele1"="MHC")) #Delete the data contained in the training dataset
write_csv(Dataset_final,"../data/Automated_benchmark.csv")
write_csv(Automated_filter,"../data/Automated_benchmark_filter.csv")
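
For example, the allele-renaming function fun maps HLA-DRB1*01:01 to DRB1_0101 and HLA-DQA1*05:01/DQB1*02:01 to HLA-DQA10501-DQB10201, matching the allele names in the pseudo-sequence file.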

The filtered Automated dataset is shown below:

DT::datatable(Automated_filter,rownames = FALSE)

Result

The process of using the different tools is shown below; all processes were run on the HPC.

BA model:

#python
import tensorflow as tf
import pandas as pd
import numpy as np
model = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model")
Data = pd.read_feather("/public/slst/home/wanggsh/Data/MHCII/Benchmark/Automated_benchmark_filter.feather")
BA_allele = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII/netMHCpanII4.0_train_data/BA_data/BA_allele.csv",header=None,names=["allele"])
bool = Data["Allele Name1"].map(lambda X : X in BA_allele["allele"].values)
Data = Data[bool] 
print(len(Data))
print("Data finish")
Data = Data.reset_index(drop = True)
peptide = np.empty((len(Data),21,21))
for i in range(len(Data)):
    peptide[i] = Data["pep_blosum"][i].reshape((21,21))
#mhc
MHC = np.empty((len(Data),34,21))
for i in range(len(Data)):
    MHC[i] = Data["MHC_blosum"][i].reshape((34,21))

label = Data['label'].values
Y = model.predict([peptide,MHC]).flatten()
Data["prediction"] = Y

allele = Data["Allele Name1"].unique()
BA_result = Data
BA_result.to_csv("/public/slst/home/wanggsh/Data/MHCII/Benchmark/BA_model/BA_Automated_prediction_result.csv")

NetMHCIIpan:

#python
import pandas as pd
import os
import sys

result_file_path = "/public/slst/home/wanggsh/Data/MHCII/Benchmark"
EL = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII/Benchmark/Automated_benchmark_filter.csv")
EL.index = range(len(EL))
allele = EL["Allele Name"].unique()
PD = pd.DataFrame()
comd = "python /public/slst/home/wanggsh/biosoft/mhc_ii/mhc_II_binding.py netmhciipan_ba"
file_path = "{0}/pep.csv".format(result_file_path)

for i in allele:
    aEL = EL[EL["Allele Name"] == i]
    length = aEL["length"].unique()
    for l in length:
        lEL = aEL[aEL["length"] == l]
        lEL["Description"].to_csv("{0}".format(file_path),header = False,index = False)
        os.system("{0} {1} {2} {3} > {4}/result.txt".format(comd,i,file_path,l,result_file_path))
        result = pd.read_table("{0}/result.txt".format(result_file_path))
        PD = pd.concat([result,PD])
    PD.to_csv("{0}/netMHCIIpan_Auto_filter_ba_result.csv".format(result_file_path))
print("finish")

mixMHC2pred:

#python
import pandas as pd
import os
import sys
import re

result_file_path = "/public/slst/home/wanggsh/Data/MHCII/Benchmark"
EL = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII/Benchmark/Automated_benchmark_filter.csv")
def allele_change(x):
    if len(x) < 10:
        x = x[0:7]+"_"+x[7:]
    else :
        x = re.sub("HLA-","",x)
        x = x[0:4]+"_"+x[4:6]+"_"+x[6:13]+"_"+x[13:15]+"_"+x[15:]
        x = re.sub("-","__",x)
    return x
EL["new_allele"] = EL["Allele Name1"].map(allele_change)
EL.to_csv("/public/slst/home/wanggsh/Data/MHCII/Benchmark/mixMHC2pred_Automated.csv")
allele = EL["new_allele"].unique()
PD = pd.DataFrame()
comd = "/public/slst/home/wanggsh/biosoft/MixMHC2pred-master/MixMHC2pred_unix --no_context"
file_path = "{0}/pep.csv".format(result_file_path)
result = "{0}/result.txt".format(result_file_path)

for i in allele:
    aEL = EL[EL["new_allele"] == i]
    aEL["Description"].to_csv("{0}".format(file_path),header = False,index = False)
    os.system("{0} -i {1} -o {2} -a {3}".format(comd,file_path,result,i))
    result_file = pd.read_table("{0}".format(result),comment="#")
    PD = pd.concat([result_file,PD])
    
PD.to_csv("{0}/mixMHC2pred_Auto_filter_ba_result.csv".format(result_file_path))
print("finish")

Plot ROC figure

The area under the curve (AUC) of the receiver operating characteristic (ROC) is used as the evaluation metric for model comparison:

#python
Automate = pd.read_csv("../data/Automated_benchmark.csv")
BA_result = pd.read_csv("../data/BA_Automated_prediction_result.csv")
BA_fpr,BA_tpr,_ = roc_curve(BA_result["label"],BA_result["prediction"],pos_label=1)
BA_roc = auc(BA_fpr,BA_tpr)
netMHC_result = pd.read_csv("../data/netMHCIIpan_Auto_filter_ba_result.csv")
netMHC_result = netMHC_result[["allele","percentile_rank","peptide","ic50"]]
Automate = Automate[["allele","allele1","peptide_sequence","measurement_value"]]
netMHC_result = pd.merge(netMHC_result,Automate,left_on=["allele","peptide"],right_on=["allele","peptide_sequence"]).drop_duplicates()
BA_allele = pd.read_csv("../data/BA_allele.csv",header=None,names=["allele"])
bool = netMHC_result["allele1"].map(lambda x : x in BA_allele["allele"].values)
netMHC_result = netMHC_result[bool]
net_fpr,net_tpr,_ = roc_curve(netMHC_result["measurement_value"],1-netMHC_result["percentile_rank"],pos_label=1)
net_roc = auc(net_fpr,net_tpr)
mix_result = pd.read_csv("../data/mixMHC2pred_Auto_filter_ba_result.csv")
mix_data = pd.read_csv("../data/mixMHC2pred_Automated.csv")
mix_result = pd.merge(mix_result[["Peptide","BestAllele","%Rank_best"]],mix_data[["Description","new_allele","label","Allele Name1"]],left_on =["Peptide","BestAllele"],right_on=["Description","new_allele"]).drop_duplicates()
mix_result = mix_result[mix_result["Allele Name1"].isin(BA_allele["allele"])]
mix_fpr,mix_tpr,_ = roc_curve(mix_result["label"],1-mix_result["%Rank_best"],pos_label=1)
mix_roc = auc(mix_fpr,mix_tpr)
import matplotlib as mpl
cmap = plt.get_cmap('Set1')
colors = [mpl.colors.rgb2hex(cmap(i)[:3]) for i in range(3)]
fig,ax = plt.subplots(figsize = (8,8))
lw = 2
plt.plot(BA_fpr,BA_tpr,lw=lw,color = colors[0],label="BA_model (AUC = %0.4f)" % BA_roc,)
#> [<matplotlib.lines.Line2D object at 0x28ee48910>]
plt.plot(net_fpr,net_tpr,lw=lw,color = colors[1],label="NetMHCIIpan_BA (AUC = %0.4f)" % net_roc,)
#> [<matplotlib.lines.Line2D object at 0x28ee48640>]
plt.plot(mix_fpr,mix_tpr,lw=lw,color = colors[2],label="mixMHC2pred (AUC = %0.4f)" % mix_roc,)
#> [<matplotlib.lines.Line2D object at 0x17e91c910>]
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
#> [<matplotlib.lines.Line2D object at 0x17e91cb80>]
plt.xlim([0.0, 1.0])
#> (0.0, 1.0)
plt.ylim([0.0, 1.05])
#> (0.0, 1.05)
plt.xlabel("False Positive Rate",fontsize = 14,fontname='Arial')
#> Text(0.5, 0, 'False Positive Rate')
plt.ylabel("True Positive Rate",fontsize = 14)
#> Text(0, 0.5, 'True Positive Rate')
plt.title("")
#> Text(0.5, 1.0, '')
plt.legend(loc="lower right",fontsize = 12)
#> <matplotlib.legend.Legend object at 0x28d2be3a0>
plt.savefig("../figure/BA_model_benchmark.pdf",dpi = 300,transparent=True)
plt.show()

The BA model (AUC: 0.7742) outperforms MixMHC2pred (AUC: 0.6482) and performs comparably to NetMHCIIpan (AUC: 0.7965).

Next we plot the confusion matrices:

from collections import Counter
fig,ax = plt.subplots(figsize = (8,8))
sns.countplot(data = Automate,x = "measurement_value")
#> <AxesSubplot:xlabel='measurement_value', ylabel='count'>
ax.text(-0.1, 10950, "{:.3f}".format(Counter(Automate["measurement_value"])[0]/len(Automate)))
#> Text(-0.1, 10950, '0.500')
ax.text(0.9, 10950, "{:.3f}".format(Counter(Automate["measurement_value"])[1]/len(Automate)))
#> Text(0.9, 10950, '0.500')
plt.ylabel("Number")
#> Text(0, 0.5, 'Number')
plt.xlabel("label")
#> Text(0.5, 0, 'label')
plt.title("The distribution of different label in Automate")
#> Text(0.5, 1.0, 'The distribution of different label in Automate')
plt.savefig("../figure/Automate_label_dis.pdf",dpi = 300,transparent=True)
plt.show()

#python
BA_result["pre_label"] = BA_result["prediction"].map(lambda x : 1 if x>0.426 else 0)
BA_CM = confusion_matrix(BA_result["label"],BA_result["pre_label"])
BA_CM = BA_CM/len(BA_result)
netMHC_result["pre_label"] = netMHC_result["percentile_rank"].map(lambda x : 1 if x <2 else 0)
netMHC_CM = confusion_matrix(netMHC_result["measurement_value"],netMHC_result["pre_label"])
netMHC_CM = netMHC_CM/len(netMHC_result)
mix_result["pre_label"] = mix_result["%Rank_best"].map(lambda x : 1 if x <2 else 0)
mix_CM = confusion_matrix(mix_result["label"],mix_result["pre_label"])
mix_CM = mix_CM/len(mix_result)
import math

def call_metrics(CM):
    tn = CM.ravel()[0]
    fp = CM.ravel()[1]
    fn = CM.ravel()[2]
    tp = CM.ravel()[3]
    se = tp/(tp+fn)
    sp = tn/(tn+fp)
    ppv = tp/(tp+fp)
    npv = tn/(tn+fn)
    f_score= 2*tp/(2*tp+fp+fn)
    DOP =  math.sqrt((se-1)**2+(sp-1)**2+(ppv-1)**2+(npv-1)**2)
    
    return se,sp,ppv,npv,f_score,DOP

call_metrics(BA_CM)
#> (0.6652916912197997, 0.7387634040611453, 0.747352162400706, 0.6552003237555645, 0.7039384807232673, 0.6024881389754401)
call_metrics(netMHC_CM)
#> (0.0754679802955665, 0.993837023510614, 0.9341463414634146, 0.4813177094848552, 0.1396536007292616, 1.0621513360038004)
call_metrics(mix_CM)
#> (0.05004926108374384, 0.9917826980141521, 0.8758620689655172, 0.47403447523456244, 0.09468779123951537, 1.0929427659227045)
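# In call_metrics: se = sensitivity, sp = specificity, ppv/npv = positive/negative
# predictive value, f_score = F1; DOP is the Euclidean distance of (se, sp, ppv, npv)
# from the perfect point (1, 1, 1, 1), so a smaller DOP indicates better overall performance.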
fig,ax = plt.subplots(figsize = (8,6))
sns.heatmap(BA_CM, annot=True,cmap="YlGnBu",vmin=0, vmax=0.6)
#> <AxesSubplot:>
plt.ylabel("True label")
#> Text(70.72222222222221, 0.5, 'True label')
plt.xlabel("Prediction label")
#> Text(0.5, 36.72222222222221, 'Prediction label')
plt.title("BA model confusion matrix(threshold:0.426)")
#> Text(0.5, 1.0, 'BA model confusion matrix(threshold:0.426)')
plt.savefig("../figure/BA_CM.pdf",dpi = 300,transparent=True)
plt.show()

#python
fig,ax = plt.subplots(figsize = (8,6))
sns.heatmap(netMHC_CM, annot=True,cmap="YlGnBu",vmin=0, vmax=0.6)
#> <AxesSubplot:>
plt.ylabel("True label")
#> Text(70.72222222222221, 0.5, 'True label')
plt.xlabel("Prediction label")
#> Text(0.5, 36.72222222222221, 'Prediction label')
plt.title("NetMHCIIpan_BA confusion matrix(Rank:10%)")
#> Text(0.5, 1.0, 'NetMHCIIpan_BA confusion matrix(Rank:10%)')
plt.savefig("../figure/netMHCIIpan_CM.pdf",dpi = 300,transparent=True)
plt.show()

#python
fig,ax = plt.subplots(figsize = (8,6))
sns.heatmap(mix_CM,annot =True,cmap="YlGnBu",vmin=0, vmax=0.6)
#> <AxesSubplot:>
plt.ylabel("True label")
#> Text(70.72222222222221, 0.5, 'True label')
plt.xlabel("Prediction label")
#> Text(0.5, 36.72222222222221, 'Prediction label')
plt.title("MixMHC2pred confusion matrix(Rank:10%)")
#> Text(0.5, 1.0, 'MixMHC2pred confusion matrix(Rank:10%)')
plt.savefig("../figure/mixMHC2pred_CM.pdf",dpi = 300,transparent=True)
plt.show()

Using the IEDB-recommended cutoff of 2% rank and our 0.426 cutoff (500 nM, log50k-transformed), we found that NetMHCIIpan and MixMHC2pred tend to predict negatives, while our BA model correctly distinguishes negatives from positives.


TLimmuno2: MHC II immunogenicity prediction network based on transfer learning

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42
plt.rcParams.update({'font.family':'Arial'})
import seaborn as sns
import numpy as np
from sklearn.metrics import roc_curve,auc,confusion_matrix,accuracy_score,precision_score,recall_score,precision_recall_curve
path = "~/Project/Documenet/TLIMM2pred/data/"
import random

This part focuses on TLimmuno2 model construction. It contains the following sections:

Data preprocessing

Immunogenicity molecular assays were downloaded from the Immune Epitope Database (IEDB) on May 15, 2022. The following keywords were used: Linear peptide; T cell; Include positive and negative; MHC II; Human and any disease. We then applied strict criteria for data cleaning. First, instances without an explicit 4-digit MHC allele were discarded. Second, peptide lengths were restricted to 13-21 mers. Third, low-confidence data (no information on the number of subjects tested/responded) were removed. Finally, we chose five experiment types, "51 chromium", "ELISA", "ELISPOT", "ICS" and "multimer/tetramer", to train the model, because we consider these more reliable; the remaining experiment types were used for model benchmarking. In the end, we obtained 3930 positive and 2478 negative instances for training, and 105 positive and 4 negative instances for benchmarking.

#R
IEDB_immuno_MHCII <- read_csv("../data/IEDB_MHCII.csv")
pseduo_seq <- read_table("../data/pseudosequence.2016.all.X.dat",col_names = c("allele","sequence"))
colnames(IEDB_immuno_MHCII) <- IEDB_immuno_MHCII[1,]
IEDB_immuno_MHCII <- IEDB_immuno_MHCII[-1,]
IEDB_immuno_MHCII_filter <- IEDB_immuno_MHCII[,c(12,85,88,102,91,92,93)]  #4:pubmed ID
IEDB_immuno_MHCII_filter$`Qualitative Measure` <- str_replace(IEDB_immuno_MHCII_filter$`Qualitative Measure`,"Positive-High","Positive")
IEDB_immuno_MHCII_filter$`Qualitative Measure` <- str_replace(IEDB_immuno_MHCII_filter$`Qualitative Measure`,"Positive-Intermediate","Positive")
IEDB_immuno_MHCII_filter$`Qualitative Measure` <- str_replace(IEDB_immuno_MHCII_filter$`Qualitative Measure`,"Positive-Low","Positive")
IEDB_immuno_MHCII_filter <- IEDB_immuno_MHCII_filter[!(duplicated(IEDB_immuno_MHCII_filter)),]
IEDB_immuno_MHCII_filter$length <- sapply(IEDB_immuno_MHCII_filter$Description, nchar)
#add method ratio
IEDB_immuno_assay <- IEDB_immuno_MHCII_filter %>% filter(nchar(`Allele Name`)>=14) %>% filter(length>=13&length<=21) %>% group_by(`Method/Technique`) %>% summarise(num = n())
ggplot(data = IEDB_immuno_assay)+
  geom_col(aes(x = reorder(`Method/Technique`,-num),y=num))+xlab("Experiment Type")+ylab("Number") + theme(axis.text.x = element_text(angle=90, hjust=1, vjust=.5))

ggsave("../figure/IEDB_assay_type.pdf")

IEDB_immuno_MHCII_filter1 <- IEDB_immuno_MHCII_filter %>% filter(nchar(`Allele Name`)>=14) %>% filter(length>=13&length<=21) %>% 
  filter(`Method/Technique`=="51 chromium"|`Method/Technique`=="ELISA"|`Method/Technique`=="ELISPOT"|`Method/Technique`=="ICS"|`Method/Technique`=="multimer/tetramer") %>% filter(str_detect(`Allele Name`,"HLA"))
fun <- function(x){
  if (str_detect(x,"\\/")){
    x = str_remove_all(x,"\\*")
    x = str_remove_all(x,"\\:")
    x = str_replace(x,"\\/","-")
  }
  else{
    x = str_replace(x,"HLA-","")
    x = str_replace(x,"\\*","_")
    x = str_replace(x,"\\:","")
  }
  return(x)
}
IEDB_immuno_MHCII_filter1$`Allele Name1` <- sapply(IEDB_immuno_MHCII_filter1$`Allele Name`,fun)
IEDB_immuno_MHCII_filter1 <- left_join(IEDB_immuno_MHCII_filter1,pseduo_seq, by = c("Allele Name1" = "allele"))
IEDB_immuno_MHCII_filter_final <-  na.omit(IEDB_immuno_MHCII_filter1)  #with response
IEDB_immuno_MHCII_filter_final <- IEDB_immuno_MHCII_filter_final[!(duplicated(IEDB_immuno_MHCII_filter_final[,c(1,4)])),]

MHC_immuno_allele <- unique((IEDB_immuno_MHCII_filter_final$`Allele Name1`))
write_lines(MHC_immuno_allele,"../data/IEDB_MHCII_immuno_allele.csv")
write_csv(IEDB_immuno_MHCII_filter_final,"../data/IEDB_MHCII_immuno.csv")
# plt fig
ggplot(data = IEDB_immuno_MHCII_filter_final)+
  geom_histogram(aes(x = `Method/Technique`),stat = "count")+xlab("Experiment Type")+ylab("Number") + labs(title = "IEDB")+ theme(plot.title = element_text(hjust = 0.5))

ggplot(data = IEDB_immuno_MHCII_filter_final)+
  geom_histogram(aes(x = `Qualitative Measure`),stat = "count")+xlab("Label")+ylab("Number") + labs(title = "IEDB")+ theme(plot.title = element_text(hjust = 0.5))

Data for benchmark:

NEW!! This part was deleted in the paper.

#R
IEDB_immuno_MHCII_filter2 <- IEDB_immuno_MHCII_filter %>% filter(nchar(`Allele Name`)>=14) %>% filter(length>=13&length<=21) %>% 
  filter(!(`Method/Technique`=="51 chromium"|`Method/Technique`=="ELISA"|`Method/Technique`=="ELISPOT"|`Method/Technique`=="ICS"|`Method/Technique`=="multimer/tetramer")) %>% 
  filter(str_detect(`Allele Name`,"HLA"))

IEDB_immuno_MHCII_filter2$`Allele Name1` <- sapply(IEDB_immuno_MHCII_filter2$`Allele Name`,fun)
IEDB_immuno_MHCII_filter2 <- left_join(IEDB_immuno_MHCII_filter2,pseduo_seq, by = c("Allele Name1" = "allele"))
IEDB_immuno_MHCII_benchmark <-  na.omit(IEDB_immuno_MHCII_filter2) # only response
IEDB_immuno_MHCII_benchmark <- IEDB_immuno_MHCII_benchmark[!(duplicated(IEDB_immuno_MHCII_benchmark[,c(1,4)])),]
bool1 <- IEDB_immuno_MHCII_benchmark$Description %in% IEDB_immuno_MHCII_filter_final$Description
bool2 <- IEDB_immuno_MHCII_benchmark$`Allele Name` %in% IEDB_immuno_MHCII_filter_final$`Allele Name`
IEDB_immuno_MHCII_benchmark_final <- IEDB_immuno_MHCII_benchmark[!(bool1&bool2),]
#write_csv(IEDB_immuno_MHCII_benchmark_final,"../data/IEDB_benchmark.csv")

We also analyzed the distribution of HLA types in the immunogenicity data and found it similar to that of the binding affinity data; 15-mer peptides account for the largest proportion.

IMM_data = pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_train/IEDB_MHCII_immuno.csv")
sns.barplot(x = IMM_data["Qualitative Measure"].value_counts().index,y = IMM_data["Qualitative Measure"].value_counts().values,palette =["green","red"])
#> <AxesSubplot:>
plt.xlabel("Label")
#> Text(0.5, 0, 'Label')
plt.ylabel("Number")
#> Text(0, 0.5, 'Number')
plt.savefig("../figure/IMM_label.pdf",dpi = 300,transparent=True)
allele_pos = IMM_data[IMM_data["Qualitative Measure"] == "Positive"]
x = pd.DataFrame(allele_pos["Allele Name1"].value_counts())
x["frac"] = x["Allele Name1"].map(lambda y: y/sum(x["Allele Name1"].values))
a = list(x.index[0:15])
b = list(x["frac"].values[0:15])
a.append("Others")
b.append(sum(x["frac"].values[15:]))
fig,ax = plt.subplots()
plt.bar(a,b,color = "green")
#> <BarContainer object of 16 artists>
plt.ylim([0,0.3])
#> (0.0, 0.3)
plt.yticks([0.00,0.05,0.10,0.15,0.20,0.25])
plt.xticks(rotation = 90)
plt.savefig("../figure/allele_pos.pdf",dpi = 300,bbox_inches='tight',transparent=True)
plt.show()

allele_neg = IMM_data[IMM_data["Qualitative Measure"] == "Negative"]
x = pd.DataFrame(allele_neg["Allele Name1"].value_counts())
x["frac"] = x["Allele Name1"].map(lambda y: y/sum(x["Allele Name1"].values))
a = list(x.index[0:15])
b = list(x["frac"].values[0:15])
a.append("Others")
b.append(sum(x["frac"].values[15:]))
fig,ax = plt.subplots(figsize = (8,6))
ax = plt.bar(a,b,color = "red")
plt.ylim([0,0.3])
#> (0.0, 0.3)
plt.yticks([0.00,0.05,0.10,0.15,0.20,0.25])
plt.xticks(rotation = 90)
plt.savefig("../figure/allele_neg.pdf",dpi = 300,transparent=True)
plt.show()

pos_len_value = list(allele_pos["length"].value_counts().values)
pos_len_index = allele_pos["length"].value_counts().index
norm_pos_len_value = allele_pos["length"].value_counts().values/sum(allele_pos["length"].value_counts())
neg_len_value = list(allele_neg["length"].value_counts().values)
neg_len_index = allele_neg["length"].value_counts().index
norm_neg_len_value = allele_neg["length"].value_counts().values/sum(allele_neg["length"].value_counts())
fig,ax = plt.subplots(figsize = (8,6))
width = 0.35
x = pos_len_index
ax.bar(pos_len_index - width/2, norm_pos_len_value,width,label='immunogenic',color = "green")
#> <BarContainer object of 9 artists>
ax.bar(neg_len_index + width/2, norm_neg_len_value, width, label='non-immunogenic',color = "red")
#> <BarContainer object of 9 artists>
ax.set_ylabel('proportion')
#> Text(0, 0.5, 'proportion')
ax.set_xlabel('length')
#> Text(0.5, 0, 'length')
ax.set_title('Distribution of Length')
#> Text(0.5, 1.0, 'Distribution of Length')
ax.set_xticks(x)
ax.legend()
#> <matplotlib.legend.Legend object at 0x17e802b50>
plt.savefig("../figure/length.pdf",dpi = 300,transparent=True)
plt.show()

Get random peptides as decoys

In our data, immunogenic (positive) instances outnumber non-immunogenic ones. This is because IEDB entries without response information (number of subjects tested/responded) are effectively negative peptides, and we removed them during data cleaning. To compensate, we used all human protein sequences downloaded from the universal protein knowledgebase (UniProt) database (2022.3) to generate random artificial negative peptides. For each length of each MHC allele, we generated 2x and 5x random peptides as negative instances to construct the IEDB-training dataset and the IEDB-benchmark dataset, respectively.

#python
f=open("../data/uniport_human.fasta")
ls=[]
for line in f:
        if not line.startswith('>'):
                ls.append(line.replace('\n',''))
f.close()
uniport = pd.DataFrame(ls)
uniport["length"] = uniport[0].map(len)
bool = uniport["length"]>30
uniport = uniport[bool]
bool = uniport[0].apply(lambda x: set(list(x)).issubset(set(list("ARNDCQEGHILKMFPSTWYVX"))))
uniport = uniport[bool]
uniport_random = uniport[0].values
def random_select(x,length,num,allele):
    df = []
    all = []
    for i in range(num):
        protein_num = random.randint(0,len(x)-1)
        protein = x[protein_num]
        peptide_num = random.randint(0,len(protein)-length)
        peptide = protein[peptide_num:peptide_num+length]
        df.append(peptide)
        all.append(allele)
    return df,all
IMM_data = pd.read_csv("../data/IEDB_MHCII_immuno.csv")
negative = []
allele = []
time = 2 # negative multiplier: 2 for IEDB-training, 5 for IEDB-benchmark
for t in range(time): # repeat the sampling `time` times; loop variable unused
    for i in IMM_data["Allele Name1"].unique():
        smal_IMM = IMM_data[IMM_data["Allele Name1"] == i]["length"].value_counts()
        for l in range(len(smal_IMM)):
            length = int(smal_IMM.index[l])
            num = int(smal_IMM.values[l])
            pep,all = random_select(uniport_random,length,num,i)
            negative.extend(pep)
            allele.extend(all)

IMM_data = IMM_data[["Description","Qualitative Measure","Allele Name1"]]
random_data = pd.DataFrame({"Description":negative,"Allele Name1":allele})
random_data["Qualitative Measure"] = "Negative"
final_data = pd.concat([IMM_data,random_data],axis=0)
final_data["label"] = final_data["Qualitative Measure"].map(lambda x : 1 if x == "Positive" else 0)
final_data = final_data.drop_duplicates()
pseudo_data = pd.read_table("../data/pseudosequence.2016.all.X.dat",header=None,names=["Allele Name1","sequence"])
pseudo_data = pseudo_data.drop_duplicates()
Data = pd.merge(final_data,pseudo_data).drop_duplicates()
#Data.to_csv("../data/IMM_data_random{}.csv".format(time))

Model training

We employed transfer learning to incorporate binding affinity information by extracting the penultimate layer of the BA model. We also used LSTM layers to extract antigen and MHC information, as we found this significantly improved the model. We concatenated the three numeric vectors into a single layer, added a dense layer with 400 neurons activated by ReLU, a dense layer with 200 neurons activated by ReLU, and a final single-neuron layer with sigmoid activation. We named the final MHC II immunogenicity prediction network based on transfer learning "TLimmuno2".

#PBS
#python
import pandas as pd
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import *
from sklearn.model_selection import train_test_split,KFold
import numpy as np
from collections import Counter
from sklearn.metrics import auc,precision_recall_curve,roc_curve,confusion_matrix,average_precision_score
from collections import Counter
import matplotlib.pyplot as plt
import sys
sys.path.append("/public/slst/home/wanggsh/Project/MHCII/utils")
from Blosum62 import blosum62
#Data process
immuno_data = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/IMM_data_random2.csv")
immuno_data["pep_blosum"] = immuno_data["Description"].apply(blosum62,args=(21,))
immuno_data["MHC_blosum"] = immuno_data["sequence"].apply(blosum62,args=(34,))
peptide = np.empty((len(immuno_data),21,21))
for i in range(len(immuno_data)):
    peptide[i] = immuno_data["pep_blosum"][i].reshape((21,21))
MHC = np.empty((len(immuno_data),34,21))
for i in range(len(immuno_data)):
    MHC[i] = immuno_data["MHC_blosum"][i].reshape((34,21))
labels = immuno_data["label"].values
BAmodel = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
BAmodel = tf.keras.models.Model(inputs = BAmodel.input,outputs = BAmodel.layers[-2].output) # truncate at the penultimate (10-neuron) layer
BA = BAmodel.predict([peptide,MHC]) # transferred binding affinity features
#class weight
neg, pos = np.bincount(immuno_data["label"])
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))
weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)
class_weight = {0: weight_for_0, 1: weight_for_1}
print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))
#
pep_Train,pep_Test,MHC_Train,MHC_Test,label_Train,label_Test,BA_Train,BA_Test = train_test_split(peptide,MHC,labels,BA,test_size=0.1,random_state=202205,stratify=labels)
print(Counter(label_Train))
print(Counter(label_Test))
print("Data progress finish!!")

def LSTMmodel():
    pep = tf.keras.Input(shape = (21,21))
    MHC = tf.keras.Input(shape = (34,21))
    BA = tf.keras.Input(shape = (10,))

    x = tf.keras.layers.LSTM(256,return_sequences=True)(pep)
    x = tf.keras.layers.LSTM(128,return_sequences=True)(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.Model(inputs = pep,outputs = x)

    y = tf.keras.layers.LSTM(256,return_sequences=True)(MHC)
    y = tf.keras.layers.LSTM(128,return_sequences=True)(y)
    y = tf.keras.layers.Flatten()(y)
    y = tf.keras.Model(inputs = MHC,outputs = y)

    combined = tf.keras.layers.concatenate([x.output,y.output,BA])

    z = tf.keras.layers.Dense(400,activation="relu")(combined)
    #z = tf.keras.layers.Dropout(0.8)(z)
    z = tf.keras.layers.Dense(200,activation = "relu")(z) #,kernel_regularizer=tf.keras.regularizers.l2(0.0001)
    #z = tf.keras.layers.Dropout(0.8)(z)
    z = tf.keras.layers.Dense(1,activation = "sigmoid")(z)

    model = tf.keras.Model(inputs = [pep,MHC,BA],outputs = z)
    return model
#learning rate
def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        curLR = optimizer._decayed_lr(tf.float32)
        return curLR 
    return lr

callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)#change
Loss = keras.losses.BinaryCrossentropy()
Optimizer = keras.optimizers.Adam(learning_rate = 0.00008) #
lr_metric = get_lr_metric(Optimizer)
Metrics = [tf.keras.metrics.AUC(),tf.keras.metrics.BinaryAccuracy(),tf.keras.metrics.AUC(curve = "PR"),tf.keras.metrics.Precision()] 
Batch_size= 200
Epochs= 150
Verbose = 2 #reduce information output

#Cross validation
print("Cross Validation")
kf = KFold(n_splits=5,shuffle = True)
ROC = {}
PR = {}
i = 1
for train_index,val_index in kf.split(label_Train):
    pep_train,MHC_train,label_train,BA_train = pep_Train[train_index],MHC_Train[train_index],label_Train[train_index],BA_Train[train_index]
    pep_val,MHC_val,label_val,BA_val = pep_Train[val_index],MHC_Train[val_index],label_Train[val_index],BA_Train[val_index]
    print(Counter(label_train))
    print(Counter(label_val))
    model = LSTMmodel()
    model.compile(
        loss = Loss,
        optimizer = Optimizer,
        metrics = Metrics)
    #callback
    history = model.fit([pep_train,MHC_train,BA_train],label_train,
        batch_size=Batch_size,epochs=Epochs,
        validation_data = ([pep_val,MHC_val,BA_val],label_val),
        verbose = Verbose,
        class_weight=class_weight,callbacks = callback)
    His_df = pd.DataFrame(history.history)
    His_df.to_csv("/public/slst/home/wanggsh/Project/Saved_model/History/immuno/corss_history{0}.csv".format(i))
    prediction = model.predict([pep_val,MHC_val,BA_val])
    fpr,tpr,thresholds = roc_curve(label_val,prediction,pos_label=1)
    area_mine = auc(fpr,tpr)
    ROC["flod{}".format(i)] = [fpr,tpr,thresholds,area_mine]
    print("=============={} auc data finish==========".format(i))
    precision,recall,thresholds = precision_recall_curve(label_val,prediction,pos_label=1)
    area_PR = auc(recall,precision)
    PR["flod{}".format(i)] = [recall,precision,thresholds,area_PR]
    print("=============={} pr data finish==========".format(i))
    i = i+1
#----plot cross fig
ROC_df = pd.DataFrame(ROC)
fig = plt.figure()
lw = 2
ax = fig.add_subplot(111)
ax.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
for i in range(5):
    ax.plot(ROC_df.iloc[0,i],ROC_df.iloc[1,i],label='fold{0} (area = {1:.4f})'.format(i,ROC_df.iloc[3,i]))
ax.set_ylim([0,1.05])
ax.set_xlim([0,1])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver operating characteristic example')
ax.legend(loc="lower right")
plt.savefig("/public/slst/home/wanggsh/Project/Saved_model/Model_fig/Immuno_model/ROC_cross.pdf",dpi = 300)
PR_df = pd.DataFrame(PR)
plt.figure()
lw = 2
for i in range(5):
    plt.plot(PR_df.iloc[0,i],PR_df.iloc[1,i],lw=lw, label = 'fold{0} (area = {1:.4f})'.format(i,PR_df.iloc[3,i]))
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR curve')
plt.legend(loc="lower right")
plt.savefig("/public/slst/home/wanggsh/Project/Saved_model/Model_fig/Immuno_model/PR_cross.pdf",dpi = 300) # save before show, otherwise an empty figure is written
plt.show()
print("-"*20)
print("start model training")

model = LSTMmodel()

model.compile(
        loss = Loss,
        optimizer = Optimizer,
        metrics = Metrics)
#callback
history = model.fit([pep_Train,MHC_Train,BA_Train],label_Train,
        batch_size=Batch_size,epochs=Epochs,
        validation_data = ([pep_Test,MHC_Test,BA_Test],label_Test),
        verbose = Verbose,
        callbacks = callback) #class_weight=class_weight
model.save(r"/public/slst/home/wanggsh/Project/Saved_model/MHCII_immuno/model")

print("start plt fig")
His_df = pd.DataFrame(history.history)
His_df.to_csv("/public/slst/home/wanggsh/Project/Saved_model/History/immuno/history.csv")

prediction = model.predict([pep_Test,MHC_Test,BA_Test])
Train_pre = model.predict([pep_Train,MHC_Train,BA_Train])
fpr,tpr,thresholds = roc_curve(label_Test,prediction,pos_label=1)
train_fpr,train_tpr,_ = roc_curve(label_Train,Train_pre,pos_label=1)
area_mine = auc(fpr,tpr)
area_mine_train = auc(train_fpr,train_tpr)
print(area_mine)
print(area_mine_train)

precision,recall,thresholds = precision_recall_curve(label_Test,prediction,pos_label=1)
precision_train,recall_train,_ = precision_recall_curve(label_Train,Train_pre,pos_label=1)
area_PR = auc(recall,precision)
area_PR_train = auc(recall_train,precision_train)

fig = plt.figure()
lw = 2
ax = fig.add_subplot(111)
ax.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot(train_fpr,train_tpr,label='Train (AUC = {0:.4f})'.format(area_mine_train))
ax.plot(fpr,tpr,label='Test (AUC = {0:.4f})'.format(area_mine))
ax.set_ylim([0,1.05])
ax.set_xlim([0,1])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver operating characteristic')
ax.legend(loc="lower right")
plt.savefig("/public/slst/home/wanggsh/Project/Saved_model/Model_fig/Immuno_model/ROC.pdf",dpi = 300)

plt.figure()
lw = 2
plt.plot(recall_train,precision_train,lw=lw, label = 'Train (AUC = {0:.4f})'.format(area_PR_train))
plt.plot(recall,precision,lw=lw, label = 'Test (AUC = {0:.4f})'.format(area_PR))
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR curve')
plt.legend(loc="lower right")
plt.savefig("/public/slst/home/wanggsh/Project/Saved_model/Model_fig/Immuno_model/PR.pdf",dpi = 300) #save before show, otherwise an empty figure is written
plt.show()
print("Finish")

TLimmuno2 outputs a continuous score between 0 and 1 as well as a percentile rank. For each peptide-MHC II pair, the percentile rank is computed against a background of 90,000 13- to 21-mer peptides (10,000 for each length) randomly selected from the human proteome; a smaller rank indicates stronger immunogenicity.
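
As a minimal illustration of this rank computation (the function name and random background here are made up for the example; the real pipeline below pools the query with the allele-specific background predictions), the percentile rank of one query score can be computed as:

#python
import numpy as np

def percentile_rank(score, bg_scores):
    #pool the query score with the background scores (90,001 values in total)
    pool = np.append(bg_scores, score)
    #position of the query score in the ascending sorted pool
    idx = np.searchsorted(np.sort(pool), score)
    #fraction of pooled scores that are larger; smaller rank = stronger immunogenicity
    return 1 - (idx + 1) / len(pool)

rng = np.random.default_rng(0)
bg = rng.random(90000) #stand-in for one allele's 90,000 background predictions
print(percentile_rank(0.99, bg)) #close to 0 -> strong predicted immunogenicity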

#PBS
#python

import pandas as pd
import numpy as np
import tensorflow as tf
import sys
sys.path.append("/public/slst/home/wanggsh/Project/MHCII/utils")
from Blosum62 import blosum62
import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1" 
#load_model

BA_model = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
BAmodel = tf.keras.models.Model(inputs = BA_model.input,outputs = BA_model.layers[-2].output)
IMM = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/MHCII_immuno/model")

Data = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/kmer_data.csv")
Data["pep_blosum"] = Data["pep"].apply(blosum62,args=(21,))
Data["MHC_blosum"] = Data["sequence"].apply(blosum62,args=(34,))
peptide = np.empty((len(Data),21,21))
for i in range(len(Data)):
    peptide[i] = Data["pep_blosum"][i].reshape((21,21))
#mhc
MHC = np.empty((len(Data),34,21))
for i in range(len(Data)):
    MHC[i] = Data["MHC_blosum"][i].reshape((34,21))

BA = BAmodel.predict([peptide,MHC])
IMM_result = IMM.predict([peptide,MHC,BA])
Data["prediction"] = IMM_result

pseudo_seq = pd.read_feather("/public/slst/home/wanggsh/Data/pseudo_blosum62.feather")

#background pep
IMM_bg_pep = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/IMM_bg_pep.csv")
IMM_bg_pep["pep_blosum"] = IMM_bg_pep["pep"].apply(blosum62,args=(21,))
DF = pd.DataFrame()
for i in Data["HLA"].unique():
    IMM_bg_pep["MHC"] = i
    x = pd.merge(IMM_bg_pep,pseudo_seq[["MHC","MHC_blosum"]])
    peptide = np.empty((len(x),21,21))
    for z in range(len(x)):
        peptide[z] = x["pep_blosum"][z].reshape((21,21))
    MHC = np.empty((len(x),34,21))
    for z in range(len(x)):
        MHC[z] = x["MHC_blosum"][z].reshape((34,21))
    BA = BAmodel.predict([peptide,MHC])
    IMM_result = IMM.predict([peptide,MHC,BA])
    IMM_result = IMM_result.tolist()
    y = Data[Data["HLA"]== i]
    Rank = []
    for I in y["prediction"].values:
        IMM_result.append(I)
        #rank = fraction of the 90,001 pooled scores that are larger than the query
        rank = 1-(sorted(IMM_result).index(IMM_result[-1])+1)/90001
        Rank.append(rank)
        IMM_result.pop()
    y["Rank"] = Rank
    DF = pd.concat([DF,y])
DF.pop("pep_blosum")
DF.pop("MHC_blosum")
DF.to_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/kmer_rank_result.csv")
print("finish") 

Model performance

We first divided the dataset into training and test sets at a ratio of 9:1 for model training and testing. To verify the stability of the model, 5-fold cross-validation was applied during the training of TLimmuno2, and TLimmuno2 shows a highly stable AUC (average 0.8727) across the validation folds. Overall, TLimmuno2 reaches AUC = 0.9909 on the training dataset and AUC = 0.8648 on the test dataset.

cross ROC

cross PR

ROC

PR

Model test dataset

data = pd.read_csv("../data/model_test.csv")
data["len"] = data["peptide"].map(len)
data2 = data[data["len"] >= 15]
data2["peptide"].to_csv("../data/model_test_pep.csv",index = None,header = None)
oridata = pd.read_csv("../data/IEDB_MHCII_immuno.csv")
data1 = pd.merge(data,oridata[["Allele Name","Allele Name1"]],left_on = "MHC",right_on = "Allele Name1").drop_duplicates()
data1.to_csv("../data/model_test_netMHCIIpan.csv",index = None)
#pbs
#python
import pandas as pd
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import *
from sklearn.model_selection import train_test_split,KFold
import numpy as np
from collections import Counter
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
from sklearn.metrics import auc,precision_recall_curve,roc_curve,confusion_matrix,average_precision_score
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42
plt.rcParams.update({'font.family':'Arial'})
import sys
sys.path.append("/public/slst/home/wanggsh/Project/MHCII/utils")
from Blosum62 import blosum62
#Data process
immuno_data = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/IMM_data_random2.csv")

immuno_data["pep_blosum"] = immuno_data["Description"].apply(blosum62,args=(21,))
immuno_data["MHC_blosum"] = immuno_data["sequence"].apply(blosum62,args=(34,))
peptide = np.empty((len(immuno_data),21,21))
for i in range(len(immuno_data)):
    peptide[i] = immuno_data["pep_blosum"][i].reshape((21,21))
#mhc 
MHC = np.empty((len(immuno_data),34,21))
for i in range(len(immuno_data)):
    MHC[i] = immuno_data["MHC_blosum"][i].reshape((34,21))


labels = immuno_data["label"].values

BAmodel = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
BAmodel = tf.keras.models.Model(inputs = BAmodel.input,outputs = BAmodel.layers[-2].output)
BA = BAmodel.predict([peptide,MHC])

#class weight
neg, pos = np.bincount(immuno_data["label"])
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)

class_weight = {0: weight_for_0, 1: weight_for_1}

print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))


pep_Train,pep_Test,MHC_Train,MHC_Test,label_Train,label_Test,BA_Train,BA_Test = train_test_split(peptide,MHC,labels,BA,test_size=0.1,random_state=202205,stratify=labels)
print(Counter(label_Train))
print(Counter(label_Test))
print("Data progress finish!!")




#load the trained model



model = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/MHCII_immuno/model")


prediction = model.predict([pep_Test,MHC_Test,BA_Test])
Train_pre = model.predict([pep_Train,MHC_Train,BA_Train])
fpr,tpr,thresholds = roc_curve(label_Test,prediction,pos_label=1)
train_fpr,train_tpr,_ = roc_curve(label_Train,Train_pre,pos_label=1)
area_mine = auc(fpr,tpr)
area_mine_train = auc(train_fpr,train_tpr)
print(area_mine)
print(area_mine_train)

precision,recall,thresholds = precision_recall_curve(label_Test,prediction,pos_label=1)
precision_train,recall_train,_ = precision_recall_curve(label_Train,Train_pre,pos_label=1)
area_PR = auc(recall,precision)
area_PR_train = auc(recall_train,precision_train)


fig,ax = plt.subplots(figsize = (8,8))
lw = 2
ax.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot(train_fpr,train_tpr,label='Train dataset (AUC = {0:.4f})'.format(area_mine_train))
ax.plot(fpr,tpr,label='Test dataset (AUC = {0:.4f})'.format(area_mine))
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate",fontsize = 14)
plt.ylabel("True Positive Rate",fontsize = 14)
plt.title("")
plt.legend(loc="lower right",fontsize = 12)
plt.savefig("/public/slst/home/wanggsh/Project/Saved_model/Model_fig/Immuno_model/ROC_1121.pdf",dpi = 300,transparent=True)

fig,ax = plt.subplots(figsize = (8,8))
lw = 2
ax.plot(recall_train,precision_train,lw=lw, label = 'Train dataset (AUC = {0:.4f})'.format(area_PR_train))
ax.plot(recall,precision,lw=lw, label = 'Test dataset (AUC = {0:.4f})'.format(area_PR))
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall',fontsize = 14)
plt.ylabel('Precision',fontsize = 14)
plt.title("")
plt.legend(loc="lower right",fontsize = 12)
plt.savefig("/public/slst/home/wanggsh/Project/Saved_model/Model_fig/Immuno_model/PR_1121.pdf",dpi = 300,transparent=True)

def auc_pr(data,true_label,prediction,rank = False):
    if rank:
        fpr,tpr,_ = roc_curve(data[true_label],1-data[prediction])
        precision,recall,_ = precision_recall_curve(data[true_label],1-data[prediction])
    else:
        fpr,tpr,_ = roc_curve(data[true_label],data[prediction])
        precision,recall,_ = precision_recall_curve(data[true_label],data[prediction])
        
    AUC = auc(fpr,tpr)
    PR = auc(recall,precision)
    
    return fpr,tpr,recall,precision,AUC,PR

data = pd.read_csv("/public/slst/home/wanggsh/Data/new/model_test.csv")
test_iedb = pd.read_csv("/public/slst/home/wanggsh/Data/new/model_test_iedb.csv")
test_iedb = pd.merge(data,test_iedb,left_on = "peptide",right_on = "Peptide")
test_iedb_result = auc_pr(test_iedb,"Label","Immunogenicity Score",rank = True)
test_repitope = pd.read_csv("/public/slst/home/wanggsh/Data/new/model_test_repitope.csv")
test_repitope = pd.merge(data,test_repitope,left_on = "peptide",right_on = "Peptide")
test_repitope_result = auc_pr(test_repitope,"Label","ImmunogenicityScore")
test_net_ba = pd.read_csv("/public/slst/home/wanggsh/Data/new/model_test_ba.csv")
test_net_ba = pd.merge(data,test_net_ba,left_on = "peptide",right_on = "peptide")
test_net_ba_result = auc_pr(test_net_ba,"Label","percentile_rank",rank = True)
test_net_el = pd.read_csv("/public/slst/home/wanggsh/Data/new/model_test_el.csv")
test_net_el = pd.merge(data,test_net_el,left_on = "peptide",right_on = "peptide")
test_net_el_result = auc_pr(test_net_el,"Label","percentile_rank",rank = True)

fig,ax = plt.subplots(figsize = (8,8))
lw = 2
ax.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot(fpr,tpr,label='TLimmuno2 : (AUC = {0:.4f})'.format(area_mine))
ax.plot(test_iedb_result[0],test_iedb_result[1],label='IEDB : (AUC={0:.4f})'.format(test_iedb_result[4]),linewidth = 2)
ax.plot(test_repitope_result[0],test_repitope_result[1],label='Repitope : (AUC={0:.4f})'.format(test_repitope_result[4]),linewidth = 2)
ax.plot(test_net_ba_result[0],test_net_ba_result[1],label='NetMHCIIpan_ba : (AUC={0:.4f})'.format(test_net_ba_result[4]),linewidth = 2)
ax.plot(test_net_el_result[0],test_net_el_result[1],label='NetMHCIIpan_el : (AUC={0:.4f})'.format(test_net_el_result[4]),linewidth = 2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate",fontsize = 14)
plt.ylabel("True Positive Rate",fontsize = 14)
plt.title("")
plt.legend(loc="lower right",fontsize = 12)
plt.savefig("/public/slst/home/wanggsh/Data/new/test_com_ROC_1121.pdf",dpi = 300,transparent=True)

fig,ax = plt.subplots(figsize = (8,8))
lw = 2
ax.plot(recall,precision,lw=lw, label = 'TLimmuno2 : (AUC = {0:.4f})'.format(area_PR))
ax.plot(test_iedb_result[2],test_iedb_result[3],label='IEDB : (AUC={0:.4f})'.format(test_iedb_result[5]),linewidth = 2)
ax.plot(test_repitope_result[2],test_repitope_result[3],label='Repitope : (AUC={0:.4f})'.format(test_repitope_result[5]),linewidth = 2)
ax.plot(test_net_ba_result[2],test_net_ba_result[3],label='NetMHCIIpan_ba : (AUC={0:.4f})'.format(test_net_ba_result[5]),linewidth = 2)
ax.plot(test_net_el_result[2],test_net_el_result[3],label='NetMHCIIpan_el : (AUC={0:.4f})'.format(test_net_el_result[5]),linewidth = 2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall',fontsize = 14)
plt.ylabel('Precision',fontsize = 14)
plt.title("")
plt.legend(loc="lower right",fontsize = 12)
plt.savefig("/public/slst/home/wanggsh/Data/new/test_com_PR_1121.pdf",dpi = 300,transparent=True)
print("Finish")

TLimmuno2 outperforms existing methods in peptide-MHC II immunogenicity prediction

This part focuses on another independent dataset to compare the performance of TLimmuno2. It contains the following sections:

Benchmark in immunogenicity

The IEDB-Benchmark dataset is from the same database as the training set, thus they may have similar distribution patterns, which may lead to systematic bias. To perform an unbiased performance evaluation in predicting immunogenic peptides, we retrieved published papers to collect neoepitopes that were tested experimentally for CD4+ T cell immunogenicity.

Data preprocessing

#python
peptide = pd.read_csv("../data/nature_peptide.csv")
HLA = pd.read_csv("../data/nature_HLA.csv")
peptide = peptide.drop_duplicates()
peptide["Immunogenicity"].value_counts()
#> 0    849
#> 1     80
#> Name: Immunogenicity, dtype: int64
peptide["Len"] = peptide["Peptide"].map(len)
peptide[peptide["Len"]>=15]["Peptide"].to_csv("../data/pep_only.txt",header = None,index = None,sep = " ")#chain indexing
IEDB_pep = peptide[peptide["Len"]>=15].reset_index(drop = True)
IEDB_pep["num"] = [x+1 for x in range(len(IEDB_pep))]#used to predict by IEDB tools
pep_ID = ["Pep{}".format(x+1) for x in range(len(peptide))]
peptide["pep_ID"] = pep_ID
fig,ax = plt.subplots()
sns.countplot(data = peptide,x = "Len")
plt.show()

#python
ax = sns.countplot(data=peptide, x="Immunogenicity")
#ax.bar_label(ax.containers[0])
ax.text(-0.1, 860, "{:.3f}".format(851/931))
ax.text(0.9, 90, "{:.3f}".format(80/931))
plt.savefig("../figure/nature_immuno_rate.pdf",dpi = 300,transparent=True)
plt.show()

Neoepitopes tested experimentally are usually sequences of 20–25 amino acids, so we used the 15-mer and k-mer methods (Methods) to assign a rank value to a given neoepitope.

For the 15-mer method, we split one neoepitope sequence into 15-mer subsequences and score or rank each subsequence. In the k-mer method, neoepitopes are cut into 13- to 21-mer subsequences. The prediction for the neoepitope is the maximum score or the minimum rank over the predictions for all of its subsequences, depending on the method. For a neoepitope with multiple MHC types, we used the maximum score or minimum rank over the different MHC types as the prediction result for the neoepitope, as sketched below.
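
A minimal sketch of this aggregation rule on toy values (column names mirror those used below; `.max()` applies to score-based methods, and `.min()` would be used instead for rank-based methods):

#python
import pandas as pd

toy = pd.DataFrame({
    "pep_ID": ["Pep1"]*4,
    "HLA": ["DRB1_0101","DRB1_0101","DRB1_0301","DRB1_0301"],
    "length": [15,16,15,16],
    "prediction": [0.2,0.7,0.4,0.5],
})
#best value per subsequence length, then per HLA, then across HLAs
per_len = toy.groupby(["pep_ID","HLA","length"])["prediction"].max()
per_hla = per_len.groupby(["pep_ID","HLA"]).max()
neo_pred = per_hla.groupby("pep_ID").max()
print(neo_pred) #Pep1 -> 0.7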

#python
Data = pd.merge(HLA,peptide)
IMM_train_data_allele = pd.read_csv("../data/IEDB_MHCII_immuno_allele.csv",header = None,names = ["HLA"])
Data = Data[Data["HLA"].isin(IMM_train_data_allele["HLA"])]
pseudo_seq = pd.read_table("../data/pseudosequence.2016.all.X.dat",header = None,names = ["HLA","sequence"])
Data1 = pd.merge(Data,pseudo_seq)
Data1["length"] = Data1["Peptide"].map(len)
Data1 = Data1[Data1["length"]>=13]
Data1 = Data1.reset_index(drop = True)
#15mer
def mer15(Pep):
    P = []
    Length = 15
    for i in range(len(Pep) - Length +1):
        pep = Pep[i:i+Length]
        P.append(pep)
    return P
DF15 = pd.DataFrame()
for i in range(len(Data1)):
    pep = mer15(Data1["Peptide"][i])
    df = pd.DataFrame(pep)
    df["pep_ID"] = Data1["pep_ID"][i]
    df["HLA"] = Data1["HLA"][i]
    DF15 = pd.concat([DF15,df])
DF15.columns = ["pep","pep_ID","HLA"]
benchmark_data = pd.merge(DF15,pseudo_seq)
benchmark_data.to_csv("../data/15mer_data.csv")
benchmark_data["pep"].to_csv("../data/15mer_pep_only.txt",header = None,index = None,sep = " ")
#kmer
def kmer(Pep,sa,end):
    P = []
    for Length in range(sa,end+1):
        for i in range(len(Pep) - Length +1):
            P.append(Pep[i:i+Length])
    return P
DF_k = pd.DataFrame()
for i in range(len(Data1)):
    pep = kmer(Data1["Peptide"][i],13,21)
    df = pd.DataFrame(pep)
    df["pep_ID"] = Data1["pep_ID"][i]
    df["HLA"] = Data1["HLA"][i]
    DF_k = pd.concat([DF_k,df])
DF_k.columns = ["pep","pep_ID","HLA"]
benchmark_data = pd.merge(DF_k,pseudo_seq)
benchmark_data.to_csv("../data/kmer_data.csv")

Different tools

We also compared the performance of TLimmuno2 with other immunogenicity prediction tools, including the IEDB tool (CD4episcore) and Repitope. Note that some published methods, such as Deepitope and FIONA, do not have publicly available tools and are thus not included in the subsequent comparison analysis: the FIONA website is not accessible, and we could not find a GitHub repository for Deepitope.

IEDB online tool(CD4episcore)

We ran CD4episcore on the IEDB website (IEDB online tool); note that this tool only supports 15-mer peptides.

Repitope

You can find Repitope on their Github. This tool is very slow and potentially problematic on larger datasets, so we only used 15-mer peptides.

#PBS
#R
library(tidyverse)
library(data.table)
library(Repitope)
options(java.parameters="-Xmx60G")

fragLibDT <- fst::read_fst("~/Project/MHCII/Repitope/FragmentLibrary_TCRSet_Public_RepitopeV3.fst", as.data.table=T)
featureDF_MHCII <- fst::read_fst("~/Project/MHCII/Repitope/FeatureDF_MHCII_Weighted.10000_RepitopeV3.fst", as.data.table=T)

data <- read_csv("~/Project/MHCII/Repitope/15mer_data.csv")
peptideSet_of_interest <- data$pep
peptideSet=peptideSet_of_interest
len = sapply(data$pep, nchar)
table(len)

res_MHCII <- EpitopePrioritization(
  featureDF=featureDF_MHCII[Peptide%in%MHCII_Human$Peptide,], 
  metadataDF=MHCII_Human[,.(Peptide,Immunogenicity)],
  peptideSet=peptideSet_of_interest,
  peptideLengthSet=11:30,
  fragLib=fragLibDT,
  aaIndexIDSet="all",
  fragLenSet=3:11,
  fragDepth=10000,
  fragLibType="Weighted",
  featureSet=MHCII_Human_MinimumFeatureSet,
  seedSet=1:1,
  coreN=1,
  outDir="~/Project/MHCII/Repitope/pre_result"  ## Intermediate and final output files will be stored under this directory
)

netMHCIIpan BA&EL

The code for this part is similar to that for the IEDB-benchmark dataset; only the input and output files change, so we do not repeat the code here.

TLimmuno2

The code for this part is similar to that for the IEDB-benchmark dataset; only the input and output files change, so we do not repeat the code here.

Result

#python
def IMM_process(Data,peptide):
    sam_pep_ID = Data["pep_ID"].unique()
    res = []
    p_ID = []
    Data["length"] = Data["pep"].map(len)
    for a in sam_pep_ID:
        Mean = []    
        ID = Data[Data["pep_ID"] == a]
        for i in ID["HLA"].unique():
            Data_HLA = ID[ID["HLA"] == i]
            prediction = []
            for l in Data_HLA["length"].unique():
                Data_len = Data_HLA[Data_HLA["length"] == l]
                pre = Data_len["prediction"].max()
                prediction.append(pre)
            x = np.max(prediction)
            Mean.append(x)
        res.append(max(Mean))
        p_ID.append(a)
    result = pd.DataFrame({"pep_ID":p_ID,"prediction":res})
    result = result.merge(peptide,how = "inner",on = "pep_ID" )
    return result
  
def auc_pr(data,true_label,prediction,rank = False):
    if rank:
        fpr,tpr,_ = roc_curve(data[true_label],1-data[prediction])
        precision,recall,_ = precision_recall_curve(data[true_label],1-data[prediction])
    else:
        fpr,tpr,_ = roc_curve(data[true_label],data[prediction])
        precision,recall,_ = precision_recall_curve(data[true_label],data[prediction])
        
    AUC = auc(fpr,tpr)
    PR = auc(recall,precision)
    
    return fpr,tpr,recall,precision,AUC,PR
result = pd.read_csv("../data/15mer_result.csv")
result_15mer = IMM_process(result,peptide)
mer15_result = auc_pr(result_15mer,"Immunogenicity","prediction")
print("AUC:{}".format(mer15_result[4]))
#> AUC:0.7332489451476794
print("PR_AUC:{}".format(mer15_result[5]))
#> PR_AUC:0.198391863815141
#python
result = pd.read_csv("../data/kmer_result.csv")
result_kmer = IMM_process(result,peptide)
result_kmer = result_kmer[result_kmer["pep_ID"].isin(result_15mer["pep_ID"])]
kmer_result = auc_pr(result_kmer,"Immunogenicity","prediction")
print("AUC:{}".format(kmer_result[4]))
#> AUC:0.7392635212888378
print("PR_AUC:{}".format(kmer_result[5]))
#> PR_AUC:0.20210218456922854
#python
IEDB_res = pd.read_csv("../data/IEDB_CD4_tools_manually.csv")
Imm_score = []
pro_num = []
for i in IEDB_res["Protein Number"].unique():
    y = IEDB_res[IEDB_res["Protein Number"] == i]
    imm_score = y["Immunogenicity Score"].mean()
    pro_num.append(i)
    Imm_score.append(imm_score)
IEDB_result = pd.DataFrame({"num":pro_num,"score":Imm_score})
IEDB_result = pd.merge(IEDB_result,IEDB_pep)
IEDB_result = IEDB_result[IEDB_result["Peptide"].isin(result_15mer["Peptide"])]
IEDB_score = auc_pr(IEDB_result,"Immunogenicity","score",rank = True)
print("AUC:{}".format(IEDB_score[4]))
#> AUC:0.608331415420023
print("PR_AUC:{}".format(IEDB_score[5]))
#> PR_AUC:0.13176952789635926
#python
Repitope_result = pd.read_csv("../data/repitope_15mer_result.csv")
Repitope = pd.merge(Repitope_result,DF15,left_on = "Peptide",right_on = "pep")
sam_pep_ID = Repitope["pep_ID"].unique()
res = []
p_ID = []
for a in sam_pep_ID:
    Mean = []
    ID = Repitope[Repitope["pep_ID"] == a]
    for i in ID["HLA"].unique():
        x = ID[ID["HLA"] == i]
        x_mean = x["ImmunogenicityScore"].mean()
        Mean.append(x_mean)
    res.append(max(Mean))
    p_ID.append(a)
dic = {"pep_ID":p_ID,"ImmunogenicityScore":res}
Repitope_pre = pd.DataFrame(dic)
Repitope_result = Repitope_pre.merge(peptide,how = "inner",on = "pep_ID" )
Repitope_score = auc_pr(Repitope_result,"Immunogenicity","ImmunogenicityScore")
print("AUC:{}".format(Repitope_score[4]))
#> AUC:0.48418872266973534
print("PR_AUC:{}".format(Repitope_score[5]))
#> PR_AUC:0.08742710133794984
#python
def netMHCIIpan_process(result,allele,ori_sorted,peptide):
    
    #combine two type allele
    result = pd.merge(result,allele,left_on="allele",right_on="Allele Name")
    result_sorted = result.sort_values(by=["peptide","HLA"]).reset_index(drop = True)
    #combine with ori data
    result_combined = pd.merge(result_sorted,ori_sorted[["pep","pep_ID"]],left_on = "peptide",right_on ="pep")
    #result_combined["pep_length"] = result_combined["peptide"].map(len)
    sam_pep_ID = result_combined["pep_ID"].unique()
    res = []
    p_ID = []
    for a in sam_pep_ID:
        Mean = []
        ID = result_combined[result_combined["pep_ID"] == a]
        #take the min rank over subsequence lengths for each allele, then the min over alleles
        for i in ID["Allele Name"].unique():
            Data_HLA = ID[ID["Allele Name"] == i]
            prediction = []
            for l in Data_HLA["pep_length"].unique():
                Data_len = Data_HLA[Data_HLA["pep_length"] == l]
                pre = Data_len["percentile_rank"].min()
                prediction.append(pre)
            Mean.append(np.min(prediction))
        res.append(np.min(Mean))
        p_ID.append(a) 
    final_result = pd.DataFrame({"pep_ID":p_ID,"prediction":res})
    final_result = final_result.merge(peptide,how = "inner",on = "pep_ID" )
    return final_result 
DF15_sorted = DF15.sort_values(by=["pep","HLA"]).reset_index(drop = True)
allele = pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_benchmark/allele.csv")
netMHCIIpan_ba = pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_benchmark/netMHCIIpan_15_mer_ba.csv")
netMHCIIpan_ba["pep_length"] = netMHCIIpan_ba["peptide"].map(len)
net_15mer_ba = netMHCIIpan_process(netMHCIIpan_ba,allele,DF15_sorted,peptide)
mer15_ba_result = auc_pr(net_15mer_ba,"Immunogenicity","prediction",rank = True)
print("AUC:{}".format(mer15_ba_result[4]))
#> AUC:0.7059071729957805
print("PR_AUC:{}".format(mer15_ba_result[5]))
#> PR_AUC:0.19857401640806183
netMHCIIpan_el = pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_benchmark/netMHCIIpan_15_mer_el.csv")
netMHCIIpan_el["pep_length"] = netMHCIIpan_el["peptide"].map(len)
net_15mer_el = netMHCIIpan_process(netMHCIIpan_el,allele,DF15_sorted,peptide)
mer15_el_result = auc_pr(net_15mer_el,"Immunogenicity","prediction",rank = True)
print("AUC:{}".format(mer15_el_result[4]))
#> AUC:0.6991791331031837
print("PR_AUC:{}".format(mer15_el_result[5]))
#> PR_AUC:0.16379884083064927
DF_k_sorted = DF_k.sort_values(by=["pep","HLA"]).reset_index(drop = True)
netMHCIIpan_ba_k = pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_benchmark/netMHCIIpan_k_mer_ba.csv")
netMHCIIpan_ba_k["pep_length"] =netMHCIIpan_ba_k["peptide"].map(len)
net_kmer_ba = netMHCIIpan_process(netMHCIIpan_ba_k,allele,DF_k_sorted,peptide)
kmer_ba_result = auc_pr(net_kmer_ba,"Immunogenicity","prediction",rank = True)
netMHCIIpan_el_k = pd.read_csv("../data/netMHCIIpan_k_mer_el.csv")
netMHCIIpan_el_k["pep_length"] =netMHCIIpan_el_k["peptide"].map(len)
net_kmer_el = netMHCIIpan_process(netMHCIIpan_el_k,allele,DF_k_sorted,peptide)
kmer_el_result = auc_pr(net_kmer_el,"Immunogenicity","prediction",rank = True)
print("BA AUC: {}".format(kmer_ba_result[4]))
#> BA AUC: 0.7056808778005946
print("BA PR_AUC: {}".format(kmer_ba_result[5]))
#> BA PR_AUC: 0.1910149930669585
print("EL AUC: {}".format(kmer_el_result[4]))
#> EL AUC: 0.6999570907530572
print("EL PR_AUC: {}".format(kmer_el_result[5]))
#> EL PR_AUC: 0.1594417550386672
kmer_data = pd.read_csv("../data/kmer_data.csv")
kmer_data["pep"].to_csv("../data/kmer_pep_only.csv",header = None,index = None)

In these comparisons, TLimmuno2 shows the best performance in both the 15-mer method and the k-mer method. Repitope and the IEDB tool evaluate epitope immunogenicity using population-level MHC information rather than personalized MHC allele information, and they perform poorly in these comparisons.

#python
fig,ax = plt.subplots(figsize = (8,8))
lw = 2
ax.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot(mer15_result[0],mer15_result[1],label='TLimmuno2 : (AUC={0:.4f})'.format(mer15_result[4]), lw=lw)
ax.plot(IEDB_score[0],IEDB_score[1],label='IEDB : (AUC={0:.4f})'.format(IEDB_score[4]), lw=lw)
ax.plot(Repitope_score[0],Repitope_score[1],label='Repitope : (AUC={0:.4f})'.format(Repitope_score[4]), lw=lw)
ax.plot(mer15_ba_result[0],mer15_ba_result[1],label='NetMHCIIpan_ba : (AUC={0:.4f})'.format(mer15_ba_result[4]), lw=lw)
ax.plot(mer15_el_result[0],mer15_el_result[1],label='NetMHCIIpan_el : (AUC={0:.4f})'.format(mer15_el_result[4]), lw=lw)
ax.set_ylim([0,1.05])
ax.set_xlim([0,1])
ax.set_xlabel('False Positive Rate',fontsize = 14)
ax.set_ylabel('True Positive Rate',fontsize = 14)
ax.legend(loc="lower right",fontsize = 12)
plt.savefig("../figure/benchmark_15mer.pdf",dpi = 300,transparent=True)
plt.show()

#python
fig,ax = plt.subplots(figsize = (8,8))
lw = 2
ax.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot(kmer_result[0],kmer_result[1],label='TLimmuno2 : (AUC={0:.4f})'.format(kmer_result[4]))
ax.plot(kmer_ba_result[0],kmer_ba_result[1],label='NetMHCIIpan_ba : (AUC={0:.4f})'.format(kmer_ba_result[4]))
ax.plot(kmer_el_result[0],kmer_el_result[1],label='NetMHCIIpan_el : (AUC={0:.4f})'.format(kmer_el_result[4]))
ax.set_ylim([0,1.05])
ax.set_xlim([0,1])
ax.set_xlabel('False Positive Rate',fontsize = 14)
ax.set_ylabel('True Positive Rate',fontsize = 14)
ax.legend(loc="lower right",fontsize = 12)
plt.savefig("../figure/benchmark_kmer.pdf",dpi = 300,transparent=True)
plt.show()

#python
fig,ax = plt.subplots(figsize = (8,8))
lw = 2
ax.plot(mer15_result[2],mer15_result[3],label='TLimmuno2 : (AUC={0:.4f})'.format(mer15_result[5]))
ax.plot(IEDB_score[2],IEDB_score[3],label='IEDB : (AUC={0:.4f})'.format(IEDB_score[5]))
ax.plot(Repitope_score[2],Repitope_score[3],label='Repitope : (AUC={0:.4f})'.format(Repitope_score[5]))
ax.plot(mer15_ba_result[2],mer15_ba_result[3],label='NetMHCIIpan_ba : (AUC={0:.4f})'.format(mer15_ba_result[5]))
ax.plot(mer15_el_result[2],mer15_el_result[3],label='NetMHCIIpan_el : (AUC={0:.4f})'.format(mer15_el_result[5]))
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall',fontsize = 14)
plt.ylabel('Precision',fontsize = 14)
plt.legend(loc="upper right",fontsize = 12)
plt.savefig("../figure/15mer_PR.pdf",dpi = 300,transparent=True)
plt.show()

#python
fig,ax = plt.subplots(figsize = (8,8))
lw = 2
ax.plot(kmer_result[2],kmer_result[3],label='TLimmuno2 : (AUC={0:.4f})'.format(kmer_result[5]))
ax.plot(kmer_ba_result[2],kmer_ba_result[3],label='NetMHCIIpan_ba : (AUC={0:.4f})'.format(kmer_ba_result[5]))
ax.plot(kmer_el_result[2],kmer_el_result[3],label='NetMHCIIpan_el : (AUC={0:.4f})'.format(kmer_el_result[5]))
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall',fontsize = 14)
plt.ylabel('Precision',fontsize = 14)
plt.legend(loc="upper right",fontsize = 12)
plt.savefig("../figure/kmer_pr.pdf",dpi = 300,transparent=True)
plt.show()

Confusion matrix

result_15mer["pre_label"] = result_15mer["prediction"].map(lambda x : 1 if x>0.5 else 0)
result_15mer_CM = confusion_matrix(result_15mer["Immunogenicity"],result_15mer["pre_label"])
result_15mer_CM = result_15mer_CM/len(result_15mer)
result_kmer["pre_label"] = result_kmer["prediction"].map(lambda x : 1 if x>0.5 else 0)
result_kmer_CM = confusion_matrix(result_kmer["Immunogenicity"],result_kmer["pre_label"])
result_kmer_CM = result_kmer_CM/len(result_kmer)
IEDB_result["pre_label"] = IEDB_result["score"].map(lambda x : 1 if x>50 else 0)
IEDB_result_CM = confusion_matrix(IEDB_result["Immunogenicity"],IEDB_result["pre_label"])
IEDB_result_CM = IEDB_result_CM/len(IEDB_result)
Repitope_result["pre_label"] = Repitope_result["ImmunogenicityScore"].map(lambda x : 1 if x>0.5 else 0)
Repitope_result_CM = confusion_matrix(Repitope_result["Immunogenicity"],Repitope_result["pre_label"])
Repitope_result_CM = Repitope_result_CM/len(Repitope_result)

def net_CM(data):
  data["pre_label"] = data["prediction"].map(lambda x : 1 if x<10 else 0)
  data_CM = confusion_matrix(data["Immunogenicity"],data["pre_label"])
  data_CM = data_CM/len(data)
  return data_CM


net_15mer_ba_CM = net_CM(net_15mer_ba)
net_15mer_el_CM = net_CM(net_15mer_el)
net_kmer_ba_CM = net_CM(net_kmer_ba)
net_kmer_el_CM = net_CM(net_kmer_el)
CM_all = [result_15mer_CM,IEDB_result_CM,Repitope_result_CM,net_15mer_ba_CM,net_15mer_el_CM,result_kmer_CM,net_kmer_ba_CM,net_kmer_el_CM]
CM_label = ["TLimmuno2_15mer","IEDB","Repitope","NetMHCIIpanBA_15mer","NetMHCIIpanEL_15mer","TLimmuno2_kmer","NetMHCIIpanBA_kmer","NetMHCIIpanEL_kmer"]
def CM_plot(data,label):
  fig,ax = plt.subplots(figsize = (4,4))
  sns.heatmap(data,annot =True,cmap="YlGnBu",vmin=0, vmax=1,cbar = False)
  plt.ylabel("True label")
  plt.xlabel("Prediction label")
  plt.title("{}".format(label))
  
  plt.savefig("../figure/{}.pdf".format(label),dpi = 300,transparent=True)

for i,l in enumerate(CM_all):
  CM_plot(l,CM_label[i])
  

Structure importance

Similarly, we wanted to demonstrate the impact of transfer learning on TLimmuno2, so we reconstructed two models: "only BA" and "without BA". For the only-BA model, the input layer contains only the transferred output of the BA model, while for the without-BA model, we masked the BA branch so that only the LSTM layers remain. The detailed code of this part is not included in the report; a sketch of the two architectures is given below. In independent validation datasets, the performance of these two models is significantly decreased compared with TLimmuno2. These results demonstrate that the transfer learning approach improves the predictive power of TLimmuno2 in peptide-MHC II immunogenicity prediction.
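
The following is a minimal sketch of the two ablated architectures, not the exact training script; the layer widths and the size of the transferred BA vector are illustrative assumptions:

#python
import tensorflow as tf
from tensorflow.keras import layers

pep_in = tf.keras.Input(shape=(21,21), name="peptide")   #BLOSUM62-encoded peptide
mhc_in = tf.keras.Input(shape=(34,21), name="MHC")       #BLOSUM62-encoded pseudo-sequence
ba_in = tf.keras.Input(shape=(10,), name="BA_transfer")  #transferred BA-layer output (width assumed)

#"without BA": the BA branch is masked, only the LSTM branches remain
lstm = layers.concatenate([layers.LSTM(64)(pep_in), layers.LSTM(64)(mhc_in)])
no_ba_out = layers.Dense(1, activation="sigmoid")(layers.Dense(64, activation="relu")(lstm))
model_no_ba = tf.keras.Model([pep_in, mhc_in], no_ba_out)

#"only BA": the input layer contains only the transferred output of the BA model
only_ba_out = layers.Dense(1, activation="sigmoid")(layers.Dense(16, activation="relu")(ba_in))
model_only_ba = tf.keras.Model(ba_in, only_ba_out)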

kmer_no_ba = pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_benchmark/kmer_no_ba_result.csv")
sam_pep_ID = kmer_no_ba["pep_ID"].unique()
res = []
p_ID = []
for a in sam_pep_ID:
    Mean = []
    ID = kmer_no_ba[kmer_no_ba["pep_ID"] == a]
    for i in ID["HLA"].unique():
        x = ID[ID["HLA"] == i]
        #x_mean = x["Rank"].mean()
        x_mean = x["prediction"].max()
        Mean.append(x_mean)
    res.append(max(Mean))
    p_ID.append(a)
dic = {"pep_ID":p_ID,"prediction":res}
kmer_no_ba_result = pd.DataFrame(dic)
kmer_no_ba_result = kmer_no_ba_result.merge(peptide,how = "inner",on = "pep_ID" )
kmer_no_ba = auc_pr(kmer_no_ba_result,"Immunogenicity","prediction")
print("AUC:{}".format(kmer_no_ba[4]))
#> AUC:0.6920828538550057
print("PR_AUC:{}".format(kmer_no_ba[5]))
#> PR_AUC:0.14357445791350346
kmer_only_ba = pd.read_csv("/Users/wangguangshuai/Data/MHCII/Data/IMM_model_benchmark/kmer_onlyBA_result.csv")
sam_pep_ID = kmer_only_ba["pep_ID"].unique()
res = []
p_ID = []
for a in sam_pep_ID:
    Mean = []
    ID = kmer_only_ba[kmer_only_ba["pep_ID"] == a]
    for i in ID["HLA"].unique():
        x = ID[ID["HLA"] == i]
        #x_mean = x["Rank"].mean()
        x_mean = x["prediction"].max()
        Mean.append(x_mean)
    res.append(max(Mean))
    p_ID.append(a)
dic = {"pep_ID":p_ID,"prediction":res}
kmer_only_ba_result = pd.DataFrame(dic)
kmer_only_ba_result = kmer_only_ba_result.merge(peptide,how = "inner",on = "pep_ID" )
kmer_only_ba = auc_pr(kmer_only_ba_result,"Immunogenicity","prediction")
print("AUC:{}".format(kmer_only_ba[4]))
#> AUC:0.6529267357115459
print("PR_AUC:{}".format(kmer_only_ba[5]))
#> PR_AUC:0.13744676908175116
z_score_noba,p_val_noba = DelongTest(result_kmer["prediction"],kmer_no_ba_result["prediction"],result_kmer["Immunogenicity"]).show_result()
z_score_only_ba,p_val_only_ba = DelongTest(result_kmer["prediction"],kmer_only_ba_result["prediction"],result_kmer["Immunogenicity"]).show_result()
fig,ax = plt.subplots(figsize = (8,8))
lw = 2
ax.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot(kmer_result[0],kmer_result[1],label='TLimmuno2 : (AUC={0:.4f})'.format(kmer_result[4]))
ax.plot(kmer_no_ba[0],kmer_no_ba[1],label='TLimmuno2_noba : (AUC={0:.4f})'.format(kmer_no_ba[4]))
ax.set_ylim([0,1.05])
ax.set_xlim([0,1])
ax.set_xlabel('False Positive Rate',fontsize = 14)
ax.set_ylabel('True Positive Rate',fontsize = 14)
ax.text(0.78,0.2,"P value: {:.4f}".format(p_val_noba),weight="bold")
#> P value: 0.0142
ax.legend(loc="lower right",fontsize = 12)
plt.savefig("../figure/kmer_noba.pdf",dpi = 300,transparent=True)
plt.show()

fig,ax = plt.subplots(figsize = (8,8))
lw = 2
ax.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.plot(kmer_result[0],kmer_result[1],label='TLimmuno2 : (AUC={0:.4f})'.format(kmer_result[4]))
ax.plot(kmer_only_ba[0],kmer_only_ba[1],label='TLimmuno2_onlyba : (AUC={0:.4f})'.format(kmer_only_ba[4]))
ax.set_ylim([0,1.05])
ax.set_xlim([0,1])
ax.set_xlabel('False Positive Rate',fontsize = 14)
ax.set_ylabel('True Positive Rate',fontsize = 14)
ax.text(0.78,0.2,"P value: {:.4f}".format(p_val_only_ba),weight="bold")
#> P value: 0.0042
ax.legend(loc="lower right",fontsize = 12)
plt.savefig("../figure/kmer_onlyba.pdf",dpi = 300,transparent=True)
plt.show()

Neoepitope and wildtype peptide

To further demonstrate the discriminative power of TLimmuno2, we collected 52 MHC II neoepitopes and their corresponding wild-type peptides from published papers. These neoepitopes have been experimentally shown to elicit T cell responses. For each neoantigen and its corresponding wild-type peptide, we predicted the immunogenicity rank value with TLimmuno2. As shown in the figure below, the TLimmuno2-predicted immunogenicity rank values of mutant neopeptides are significantly lower than those of wild-type peptides, demonstrating the strong ability of TLimmuno2 to recognize real neoepitopes, even when a neoepitope differs from its wild-type peptide by only one amino acid.

#python
neoantigen = pd.read_csv("../data/neoantigen.csv")
pseudo = pd.read_table("../data/pseudosequence.2016.all.X.dat",header = None,names = ["allele","sequence"])
neoantigen_process = pd.merge(neoantigen,pseudo,left_on = "HLA",right_on = "allele").dropna()
neoantigen_process.to_csv("../data/neoantigen_pro.csv")
#python

result = pd.read_csv("../data/neoantigen_result.csv")
result = result[result["immunogenic"] == 1]
result.index = range(len(result))
Rank = pd.melt(result[["WT_Rank","Mut_Rank"]],var_name = "Type",value_name = "Rank")
from scipy import stats
_,p = stats.wilcoxon(result["WT_Rank"],result["Mut_Rank"])
fig,ax = plt.subplots(figsize = (4,6))
ax = sns.boxplot(data = Rank,x = "Type",y = "Rank",width = 0.4,color = "skyblue",linewidth = 2,showfliers = False)
b = sns.stripplot(data = Rank,x = "Type",y = "Rank",color = "black",jitter = False)

for i in range(len(result)):
  plt.plot([0,1],[result["WT_Rank"][i],result["Mut_Rank"][i]],c ="gray",lw = 0.25)
x1, x2 = 0, 1   
y, h, col = Rank['Rank'].max() + 0.1, 0.1, 'k'
plt.plot([x1, x1, x2, x2], [y, y+h, y+h, y], lw=1, c=col)
plt.text((x1+x2)*.5, y+h, "**", ha='center', va='bottom', color=col)
ax.text(0.3,0.9,"P = {:.4f}".format(p),weight="bold")
#> P = 0.0009
plt.ylim([0,1.2])
plt.yticks(np.arange(0,1.2,0.2))
plt.xticks((0,1),labels = ["WT peptide","Neoantigen"])
plt.tight_layout()
sns.despine(offset = 3, trim = True)
plt.savefig("../figure/neoantigen.pdf",dpi = 300,transparent=True)
plt.show()

#python
result = pd.read_csv("../data/neoantigen_result.csv")
result = result[result["immunogenic"] == 1]
Rank = pd.melt(result[["WT_prediction","Mut_prediction"]],var_name = "Type",value_name = "Score")
from scipy import stats
_,p = stats.wilcoxon(result["WT_prediction"],result["Mut_prediction"])
fig,ax = plt.subplots(figsize = (4,6))
ax = sns.boxplot(data = Rank,x = "Type",y = "Score",width = 0.4,color = "skyblue",linewidth = 2,showfliers = False)
b = sns.stripplot(data = Rank,x = "Type",y = "Score",color = "red")
sns.lineplot(data = Rank,x = "Type",y = "Score",color = "red")
x1, x2 = 0, 1   
y, h, col = Rank['Score'].max() + 0.1, 0.1, 'k'
plt.plot([x1, x1, x2, x2], [y, y+h, y+h, y], lw=1, c=col)
plt.text((x1+x2)*.5, y+h, "**", ha='center', va='bottom', color=col)
ax.text(0.3,0.9,"P = {:.4f}".format(p),weight="bold")
#> P = 0.0004
plt.ylim([0,1.2])
plt.yticks(np.arange(0,1.2,0.2))
plt.xticks((0,1),labels = ["WT peptide","Neoantigen"])
plt.tight_layout()
sns.despine(offset = 3, trim = True)
#plt.savefig("../figure/neoantigen_prediction.pdf",dpi = 300,transparent=True)
plt.show()

Position importance

TLimmuno2

We performed in silico mutational analyses to investigate whether TLimmuno2 is able to learn the molecular characteristics of the peptide. By changing the amino acid residues at each position, we compared the differences in predicted values; the larger the decrease, the greater the importance of the position. We did this in two ways: ala-scanning and zero-setting. For ala-scanning, we sequentially replaced each position with alanine; for zero-setting, we sequentially masked each position (the representation vector for the masked position is all zeros). We repeated this process 100 times, performing an ascending ranking each time to highlight the most salient positions.

#python
#PBS
import pandas as pd
import tensorflow as tf
import sys
sys.path.append("/public/slst/home/wanggsh/Project/MHCII/utils")
from Blosum62 import blosum62
import numpy as np
import random
import matplotlib.pyplot as plt
import matplotlib as mpl
import os
#os.environ["CUDA_VISIBLE_DEVICES"]="-1" 

immuno_data = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/IMM_data_random2.csv")
immuno_data = immuno_data[immuno_data["label"] == 1]
immuno_data["length"] = immuno_data["Description"].map(len)
#choose 15mer
immuno_data = immuno_data[immuno_data["length"] == 15]
immuno_data = immuno_data.reset_index(drop=True)

immuno_data["pep_blosum"] = immuno_data["Description"].apply(blosum62,args=(21,))
immuno_data["MHC_blosum"] = immuno_data["sequence"].apply(blosum62,args=(34,))
peptide = np.empty((len(immuno_data),21,21))
for i in range(len(immuno_data)):
    peptide[i] = immuno_data["pep_blosum"][i].reshape((21,21))
#mhc 
MHC = np.empty((len(immuno_data),34,21))
for i in range(len(immuno_data)):
    MHC[i] = immuno_data["MHC_blosum"][i].reshape((34,21))
Ala = list(blosum62("A",1))

BAmodel = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
BAmodel = tf.keras.models.Model(inputs = BAmodel.input,outputs = BAmodel.layers[-2].output)
IMM = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/MHCII_immuno/model")

n = 100
position_rank = np.empty([n,15])
for m in range(n):
    index = np.random.choice(np.arange(len(immuno_data)),2000)
    peptide_sam = peptide[index].copy()
    MHC_sam = MHC[index]
    BA = BAmodel.predict([peptide_sam,MHC_sam])
    pred_ori = IMM([peptide_sam,MHC_sam,BA]).numpy().mean()  
    array = []  
    for i in range(15):
        peptide_sam[:,i,:] = 0 #mask position i with an all-zero vector
        BA = BAmodel.predict([peptide_sam,MHC_sam])
        #importance = drop in the mean prediction after masking position i
        importance = pred_ori - IMM([peptide_sam,MHC_sam,BA]).numpy().mean()
        array.append(importance)
        peptide_sam = peptide[index].copy() #restore the original peptides
    ranking = np.argsort(array) + 1
    tmp = []
    for i in range(15):
        tmp.append(list(ranking).index(i+1))
    position_rank[m,:] = tmp
np.save("/public/slst/home/wanggsh/Data/MHCII_immuno/position_importance/position.npy",position_rank)

n = 100
position_rank_ala = np.empty([n,15])
for m in range(n):
    index = np.random.choice(np.arange(len(immuno_data)),2000)
    peptide_sam = peptide[index].copy()
    MHC_sam = MHC[index]
    BA = BAmodel.predict([peptide_sam,MHC_sam])
    pred_ori = IMM([peptide_sam,MHC_sam,BA]).numpy().mean()  
    array = []  
    for i in range(15):
        peptide_sam[:,i,:] = Ala
        BA = BAmodel.predict([peptide_sam,MHC_sam])
        importance = pred_ori - IMM([peptide_sam,MHC_sam,BA]).numpy().mean()
        array.append(importance)
        peptide_sam = peptide[index].copy()
    ranking = np.argsort(array) + 1
    tmp = []
    for i in range(15):
        tmp.append(list(ranking).index(i+1))
    position_rank_ala[m,:] = tmp
np.save("/public/slst/home/wanggsh/Data/MHCII_immuno/position_importance/position_ala.npy",position_rank_ala)

Plot figures

#python
from collections import Counter
position_rank = np.load("../data/position.npy")
cmap = mpl.cm.get_cmap('tab20')
delim = np.linspace(0,1,15)
colors = [mpl.colors.rgb2hex(cmap(i)[:4]) for i in delim]
fig,ax = plt.subplots(figsize = (6,4))
for i in np.arange(15):
    y = list(Counter(position_rank[:,i]+1).keys())
    s = list(Counter(position_rank[:,i]+1).values())
    ax.scatter([i for n in range(len(y))],y, s=[m*3.5 for m in s],c=colors[i])
ax.set_ylim(0,16)
ax.set_yticks(np.arange(15)+1)
ax.set_xticks(np.arange(15))
ax.set_xticklabels(['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'])
ax.set_xlabel('Zero setting')
ax.set_ylabel('Ranking(ascending)')
ax.grid(True,alpha=0.2)
h1 = [ax.plot([],[],color='grey',marker='o',markersize=i,ls='')[0] for i in range(8,15,2)]
leg1 = ax.legend(handles=h1,labels=[10,40,70,100],title='Frequency',loc='lower left',bbox_to_anchor=(0.8,0.6),frameon=False)
ax.add_artist(leg1)
plt.savefig("../figure/position_zero.pdf",dpi = 300,transparent=True)
plt.show()

#python
position_rank_ala = np.load("../data/position_ala.npy")
cmap = mpl.cm.get_cmap('tab20')
delim = np.linspace(0,1,15)
colors = [mpl.colors.rgb2hex(cmap(i)[:4]) for i in delim]
fig,ax = plt.subplots(figsize = (6,4))
for i in np.arange(15):
    y = list(Counter(position_rank_ala[:,i]+1).keys())
    s = list(Counter(position_rank_ala[:,i]+1).values())
    ax.scatter([i for n in range(len(y))],y, s=[m*3.5 for m in s],c=colors[i])
ax.set_ylim(0,16)
ax.set_yticks(np.arange(15)+1)
ax.set_xticks(np.arange(15))
ax.set_xticklabels(['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'])
ax.set_xlabel('Ala scanning')
ax.set_ylabel('Ranking(ascending)')
ax.grid(True,alpha=0.2)
h1 = [ax.plot([],[],color='grey',marker='o',markersize=i,ls='')[0] for i in range(8,15,2)]
leg1 = ax.legend(handles=h1,labels=[10,40,70,100],title='Frequency',loc='lower left',bbox_to_anchor=(0.8,0.6),frameon=False)
ax.add_artist(leg1)
plt.savefig("../figure/position_ala.pdf",dpi = 300,transparent=True)
plt.show()

Both results show that P4 (residue 4), P5, and P6 are very important for model prediction. This is consistent with findings for MHC class I molecules, where these positions are reported to be essential for interacting with the TCR. Unlike MHC class I, positions closer to the peptide tail tend to have less effect on MHC II model predictions, possibly because of the open binding cleft of MHC II molecules. The results of ala-scanning are more stable than those of zero-setting, likely because alanine is a real amino acid with biological functions, whereas the all-zero vector never occurs in natural peptides.

BA model

We performed the same analysis with the BA model.

import pandas as pd
import tensorflow as tf
import sys
sys.path.append("/public/slst/home/wanggsh/Project/MHCII/utils")
from Blosum62 import blosum62
import numpy as np
import random
import matplotlib.pyplot as plt
import matplotlib as mpl
import os
#os.environ["CUDA_VISIBLE_DEVICES"]="-1" 


immuno_data = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII/netMHCpanII4.0_train_data/BA_data/BA_MHC_label.csv")
immuno_data = immuno_data[immuno_data["score"] == 1]
immuno_data["length"] = immuno_data["peptide"].map(len)
#choose 15mer
immuno_data = immuno_data[immuno_data["length"] == 15]
immuno_data = immuno_data.reset_index(drop=True)


immuno_data["pep_blosum"] = immuno_data["peptide"].apply(blosum62,args=(21,))
immuno_data["MHC_blosum"] = immuno_data["sequence"].apply(blosum62,args=(34,))
peptide = np.empty((len(immuno_data),21,21))
for i in range(len(immuno_data)):
    peptide[i] = immuno_data["pep_blosum"][i].reshape((21,21))
#mhc 
MHC = np.empty((len(immuno_data),34,21))
for i in range(len(immuno_data)):
    MHC[i] = immuno_data["MHC_blosum"][i].reshape((34,21))
Ala = list(blosum62("A",1))

BAmodel = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")

n = 100
position_rank = np.empty([n,15])
for m in range(n):
    index = np.random.choice(np.arange(len(immuno_data)),2000)
    peptide_sam = peptide[index].copy()
    MHC_sam = MHC[index]
    pred_ori = BAmodel([peptide_sam,MHC_sam]).numpy().mean()  
    array = []  
    for i in range(15):
        peptide_sam[:,i,:] = 0
        importance = pred_ori - BAmodel([peptide_sam,MHC_sam]).numpy().mean()
        array.append(importance)
        peptide_sam = peptide[index].copy()
    ranking = np.argsort(array) + 1
    tmp = []
    for i in range(15):
        tmp.append(list(ranking).index(i+1))
    position_rank[m,:] = tmp
np.save("/public/slst/home/wanggsh/Data/MHCII_immuno/position_importance/position_zero_BA.npy",position_rank)

n = 100
position_rank_ala = np.empty([n,15])
for m in range(n):
    index = np.random.choice(np.arange(len(immuno_data)),2000)
    peptide_sam = peptide[index].copy()
    MHC_sam = MHC[index]
    pred_ori = BAmodel([peptide_sam,MHC_sam]).numpy().mean()  
    array = []  
    for i in range(15):
        peptide_sam[:,i,:] = Ala
        importance = pred_ori - BAmodel([peptide_sam,MHC_sam]).numpy().mean()
        array.append(importance)
        peptide_sam = peptide[index].copy()
    ranking = np.argsort(array) + 1
    tmp = []
    for i in range(15):
        tmp.append(list(ranking).index(i+1))
    position_rank[m,:] = tmp
np.save("/public/slst/home/wanggsh/Data/MHCII_immuno/position_importance/position_ala_BA.npy",position_rank)
#python
from collections import Counter
position_rank = np.load("../data/position_zero_BA.npy")
cmap = mpl.cm.get_cmap('tab20')
delim = np.linspace(0,1,15)
colors = [mpl.colors.rgb2hex(cmap(i)[:4]) for i in delim]
fig,ax = plt.subplots(figsize = (6,6))
for i in np.arange(15):
    y = list(Counter(position_rank[:,i]+1).keys())
    s = list(Counter(position_rank[:,i]+1).values())
    ax.scatter([i for n in range(len(y))],y, s=[m*3.5 for m in s],c=colors[i])
ax.set_ylim(0,16)
ax.set_yticks(np.arange(15)+1)
ax.set_xticks(np.arange(15))
ax.set_xticklabels(['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'])
ax.set_xlabel('Zero setting')
ax.set_ylabel('Ranking(ascending)')
ax.grid(True,alpha=0.2)
h1 = [ax.plot([],[],color='grey',marker='o',markersize=i,ls='')[0] for i in range(8,15,2)]
leg1 = ax.legend(handles=h1,labels=[10,40,70,100],title='Frequency',loc='lower left',bbox_to_anchor=(0.98,0.2),frameon=False)
ax.add_artist(leg1)
plt.savefig("../figure/position_zero_BA.pdf",dpi = 300,bbox_inches='tight',transparent=True)
plt.show()

#python
position_rank_ala = np.load("../data/position_ala_BA.npy")
cmap = mpl.cm.get_cmap('tab20')
delim = np.linspace(0,1,15)
colors = [mpl.colors.rgb2hex(cmap(i)[:4]) for i in delim]
fig,ax = plt.subplots(figsize = (6,6))
for i in np.arange(15):
    y = list(Counter(position_rank_ala[:,i]+1).keys())
    s = list(Counter(position_rank_ala[:,i]+1).values())
    ax.scatter([i for n in range(len(y))],y, s=[m*3.5 for m in s],c=colors[i])
ax.set_ylim(0,16)
ax.set_yticks(np.arange(15)+1)
ax.set_xticks(np.arange(15))
ax.set_xticklabels(['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'])
ax.set_xlabel('Ala scanning')
ax.set_ylabel('Ranking(ascending)')
ax.grid(True,alpha=0.2)
h1 = [ax.plot([],[],color='grey',marker='o',markersize=i,ls='')[0] for i in range(8,15,2)]
leg1 = ax.legend(handles=h1,labels=[10,40,70,100],title='Frequency',loc='lower left',bbox_to_anchor=(0.98,0.2),frameon=False)
ax.add_artist(leg1)
plt.savefig("../figure/position_ala_BA.pdf",dpi = 300,bbox_inches='tight',transparent=True)
plt.show()

Furthermore, the BA model does not show the same trend; a possible explanation is that the affinity prediction model mainly learns features of the peptide-MHC interaction, which differ from the features learned by the immunogenicity prediction model.

Comparison with other machine learning and deep learning methods

To further investigate the importance of model structure to TLimmuno2 performance, we compared it against alternative architectures. We constructed four traditional machine learning classifiers (AdaBoost, KNN, random forest and SVM) and tuned their hyperparameters by cross-validation. In addition, we constructed a deep neural network (DNN) and a convolutional neural network (CNN). All models were trained on the same data and validated on our curated neoepitope datasets.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score,make_scorer,roc_curve,precision_recall_curve,auc
scorer = make_scorer(roc_auc_score)
Blosum62_matrix = pd.read_csv("../data/BLOSUM62.csv",comment="#")
Protein_alphabet = list("ARNDCQEGHILKMFPSTWYVX")
Blosum62_matrix = Blosum62_matrix[Protein_alphabet]
Blosum62_matrix = Blosum62_matrix.loc[Protein_alphabet]

def blosum62(peptide,maxlen):
    encoder = np.empty((maxlen,21))
    if len(peptide) <=maxlen:
        peptide = peptide + "X"*(maxlen-len(peptide))
    for i in range(len(peptide)):
        pep = list(peptide)[i]
        coder = Blosum62_matrix[pep]
        encoder[i] = coder
    return encoder.flatten()
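# a 15-mer padded to 21 residues encodes to a 21*21 = 441-dim vector and the
# 34-residue MHC pseudo-sequence to a 714-dim vector: 1155 features per sample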
def auc_pr(data,true_label,prediction,rank = False):
    if rank:
        fpr,tpr,_ = roc_curve(data[true_label],1-data[prediction])
        precision,recall,_ = precision_recall_curve(data[true_label],1-data[prediction])
    else:
        fpr,tpr,_ = roc_curve(data[true_label],data[prediction])
        precision,recall,_ = precision_recall_curve(data[true_label],data[prediction])
        
    AUC = auc(fpr,tpr)
    PR = auc(recall,precision)
    
    return fpr,tpr,recall,precision,AUC,PR
def IMM_process(Data,peptide):
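    # for each source peptide (pep_ID): take the max prediction within each
    # k-mer length, average those maxima per HLA, then keep the best HLA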
    sam_pep_ID = Data["pep_ID"].unique()
    res = []
    p_ID = []
    Data["length"] = Data["pep"].map(len)
    for a in sam_pep_ID:
        Mean = []    
        ID = Data[Data["pep_ID"] == a]
        for i in ID["HLA"].unique():
            Data_HLA = ID[ID["HLA"] == i]
            prediction = []
            for l in Data_HLA["length"].unique():
                Data_len = Data_HLA[Data_HLA["length"] == l]
                pre = Data_len["prediction"].max()
                prediction.append(pre)
            #x_mean = x["Rank"].mean()
            x_mean = np.mean(prediction)
            Mean.append(x_mean)
        res.append(max(Mean))
        p_ID.append(a)
    result = pd.DataFrame({"pep_ID":p_ID,"prediction":res})
    result = result.merge(peptide,how = "inner",on = "pep_ID" )
    
    return result
def evaluate(estimator,test_X,test_Y):
        pred = estimator.predict(test_X)
        result = roc_auc_score(test_Y,pred)
        return result
def evaluate2(estimator,data,peptide,test_X):
        data["prediction"] = estimator.predict(test_X)
        result = IMM_process(data,peptide)
        result = auc_pr(result,"Immunogenicity","prediction")
        return result[4]
def method(estimator):
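    # 5-fold CV harness: refit the estimator on each training fold, then record
    # AUC on the held-out fold and on the 15mer and kmer neoepitope benchmarks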
    kf = KFold(n_splits=5)
    fold_indices = list(kf.split(np.arange(X.shape[0])))
    holding = {'validation': [], 'mer15': [],'kmer': []}
    for fold in fold_indices:
        # split
        train_X, train_Y, test_X, test_Y = X[fold[0]], np.array(Y)[fold[0]], X[fold[1]], np.array(Y)[fold[1]]
        # train
        estimator.fit(train_X, train_Y)
        # test in validation set
        result_validation = evaluate(estimator,test_X,test_Y)
        holding['validation'].append(result_validation)
        # test in mer15
        AUC = evaluate2(estimator,mer15,peptide,mer15_X)
        holding['mer15'].append(AUC)
        # test in kmer
        AUC = evaluate2(estimator,kmer,peptide,kmer_X)
        holding['kmer'].append(AUC)
    return holding

Data process

peptide = pd.read_csv("../data/peptide.csv")
immuno_data = pd.read_csv("../data/IMM_data_random2.csv")
immuno_data["pep_blosum"] = immuno_data["Description"].apply(blosum62,args=(21,))
immuno_data["MHC_blosum"] = immuno_data["sequence"].apply(blosum62,args=(34,))
X = np.empty((len(immuno_data),1155))
for i in range(len(immuno_data)):
    x = immuno_data.iloc[i,6]
    y = immuno_data.iloc[i,7]
    X[i, :] = np.concatenate((x,y))
Y = immuno_data["label"].values
mer15 = pd.read_csv("../data/15mer_data.csv")
mer15["pep_blosum"] = mer15["pep"].apply(blosum62,args=(21,))
mer15["MHC_blosum"] = mer15["sequence"].apply(blosum62,args=(34,))
mer15_X = np.empty((len(mer15),1155))
for i in range(len(mer15)):
    x = mer15.iloc[i,5]
    y = mer15.iloc[i,6]
    mer15_X[i,:] = np.concatenate((x,y))
kmer = pd.read_csv("../data/kmer_data.csv")
kmer["pep_blosum"] = kmer["pep"].apply(blosum62,args=(21,))
kmer["MHC_blosum"] = kmer["sequence"].apply(blosum62,args=(34,))
kmer_X = np.empty((len(kmer),1155))
for i in range(len(kmer)):
    x = kmer.iloc[i,5]
    y = kmer.iloc[i,6]
    kmer_X[i,:] = np.concatenate((x,y))
holder = {}

Machine learning

Adaboost Classifier

from sklearn.model_selection import cross_validate
from sklearn.ensemble import AdaBoostClassifier
cv_results = []
space = np.linspace(20, 220, 6)
for i in space:
    cv_result = cross_validate(AdaBoostClassifier(n_estimators=int(i)), X, Y, cv=3, scoring=scorer, n_jobs=-1,
                               verbose=5)
    cv_results.append(cv_result)

y1 = [item['test_score'].mean() for item in cv_results] 
y1_e = [item['test_score'].std() for item in cv_results]

ax1 = plt.subplot(1, 1, 1)
ax1.plot(np.arange(len(space)), y1, marker='o', markersize=5) 
ax1.fill_between(np.arange(len(space)), [y1[i] - y1_e[i] for i in range(len(space))],
                 [y1[i] + y1_e[i] for i in range(len(space))], alpha=0.2) #
ax1.set_xticks(np.arange(len(space)))
ax1.set_xticklabels(['{0:.2f}'.format(i) for i in space])
plt.show()
estimator = AdaBoostClassifier(n_estimators=180)
holder['Adaboost'] = method(estimator = estimator)

KNN Classifier

# KNN Classifier
cv_results = []
from sklearn.neighbors import KNeighborsClassifier
space = np.linspace(1,100,10)
for i in space:
    cv_result = cross_validate(KNeighborsClassifier(n_neighbors=int(i)),X,Y,cv=5,scoring=scorer,n_jobs=-1,verbose=5)
    cv_results.append(cv_result)
y1 = [item['test_score'].mean() for item in cv_results]
y1_e = [item['test_score'].std() for item in cv_results]
ax1 = plt.subplot(1,1,1)
ax1.plot(np.arange(len(space)),y1,marker='o',markersize=5)
ax1.fill_between(np.arange(len(space)),[y1[i]-y1_e[i] for i in range(len(space))],[y1[i]+y1_e[i] for i in range(len(space))],alpha=0.2)
ax1.set_xticks(np.arange(len(space)))
ax1.set_xticklabels(['{0:.2f}'.format(i) for i in space])
estimator = KNeighborsClassifier(n_neighbors=1)
holder['KNN'] = method(estimator = estimator)

Random forest Classifier

cv_results = []
from sklearn.ensemble import RandomForestClassifier
space = np.linspace(1, 100, 20)
for i in space:
    cv_result = cross_validate(RandomForestClassifier(n_estimators=200,min_samples_leaf=int(i)), X, Y, cv=3, scoring=scorer, n_jobs=-1,
                               verbose=5)
    cv_results.append(cv_result)
y1 = [item['test_score'].mean() for item in cv_results]
y1_e = [item['test_score'].std() for item in cv_results]
ax1 = plt.subplot(1, 1, 1)
ax1.plot(np.arange(len(space)), y1, marker='o', markersize=5)
ax1.fill_between(np.arange(len(space)), [y1[i] - y1_e[i] for i in range(len(space))],
                 [y1[i] + y1_e[i] for i in range(len(space))], alpha=0.2)
ax1.set_xticks(np.arange(len(space)))
ax1.set_xticklabels(['{0:.2f}'.format(i) for i in space])
estimator = RandomForestClassifier(n_estimators=200,min_samples_leaf=1)
holder['RF'] = method(estimator = estimator)

Support Vector Machine (SVC)

cv_results = []
from sklearn.svm import LinearSVC
space = np.logspace(-3, 3, 7)
for i in space:
    cv_result = cross_validate(LinearSVC(C=i), X, Y, cv=3, scoring=scorer, n_jobs=-1,
                               verbose=5)
    cv_results.append(cv_result)
y1 = [item['test_score'].mean() for item in cv_results]
y1_e = [item['test_score'].std() for item in cv_results]
ax1 = plt.subplot(1, 1, 1)
ax1.plot(np.arange(len(space)), y1, marker='o', markersize=5)
ax1.fill_between(np.arange(len(space)), [y1[i] - y1_e[i] for i in range(len(space))],
                 [y1[i] + y1_e[i] for i in range(len(space))], alpha=0.2)
ax1.set_xticks(np.arange(len(space)))
ax1.set_xticklabels(['{0:.2f}'.format(i) for i in space])
estimator = LinearSVC(C=0.01)
holder['SVM'] = method(estimator = estimator)
np.save("../data/another_machine_learning.npy",holder)
holder = np.load("../data/another_machine_learning.npy",allow_pickle=True)
mean_15mer = []
mean_kmer = []
method_type = []
method_type.append("TLimmuno2")
mean_15mer.append(0.7332)
mean_kmer.append(0.7372)
for i in holder.tolist().keys():
  DF = pd.DataFrame(holder.tolist()[i])
  mean_15mer.append(DF["mer15"].mean())
  mean_kmer.append(DF["kmer"].mean())
  method_type.append(i)
method_type[1] = "Adaboost"

Deep learning

#PBS
#python
import pandas as pd
import os
import tensorflow as tf
from sklearn.model_selection import train_test_split,KFold
import numpy as np
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
from sklearn.metrics import auc,precision_recall_curve,roc_curve,confusion_matrix,average_precision_score
import matplotlib.pyplot as plt
import sys
sys.path.append("/public/slst/home/wanggsh/Project/MHCII/utils")
from Blosum62 import blosum62
#Data process

immuno_data = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/IMM_data_random2.csv")
immuno_data["pep_blosum"] = immuno_data["Description"].apply(blosum62,args=(21,))
immuno_data["MHC_blosum"] = immuno_data["sequence"].apply(blosum62,args=(34,))
peptide = np.empty((len(immuno_data),21,21))
for i in range(len(immuno_data)):
    peptide[i] = immuno_data["pep_blosum"][i].reshape((21,21))
#mhc 
MHC = np.empty((len(immuno_data),34,21))
for i in range(len(immuno_data)):
    MHC[i] = immuno_data["MHC_blosum"][i].reshape((34,21))
labels = immuno_data["label"].values

peptide_CNN = np.empty((len(immuno_data),21,21,1))
for i in range(len(immuno_data)):
    peptide_CNN[i,:,:,] = immuno_data["pep_blosum"][i].reshape((21,21,1))


MHC_CNN = np.empty((len(immuno_data),34,21,1))
for i in range(len(immuno_data)):
    MHC_CNN[i] = immuno_data["MHC_blosum"][i].reshape((34,21,1))
labels_CNN = immuno_data["label"].values

def DNN():
    pep = tf.keras.Input(shape = (21,21))
    MHC = tf.keras.Input(shape = (34,21))
    x = tf.keras.layers.Flatten()(pep)
    y = tf.keras.layers.Flatten()(MHC)
    combined = tf.keras.layers.concatenate([x,y])

    z = tf.keras.layers.Dense(400,activation = "relu")(combined)
    z = tf.keras.layers.Dense(200,activation = "relu")(z)
    z = tf.keras.layers.Dense(1,activation = "sigmoid")(z)
    
    model = tf.keras.Model(inputs = [pep,MHC],outputs = z)

    return model
DNN = DNN()

def CNN():
    pep = tf.keras.Input(shape = (21,21,1))
    MHC = tf.keras.Input(shape = (34,21,1))

    x = tf.keras.layers.Conv2D(filters=16, kernel_size=(2, 12))(pep)  # 9
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.activations.relu(x)
    x = tf.keras.layers.Conv2D(filters=32, kernel_size=(2, 1))(x)    # 8
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.activations.relu(x)
    x = tf.keras.layers.MaxPool2D(pool_size=(2, 1), strides=(2, 1))(x)  # 4
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.Model(inputs=pep, outputs=x)

    y = tf.keras.layers.Conv2D(filters=16, kernel_size=(15, 12))(MHC)     # 32
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.activations.relu(y)
    y = tf.keras.layers.MaxPool2D(pool_size=(2, 1), strides=(2, 1))(y)  # 16
    y = tf.keras.layers.Conv2D(filters=32,kernel_size=(9,1))(y)    # 8
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.activations.relu(y)
    y = tf.keras.layers.MaxPool2D(pool_size=(2, 1),strides=(2,1))(y)  # 4
    y = tf.keras.layers.Flatten()(y)
    y = tf.keras.Model(inputs=MHC,outputs=y)

    combined = tf.keras.layers.concatenate([x.output,y.output])
    z = tf.keras.layers.Dense(128,activation='relu')(combined)
    z = tf.keras.layers.Dense(1,activation='sigmoid')(z)

    model = tf.keras.Model(inputs=[pep,MHC],outputs=z)
    return model

CNN = CNN()
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
Loss = tf.keras.losses.BinaryCrossentropy()
Optimizer = tf.keras.optimizers.Adam(learning_rate = 0.00008) 
Metrics = [tf.keras.metrics.AUC(),tf.keras.metrics.BinaryAccuracy(),tf.keras.metrics.AUC(curve = "PR"),tf.keras.metrics.Precision()] #add ROC&PR auc &PPV in metrics
Batch_size= 200
Epochs= 150
Verbose = 2 

DNN.compile(
        loss = Loss,
        optimizer = Optimizer,
        metrics = Metrics)
pep_Train,pep_Test,MHC_Train,MHC_Test,label_Train,label_Test = train_test_split(peptide,MHC,labels,test_size=0.1,random_state=202209,stratify=labels)
history = DNN.fit([pep_Train,MHC_Train],label_Train,
        batch_size=Batch_size,epochs=Epochs,
        validation_data = ([pep_Test,MHC_Test],label_Test),
        verbose = Verbose,
        callbacks = callback)

benchmark_Data = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/kmer_data.csv")
benchmark_Data["pep_blosum"] = benchmark_Data["pep"].apply(blosum62,args=(21,))
benchmark_Data["MHC_blosum"] = benchmark_Data["sequence"].apply(blosum62,args=(34,))
bench_pep = np.empty((len(benchmark_Data),21,21))
for i in range(len(benchmark_Data)):
    bench_pep[i] = benchmark_Data["pep_blosum"][i].reshape((21,21))
#mhc
bench_MHC = np.empty((len(benchmark_Data),34,21))
for i in range(len(benchmark_Data)):
    bench_MHC[i] = benchmark_Data["MHC_blosum"][i].reshape((34,21))
prediction = DNN.predict([bench_pep,bench_MHC])

benchmark_Data["prediction"] = prediction
benchmark_Data.pop("pep_blosum")
benchmark_Data.pop("MHC_blosum")
benchmark_Data.to_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/kmer_DNN_result.csv")
CNN.compile(
        loss = Loss,
        optimizer = Optimizer,
        metrics = Metrics)
pep_Train,pep_Test,MHC_Train,MHC_Test,label_Train,label_Test = train_test_split(peptide_CNN,MHC_CNN,labels_CNN,test_size=0.1,random_state=202209,stratify=labels)
history = CNN.fit([pep_Train,MHC_Train],label_Train,
        batch_size=Batch_size,epochs=Epochs,
        validation_data = ([pep_Test,MHC_Test],label_Test),
        verbose = Verbose,
        callbacks = callback)

bench_pep_CNN = np.empty((len(benchmark_Data),21,21,1))
for i in range(len(benchmark_Data)):
    bench_pep_CNN[i] = benchmark_Data["pep_blosum"][i].reshape((21,21,1))
#mhc
bench_MHC_CNN = np.empty((len(benchmark_Data),34,21,1))
for i in range(len(benchmark_Data)):
    bench_MHC_CNN[i] = benchmark_Data["MHC_blosum"][i].reshape((34,21,1))
prediction = CNN.predict([bench_pep_CNN,bench_MHC_CNN])
benchmark_Data["prediction"] = prediction
benchmark_Data.pop("pep_blosum")
benchmark_Data.pop("MHC_blosum")
benchmark_Data.to_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/kmer_CNN_result.csv")
#CNN result
peptide = pd.read_csv("../data/peptide.csv")
result = pd.read_csv("../data/15mer_CNN_result.csv")
result_15mer_CNN = IMM_process(result,peptide)
result = auc_pr(result_15mer_CNN,"Immunogenicity","prediction")
method_type.append("CNN")
mean_15mer.append(result[4])
result = pd.read_csv("../data/kmer_CNN_result.csv")
result_15mer = IMM_process(result,peptide)
result = auc_pr(result_15mer,"Immunogenicity","prediction")
mean_kmer.append(result[4])
#DNN result
result = pd.read_csv("../data/15mer_DNN_result.csv")
result_15mer_DNN = IMM_process(result,peptide)
result = auc_pr(result_15mer_DNN,"Immunogenicity","prediction")
method_type.append("DNN")
mean_15mer.append(result[4])
result = pd.read_csv("../data/kmer_DNN_result.csv")
result_15mer = IMM_process(result,peptide)
result = auc_pr(result_15mer,"Immunogenicity","prediction")
mean_kmer.append(result[4])
DF = pd.DataFrame({"method":method_type,"mer15":mean_15mer,"kmer":mean_kmer})

Result

import seaborn as sns
from matplotlib import cm
n = 10
x = np.arange(0, n, step=.5)
y = x ** 2
norm = plt.Normalize(y.min(), y.max())
norm_y = norm(y)
map_vir = cm.get_cmap(name='viridis')
color = map_vir(norm_y)
fig,ax = plt.subplots(figsize = (4,6))
plt.bar(x = DF["method"],height = DF["mer15"],width =0.95,color = color[10:])
#> <BarContainer object of 7 artists>
ax.set_xlabel("")
#> Text(0.5, 0, '')
ax.set_ylabel("AUC",fontsize = 14)
#> Text(0, 0.5, 'AUC')
ax.set_ylim([0.5,0.75])
#> (0.5, 0.75)
plt.xticks(rotation = 90)
#> ([0, 1, 2, 3, 4, 5, 6], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
plt.tight_layout()
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
plt.savefig("../figure/different_methods_15mer.pdf",dpi = 300,bbox_inches='tight',transparent=True)
plt.show()

fig,ax = plt.subplots(figsize = (4,6))
plt.bar(x = DF["method"],height = DF["kmer"],width =0.95,color = color[10:])
#> <BarContainer object of 7 artists>
ax.set_xlabel("")
#> Text(0.5, 0, '')
ax.set_ylabel("AUC",fontsize = 14)
#> Text(0, 0.5, 'AUC')
ax.set_ylim([0.5,0.75])
#> (0.5, 0.75)
plt.xticks(rotation = 90)
#> ([0, 1, 2, 3, 4, 5, 6], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
plt.tight_layout()
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
plt.savefig("../figure/different_methods_kmer.pdf",dpi = 300,bbox_inches='tight',transparent=True)
plt.show()

TLimmuno2 outperforms the DNN, the CNN and the traditional machine learning methods, suggesting that our model structure is better suited to the immunogenicity prediction problem.

TLimmuno2 can learn important features that determine immunogenicity

Immunotherapy

To further validate TLimmuno2 and demonstrate its value as a knowledge discovery tool, we evaluated the physiological relevance of TLimmuno2's predictions. Most current cancer vaccine platforms preferentially screen candidate neoepitopes for vaccine production by selecting highly expressed candidates with high binding affinity to HLA alleles. However, despite rigorous candidate selection, many vaccine peptides do not elicit T-cell responses after vaccination. We therefore compared the performance of TLimmuno2 and netMHCIIpan on personalized melanoma neoepitopes with corresponding immune response data.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42
plt.rcParams.update({'font.family':'Arial'})
from sklearn.metrics import accuracy_score, recall_score, f1_score, precision_score
def IMM_process_rank(Data,peptide):
    sam_pep_ID = Data["pep_ID"].unique()
    res = []
    p_ID = []
    Data["length"] = Data["pep"].map(len)
    for a in sam_pep_ID:
        Mean = []    
        ID = Data[Data["pep_ID"] == a]
        for i in ID["HLA"].unique():
            Data_HLA = ID[ID["HLA"] == i]
            prediction = []
            for l in Data_HLA["length"].unique():
                Data_len = Data_HLA[Data_HLA["length"] == l]
                pre = Data_len["Rank"].min()
                prediction.append(pre)
            x_mean = np.min(prediction)
            Mean.append(x_mean)
        res.append(min(Mean))
        p_ID.append(a)
    result = pd.DataFrame({"pep_ID":p_ID,"prediction":res})
    result = result.merge(peptide,how = "inner",on = "pep_ID" )
    return result
def netMHCIIpan_process(result,allele,ori_sorted,peptide):
    #combine two type allele
    result = pd.merge(result,allele,left_on="allele",right_on="Allele Name")
    result_sorted = result.sort_values(by=["peptide","HLA"]).reset_index(drop = True)
    #combine with ori data
    result_combined = pd.merge(result_sorted,ori_sorted[["pep","pep_ID"]],left_on = "peptide",right_on ="pep")
    #result_combined["pep_length"] = result_combined["peptide"].map(len)
    sam_pep_ID = result_combined["pep_ID"].unique()
    res = []
    p_ID = []
    for a in sam_pep_ID:
        Mean = []
        ID = result_combined[result_combined["pep_ID"] == a]
        #get Mean of all splited pep and Max of all HLA
        for i in ID["Allele Name"].unique():
            Data_HLA = ID[ID["Allele Name"] == i]
            prediction = []
            for l in Data_HLA["pep_length"].unique():
                Data_len = Data_HLA[Data_HLA["pep_length"] == l]
                pre = Data_len["percentile_rank"].min()
                prediction.append(pre)
            x = np.min(prediction)
            Mean.append(x)
        res.append(np.min(Mean))
        p_ID.append(a) 
    final_result = pd.DataFrame({"pep_ID":p_ID,"prediction":res})
    final_result = final_result.merge(peptide,how = "inner",on = "pep_ID" )
    return final_result 
patient_id = ["O.1","O.2","O.3","O.4","O.5","O.6"]
DF15 = pd.read_csv("../data/DF15.csv")
result_15 = pd.read_csv("../data/15mer_rank_result.csv")
peptide= pd.read_csv("../data/peptide.csv")
Rank_15mer = IMM_process_rank(result_15,peptide)
DF15_sorted = DF15.sort_values(by=["pep","HLA"]).reset_index(drop = True)
allele = pd.read_csv("../data/allele.csv")
netMHCIIpan_ba = pd.read_csv("../data/netMHCIIpan_15_mer_ba.csv")
netMHCIIpan_ba["pep_length"] = netMHCIIpan_ba["peptide"].map(len)
net_15mer_ba = netMHCIIpan_process(netMHCIIpan_ba,allele,DF15_sorted,peptide)
netMHCIIpan_el = pd.read_csv("../data/netMHCIIpan_15_mer_el.csv")
netMHCIIpan_el["pep_length"] = netMHCIIpan_el["peptide"].map(len)
net_15mer_el = netMHCIIpan_process(netMHCIIpan_el,allele,DF15_sorted,peptide)
ott_net_ba = net_15mer_ba[net_15mer_ba["Patient_ID"].map(lambda x : x in patient_id)]
ott_net_el = net_15mer_el[net_15mer_el["Patient_ID"].map(lambda x : x in patient_id)]
ott_imm = Rank_15mer[Rank_15mer["Patient_ID"].map(lambda x : x in patient_id)]
def call_value(data,net = True):
  if net:
    preds = data["prediction"].map(lambda x : 1 if x <=2 else 0)
  else:
    preds = data["prediction"].map(lambda x : 1 if x <=0.02 else 0)
  trues = data["Immunogenicity"]
  acc = accuracy_score(trues, preds)
  p = precision_score(trues, preds)
  r = recall_score(trues, preds)
  f1= f1_score(trues, preds)
  
  return acc,p,r,f1
#top20&top50
from collections import Counter
ott_imm_sorted = ott_imm.sort_values(by="prediction")
ott_net_el_sorted = ott_net_el.sort_values(by="prediction")
ott_net_ba_sorted = ott_net_ba.sort_values(by="prediction")
ott_imm_sorted["pre_label"] = ott_imm_sorted["prediction"].map(lambda x : 1 if x <=0.02 else 0)
ott_net_el_sorted["pre_label"] = ott_net_el_sorted["prediction"].map(lambda x : 1 if x <=2 else 0)
ott_net_ba_sorted["pre_label"] = ott_net_ba_sorted["prediction"].map(lambda x : 1 if x <=2 else 0)
def call_top(x):
  x20 = x[0:20]
  top20 = Counter(x20["Immunogenicity"] == x20["pre_label"])[True]
  x50 = x[0:50]
  top50 = Counter(x50["Immunogenicity"] == x50["pre_label"])[True]
  
  return top20,top50
imm20,imm50 = call_top(ott_imm_sorted)
net_ba20,net_ba50 = call_top(ott_net_ba_sorted)
net_el20,net_el50 = call_top(ott_net_el_sorted)
top20 = [imm20,net_ba20,net_el20]
top50 = [imm50,net_ba50,net_el50]
imm = call_value(ott_imm,False)
net_ba = call_value(ott_net_ba)
net_el = call_value(ott_net_el)
data = pd.DataFrame([imm,net_ba,net_el], index = ["TLimmuno2","NetMHCIIpan_BA","NetMHCIIpan_EL"],columns = ["Accuracy","Precision","Recall","F1-score"])
data["Method"] = data.index
data = data.melt(id_vars = "Method")
fig,ax = plt.subplots(figsize = (6,4))
sns.barplot(x = "variable",y = "value", hue = "Method",data = data,palette = "Set2")
#> <AxesSubplot:xlabel='variable', ylabel='value'>
ax.set_xlabel("")
#> Text(0.5, 0, '')
ax.set_ylabel("Score",fontsize = 14)
#> Text(0, 0.5, 'Score')
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
plt.savefig("../figure/neoantigen_vaccine.pdf",dpi = 300,transparent=True)
plt.show()

data = pd.DataFrame([top20,top50], columns = ["TLimmuno2","netMHCIIpan_ba","netMHCIIpan_el"],index = ["Top20","Top50"])
data["Method"] = data.index
data = data.melt(id_vars = "Method")
fig,ax = plt.subplots(figsize = (3,4))
sns.barplot(x = "Method",y = "value", hue = "variable",data = data,palette = "Set2")
#> <AxesSubplot:xlabel='Method', ylabel='value'>
ax.set_xlabel("")
#> Text(0.5, 0, '')
ax.get_legend().remove()
ax.set_ylabel("Immunogenic peptides",fontsize = 14)
#> Text(0, 0.5, 'Immunogenic peptides')
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
plt.savefig("../figure/neoantigen_top20&50.pdf",dpi = 300,bbox_inches='tight',transparent=True)
plt.show()

As shown in the figure, TLimmuno2 performs better than the binding affinity model in accuracy, recall and F1 score when a 2% rank cutoff is used. The TopK metric (K = 20 or 50; the number of positive samples among the top K samples with the lowest predicted ranks) is also used to evaluate the different models, and TLimmuno2 shows improved performance compared with netMHCIIpan BA on these TopK metrics.

Immunoediting

The interactions between immune cells and tumor cells are reflected in immunoediting, which can mediate negative selection against DNA alterations encoding highly immunogenic peptides. The status of MHC class II neoantigen-mediated negative selection in cancer evolution is still unknown. We next applied TLimmuno2 to predict the immunogenicity of neopeptides arising from missense mutations in TCGA samples.

We then detected immunoediting signals in TCGA samples by using the previously published CCF enrichment score (ESccf) method.

We downloaded the somatic mutation files of 33 cancer types from UCSC Xena, focusing on the “MuTect2 Variant Aggregation and Masking” workflow. Here are the details of the data processing:

#R
TCGA_data = read_csv("../data/cancer_type.csv")
DT::datatable(TCGA_data,rownames = FALSE)
  1. From MAF files to mutated peptides (mut_pep):
#python
#PBS_batch
import pandas as pd
import numpy as np
import os
import re
import sys
file = sys.argv[1]
maf_file_path =  "/public/slst/home/wanggsh/TCGA-process/Maf"
def maf2vcf(MAF_file,tumor_file_path):
    for i in MAF_file["Tumor_Sample_Barcode"].unique():
        split_maf_file = MAF_file[MAF_file["Tumor_Sample_Barcode"] == i]
        split_maf_file.to_csv("{0}/Tmp/{1}.maf".format(tumor_file_path,i),sep="\t",index=False)
        # maf2vcf comd
        comd1 ="""
        perl /public/home/liuxs/anaconda3/envs/neo/bin/maf2vcf.pl \
        --input-maf "{0}/Tmp/{1}.maf" \
        --output-dir {0}/VCF \
        --output-vcf {0}/VCF/{1}.vcf \
        --ref-fasta /public/home/liuxs/biodata/reference/genome/hg38/hg38.fa
        """.format(tumor_file_path,i)
        os.system(comd1)
def vcf2pep(tumor_file_path):
    for i in os.listdir("{0}/VCF".format(tumor_file_path)):
        print("Get pep from vcf in sample {0}".format(i))
        os.system("Rscript /public/slst/home/wanggsh/TCGA-process/vcf2pep.R {0}/VCF/{1} {0}/Pep/{1}.csv".format(tumor_file_path,i))
MAF_file = pd.read_table("{0}/{1}".format(maf_file_path,file),comment="#")
tumor_type = re.split("\.",file)[0]
file_path = "/public/slst/home/wanggsh/TCGA-process/Res/{0}".format(tumor_type)
os.system("mkdir {0}".format(file_path))
os.system("mkdir {0}/Tmp".format(file_path))
os.system("mkdir {0}/VCF".format(file_path))
print("maf to vcf in sample {0}".format(file))
maf2vcf(MAF_file,file_path)
os.system("rm {0}/VCF/*.tsv".format(file_path))
os.system("mkdir {0}/Pep".format(file_path))
vcf2pep(file_path)

Here is the vcf2pep.R script; we only keep missense mutations when extracting peptide sequences.

#vcf2pep.R
#code from WuTao
library(tidyverse)
library(stringr)
args = commandArgs(trailingOnly = TRUE)

print(args)
annovar_path = "/public/slst/home/wanggsh/biosoft/annovar"


vcf2annova <- function(annovar_path,vcf_path,out_file,
                      need_allsamples=TRUE,need_samples){

  need_allsamples <- match.arg(as.character(need_allsamples),choices = c("TRUE","FALSE"))

  if (need_allsamples){
    commond <- paste0(annovar_path,"/convert2annovar.pl -format vcf4 ",vcf_path," -outfile ",out_file," -allsample -withfreq")
    system(command = commond)
  }else{
    commond <- paste0(annovar_path,"/convert2annovar.pl -format vcf4 ",vcf_path," -outfile ",out_file," -allsample")
    system(command = commond)
    need_samples_file <- paste0("cat ",out_file,".",need_samples,".avinput >> ",out_file)
    system(need_samples_file)
  }
}
vcf2pep <- function(annovar_path,vcf_path,
                    genome_version=c("hg19","hg38"),need_allsamples=TRUE,need_samples,num_thread){
  need_allsamples <- match.arg(as.character(need_allsamples),choices = c("TRUE","FALSE"))
  genome_version <- match.arg(genome_version)
  temp_file <- tempfile()
  temp_dir <- tempdir()
  file.create(temp_file)
  vcf2annova(annovar_path = annovar_path,vcf_path = vcf_path,out_file = temp_file,need_allsamples = need_allsamples,
             need_samples = need_samples)

  mut <- read.table(temp_file)
  mut <- mut[,1:5]
  colnames(mut) <- c("chr","start","end","ref","alt")
  comm_annotate <- paste0(annovar_path,"/table_annovar.pl ",temp_file," ",annovar_path,"/humandb/",
                          " -build ",genome_version,
                          " --outfile ",temp_dir,"/myanno",
                          ' -protocol refGene -operation g  --codingarg "-includesnp -onlyAltering"',
                          " --thread ",num_thread)
  system(comm_annotate)

  dt <- seqinr::read.fasta(file = paste0(temp_dir,"/myanno.refGene.fa"),
                           seqtype = "AA",as.string = TRUE,whole.header = T)

  variants <- names(dt)[seq(2,length(dt),by=2)]
  variants <- variants[grepl("protein-altering",variants)]
  pos_alter <- stringr::str_extract_all(variants,"p[.].+ protein-altering") %>% gsub(" protein-altering","",.)
  cdna <- stringr::str_extract_all(variants,"c[.].+ p[.]") %>% gsub(" p[.]","",.)
  lines <- stringr::str_extract_all(variants,"line.+ NM_") %>% gsub(" NM_","",.) %>%
    gsub("line","",.) %>% as.numeric()
  ENST <- stringr::str_extract_all(variants,"line.+ NM_[0-9]+") %>% gsub("line.+ ","",.)##extract transcripts

  mut_need <- mut[lines,]
  pep_seq_mt <- dt[sapply(variants,function(x){which(names(dt)==x)}) %>% unname()] %>% as.character()
  pep_seq_wt <- dt[(sapply(variants,function(x){which(names(dt)==x)}) %>% unname())-1] %>% as.character()
  mut_need$seq_wt <- pep_seq_wt
  mut_need$seq_mt <- pep_seq_mt
  mut_need$pos_alter <- pos_alter
  mut_need$cdna <- cdna
  mut_need$transcript <- ENST
  files_exist <- c(list.files(temp_dir,pattern = gsub(paste0(temp_dir,"/"),"",temp_file),full.names = T),
                   list.files(temp_dir,pattern = "myanno",full.names = T))##remove all temp files
  file.remove(files_exist)
  return(mut_need)
}

extractSeq <- function(seq,pos,len,indel=FALSE){
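  # keep the residues within len-1 positions of the mutated site `pos` (clipped
  # at the protein ends), so every len-mer covering the mutation is retained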
  indel <- match.arg(as.character(indel),choices = c("TRUE","FALSE"))
  if (pos < len){
    ext_seq <- substr(seq,1,(pos+len-1))
  }else if ((nchar(seq)-pos)<(len-1)){
    ext_seq <- substr(seq,(pos-len+1),nchar(seq))
  }else{
    ext_seq <- substr(seq,(pos-len+1),(pos+len-1))
  }

  if (indel){
    if (pos < len){
      ext_seq <- substr(seq,1,nchar(seq))
    }else{
      ext_seq <- substr(seq,(pos-len+1),nchar(seq))
    }
  }
  return(ext_seq)
}
vcf2seq <- function(annovar_path,vcf_path,
                    genome_version=c("hg19","hg38"),need_allsamples=TRUE,need_samples,len,num_thread){

  pep <- vcf2pep(annovar_path = annovar_path,vcf_path = vcf_path,
                 genome_version = genome_version,need_allsamples = need_allsamples,
                 need_samples = need_samples,num_thread=num_thread)

  pep <- pep %>%
    dplyr::mutate(pos = stringr::str_extract(pos_alter,"[0-9]+") %>% as.numeric()) %>%
    dplyr::rowwise() %>%
    dplyr::mutate(indel=ifelse(ref != "-" & alt != "-" & nchar(ref) == nchar(alt),"FALSE","TRUE")) %>%
    as.data.frame()
  ext_seqs_mt <- mapply(extractSeq,seq=pep$seq_mt,pos=as.numeric(pep$pos),
                        len=as.numeric(len),indel=pep$indel) %>% unname()
  ext_seqs_wt <- mapply(extractSeq,seq=pep$seq_wt,pos=as.numeric(pep$pos),
                        len=as.numeric(len),indel=pep$indel) %>% unname()
  ##remove the last star
  a <-  substr(ext_seqs_mt,nchar(ext_seqs_mt),nchar(ext_seqs_mt))=="*"
  b <-  substr(ext_seqs_wt,nchar(ext_seqs_wt),nchar(ext_seqs_wt))=="*"
  ext_seqs_mt[a] <- gsub("\\*","",ext_seqs_mt[a])
  ext_seqs_wt[b] <- gsub("\\*","",ext_seqs_wt[b])

  pep$ext_seqs_wt <- ext_seqs_wt
  pep$ext_seqs_mt <- ext_seqs_mt
  return(pep)
}
#vcf_path = "/public/slst/home/wanggsh/TCGA-process/Res/TCGA_ACC/VCF/TCGA-OR-A5J1-01A-11D-A29I-10.vcf"
#pep_file = "/public/slst/home/wanggsh/TCGA-process/Res/TCGA_ACC/Pep/result.csv"
vcf_path = args[1]
pep_file = args[2]
tumor_sample_barcode = unlist(strsplit(vcf_path,split ="/"))
tumor_sample_barcode = tumor_sample_barcode[length(tumor_sample_barcode)]
tumor_sample_barcode = unlist(strsplit(tumor_sample_barcode,split ="\\."))[1]
pep <- vcf2seq(annovar_path = annovar_path,vcf_path=vcf_path,genome_version = "hg38",need_allsamples =TRUE,len = 15,num_thread=1)


pep = pep[pep$indel== FALSE,]
pep = select(pep,-c("seq_wt","seq_mt"))
pep$tumor_sample_barcode = tumor_sample_barcode
write_csv(pep,pep_file,col_names = FALSE)
  2. Split mutated peptides into 15-mer subsequences:
#python
#PBS
import pandas as pd
import os
import re
file_path = "/public/slst/home/wanggsh/TCGA-process/Res"
tumor_type = os.listdir(file_path)
colnames = ["chr","start","end","ref","alt","pos_alter","cdna","transcript","pos","indel","ext_seqs_wt","ext_seqs_mt","tumor_sample_barcode"]
def mer15(Pep):
    P = []
    Length = 15
    for i in range(len(Pep) - Length +1):
        pep = Pep[i:i+Length]
        P.append(pep)
    return P
for i in tumor_type:
    DF = pd.DataFrame()
    pep_file_path = "/public/slst/home/wanggsh/TCGA-process/Res/{0}/Pep".format(i)
    pep_file = os.listdir(pep_file_path)
    for x in pep_file:
        tumor_barcode = re.split("\.",x)[0]
        pep = pd.read_csv("{0}/{1}".format(pep_file_path,x),names = colnames)
        pep = pep[["chr","start","end","ref","alt","ext_seqs_mt","tumor_sample_barcode"]]
        pep = pep.drop_duplicates()
        pep["length"] = pep["ext_seqs_mt"].map(len)
        pep = pep.query("length >=15") 
        Pep15 = []
        Pep_num = []
        Chrom = []
        Start = []
        End = []
        Refer = []
        Alt = []
        pep = pep.reset_index(drop=True)
        for y in range(len(pep)):
            res = mer15(pep["ext_seqs_mt"][y])
            pep_num = ["Pep{0}".format(y+1)] *len(res)
            chrom = [pep["chr"][y]] *len(res)
            start = [pep["start"][y]] *len(res)
            end = [pep["end"][y]] *len(res)
            refer = [pep["ref"][y]] *len(res)
            alt = [pep["alt"][y]] *len(res)
            Pep15.extend(res)
            Pep_num.extend(pep_num)
            Chrom.extend(chrom)
            Start.extend(start)
            End.extend(end)
            Refer.extend(refer)
            Alt.extend(alt)
        df = pd.DataFrame({"chr":Chrom,"start":Start,"end":End,"ref":Refer,"alt":Alt,"ext_seqs_mt":Pep15,"Pep_num":Pep_num})
        df["tumor_sample_barcode"] = tumor_barcode
        DF = pd.concat([DF,df])
    print("{0} Finish".format(i))
    DF = DF.drop_duplicates()
    DF.to_csv("/public/slst/home/wanggsh/TCGA-process/mut_pep2/{0}_pep.csv".format(i))
  3. Predict peptide scores using TLimmuno2. HLA typing data were downloaded from the study “Benchmarking HLA genotyping and clarifying HLA impact on survival in tumor immunotherapy” and transformed into the format required by the neoantigen prediction tools. We predicted the HLA-DR immunogenicity score of each mutated peptide subsequence.
#python
#PBS
import pandas as pd
import numpy as np
import tensorflow as tf
import sys
sys.path.append("/public/slst/home/wanggsh/Project/MHCII/utils")
from Blosum62 import blosum62
import os
import re
os.environ["CUDA_VISIBLE_DEVICES"]="-1" 
#load_model

#strategy = tf.distribute.MirroredStrategy() #use mutiple GPU
BA_model = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
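# expose the penultimate layer of the BA model; its output is the
# transfer-learning feature that TLimmuno2 takes as a third input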
BAmodel = tf.keras.models.Model(inputs = BA_model.input,outputs = BA_model.layers[-2].output)
IMM = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/MHCII_immuno/model")
#mutant pep file

mut_pep_file_path ="/public/slst/home/wanggsh/TCGA-process/mut_pep2"

MHC_II = pd.read_csv("/public/slst/home/wanggsh/TCGA-process/TCGA-MHCII-allele.csv")
pseudo_seq = pd.read_table("/public/slst/home/wanggsh/Data/MHCII/pseudosequence.2016.all.X.dat",names=["hla","sequence"])
i = sys.argv[1]
tumor = re.sub("_pep.csv","",i)
print("{0} Start".format(tumor))
Data = pd.read_csv("{0}/{1}".format(mut_pep_file_path,i))
MHC_II_smal = MHC_II[MHC_II["project_id"] == tumor]
Data["Sample_Barcode"] = Data["tumor_sample_barcode"].map(lambda x : x[0:12])
Data = pd.merge(Data,MHC_II_smal[["Sample_Barcode","hla"]])
Data = pd.merge(Data,pseudo_seq)
Data["pep_blosum"] = Data["ext_seqs_mt"].apply(blosum62,args=(21,))
Data["MHC_blosum"] = Data["sequence"].apply(blosum62,args=(34,))
peptide = np.empty((len(Data),21,21))
for i in range(len(Data)):
    peptide[i] = Data["pep_blosum"][i].reshape((21,21))

MHC = np.empty((len(Data),34,21))
for i in range(len(Data)):
    MHC[i] = Data["MHC_blosum"][i].reshape((34,21))

BA = BAmodel.predict([peptide,MHC])
IMM_result = IMM.predict([peptide,MHC,BA])
Data["prediction"] = IMM_result
Data.pop("Unnamed: 0")
Data.pop("sequence")
Data.pop("pep_blosum")
Data.pop("MHC_blosum")
Data.to_csv("/public/slst/home/wanggsh/TCGA-process/Prediction/IMM2/{0}.csv".format(tumor))
print("{0} Finish".format(tumor))

4. Background peptide prediction

#PBS
#python
import pandas as pd
import sys
import numpy as np
sys.path.append("/public/slst/home/wanggsh/Project/MHCII/utils")
from Blosum62 import blosum62
import os
#os.environ["CUDA_VISIBLE_DEVICES"]="-1" 
import tensorflow as tf
BA_model = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/new_BA_model/model1")
BAmodel = tf.keras.models.Model(inputs = BA_model.input,outputs = BA_model.layers[-2].output)
IMM = tf.keras.models.load_model("/public/slst/home/wanggsh/Project/Saved_model/MHCII_immuno/model")
TCGA_allele = pd.read_csv("/public/slst/home/wanggsh/TCGA-process/TCGA-MHCII-allele.csv")
pseudo_seq2 = pd.read_feather("/public/slst/home/wanggsh/Data/pseudo_blosum62.feather")
IMM_bg_pep = pd.read_csv("/public/slst/home/wanggsh/Data/MHCII_immuno/IMM_bg_pep.csv")
IMM_bg_pep["pep_blosum"] = IMM_bg_pep["pep"].apply(blosum62,args=(21,))
inteset = np.intersect1d(TCGA_allele["hla"].values,pseudo_seq2["MHC"].values)
#np.setdiff1d(TCGA_allele["hla"].values,pseudo_seq2["MHC"].values)
for i in inteset:
    DF = pd.DataFrame()
    IMM_bg_pep["MHC"] = i
    x = pd.merge(IMM_bg_pep,pseudo_seq2[["MHC","MHC_blosum"]])
    DF = pd.concat([DF,x])
    peptide = np.empty((len(DF),21,21))
    for z in range(len(DF)):
        peptide[z] = DF["pep_blosum"][z].reshape((21,21))
    MHC = np.empty((len(DF),34,21))
    for z in range(len(DF)):
        MHC[z] = x["MHC_blosum"][z].reshape((34,21))
    BA = BAmodel.predict([peptide,MHC])
    IMM_result = IMM.predict([peptide,MHC,BA])
    DF["Prediction"] = IMM_result
    DF = DF.drop(["Unnamed: 0","pep_blosum","MHC_blosum"],axis=1)
    DF.to_csv("/public/slst/home/wanggsh/TCGA-process/back_ground_pep/{0}.csv".format(i))

5. Ranking: use the background peptides to convert prediction scores into percentile ranks

#PBS_batch
#python
import pandas as pd
import numpy as np
import tensorflow as tf
import sys
sys.path.append("/public/slst/home/wanggsh/Project/MHCII/utils")
from Blosum62 import blosum62
import os
import re
os.environ["CUDA_VISIBLE_DEVICES"]="-1" 

IMM_resul_path = "/public/slst/home/wanggsh/TCGA-process/Prediction/IMM"
background_path = "/public/slst/home/wanggsh/TCGA-process/back_ground_pep"
file = os.listdir(IMM_resul_path)
arg = sys.argv[1]
res = pd.read_csv("{}/{}".format(IMM_resul_path,arg))
DF = pd.DataFrame()
for y in res["hla"].unique():
    res_all = res[res["hla"]==y]
    back_ground = pd.read_csv("{}/{}.csv".format(background_path,y))
    back_ground = back_ground["Prediction"].values.tolist()
    Rank = []
    for I in res_all["prediction"].values:
        back_ground.append(I)
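        # percentile rank against the 90,000 background peptides
        # (denominator 90001 = background peptides + the query itself)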
        rank = 1-(sorted(back_ground).index(back_ground[-1])+1)/90001
        Rank.append(rank)
        back_ground.pop()
    res_all["Rank"] = Rank
    DF = pd.concat([DF,res_all])
DF.to_csv("/public/slst/home/wanggsh/TCGA-process/Prediction/IMM_rank/{0}_rank.csv".format(arg))
#python
#PBS
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import numpy as np

file_path = "/public/slst/home/wanggsh/TCGA-process/Prediction/IMM_rank2"
file = os.listdir(file_path)

Barcode = []
Rank = []
for i in file:
    Data = pd.read_csv("{}/{}".format(file_path,i))
    tumor = re.split(".csv_rank.csv",i)[0]
    Barcode.append(tumor)
    Rank.append(Data["Rank"])
Rank = np.array(Rank)
np.save("/public/slst/home/wanggsh/TCGA-process/Prediction/TCGA_rank.npy",Rank)
barcode = []
for i in Barcode:
    x = re.split("TCGA_",i)[1]
    barcode.append(x)
fig,axs = plt.subplots(figsize = (12,6))
plt.boxplot(Rank,labels=barcode,showcaps=False)
plt.xticks(rotation = 270)
plt.hlines(0.02,colors="red",xmin=0.9,xmax=33.1)
plt.ylabel("Prediction rank")
plt.savefig("/public/slst/home/wanggsh/TCGA-process/TCGA-rank.png",dpi = 300,bbox_inches='tight',transparent=True)

Barcode = []
Prediction = []
for i in file:
    Data = pd.read_csv("{}/{}".format(file_path,i))
    tumor = re.split(".csv_rank.csv",i)[0]
    Barcode.append(tumor)
    Prediction.append(Data["prediction"])
Prediction = np.array(Prediction)
np.save("/public/slst/home/wanggsh/TCGA-process/Prediction/TCGA_prediction.npy",Prediction)

fig,axs = plt.subplots(figsize = (12,6))
plt.boxplot(Prediction,labels=barcode,flierprops = dict(marker='o', markersize=2,linestyle='none'),showcaps=False,showfliers = False)
plt.xticks(rotation = 270)
plt.ylabel("Prediction score")
plt.savefig("/public/slst/home/wanggsh/TCGA-process/TCGA-prediction.png",dpi = 300,bbox_inches='tight',transparent=True)

fig,axs = plt.subplots(figsize = (12,6))
plt.boxplot(Prediction,labels=barcode,flierprops = dict(marker='o', markersize=2,linestyle='none'),showcaps=False)
plt.xticks(rotation = 270)
plt.ylabel("Prediction score")
plt.savefig("/public/slst/home/wanggsh/TCGA-process/TCGA-prediction-with-Discretevalues.png",dpi = 300,bbox_inches='tight',transparent=True)

TCGA-Rank

TCGA-prediction-score-withoutdiscretevalue

TCGA-prediction-score

By comparing the prediction scores and rank values across TCGA samples, we found that most scores are below 0.1 and that the median rank value is around 0.5, suggesting that most mutant peptides are non-immunogenic.

Comparing MHC I BA, MHC II BA and immunogenicity

To evaluate the extent to which TLimmuno2 learned a signal that is also captured by the binding affinity predictor, we measured the Pearson correlation between immunogenicity rank and binding affinity rank.

  1. Combine the MHC I binding affinity rank values with the MHC II immunogenicity rank values:
setwd("~/Project/MHCII/Data/TCGA-MHC-I/")
files <- list.files("/home/data/NeoPredPipe_results/",full.names = T,pattern = "batch")
final_result = tibble()
for (i in 1:100) {
  tt <- data.table::fread(files[i],sep = "\t",fill = TRUE)
  tt <- tt %>% dplyr::select(c(1,4:7,11,12,22,24))
  colnames(tt) <- c("sample","chr","position","ref","alt","hla","pep","rank_el","rank_ba")
  #tt$hla <-str_replace(tt$hla,"\\:","")
  sample = c()
  chr = c()
  pos = c()
  Rank = c()
  HLA = c()
  for (x in unique(tt$sample)){
    tt_sample = tt[tt$sample == x, ]
    for (y in unique(tt_sample$chr)){
      tt_chr = tt_sample[tt_sample$chr == y,]
      for (z in unique(tt_chr$position)){
        tt_pos = tt_chr[tt_chr$position == z,]
        prediction = c()
        for (h in unique(tt_chr$hla)){
          tt_hla = tt_pos[tt_pos$hla == h,]
          pre = min(tt_hla$rank_ba)
          prediction = c(prediction,pre)
        }
        rank = min(prediction)
        sample = c(sample,x)
        chr = c(chr,y)
        pos = c(pos,z)
        HLA = c(HLA,h)
        Rank = c(Rank,rank)
      }
    }
  }
  res = tibble(sample,chr,pos,HLA,Rank)
  final_result = rbind(final_result,res)
  message("complete ", i)
}
saveRDS(final_result,"MHCI_BA.rds")
#MHCII_BA
files <- list.files("~/Project/MHCII/Data/TCGA/neoantigen_es",full.names = T)
res <- lapply(files,
                function(x){
                  read_csv(x)
                })
MHC_II_imm <- bind_rows(res)
files <- list.files("~/Project/MHCII/Data/TCGA/net_neoantigen_es",full.names = T)

res <- lapply(files,
              function(x){
                read_csv(x)
              })
MHC_II_BA <- bind_rows(res)
MHC_I_BA <- readRDS("MHCI_BA.rds")
MHC_I_BA$sample <- str_sub(MHC_I_BA$sample,end = 12)
MHC_I_BA <- MHC_I_BA %>% select(sample,chr,pos,Rank)%>% rename("MHCI_BA_Rank" = "Rank") %>% rename("sample_barcode" = "sample")
MHC_II_imm <- MHC_II_imm %>% select(Sample_Barcode,chr,start,Result)
colnames(MHC_II_imm) <- c("sample_barcode","chr","pos","MHCII_IMM_Rank")
MHC_II_BA <- MHC_II_BA %>% select(sample,chr,start,Result)
colnames(MHC_II_BA) <- c("sample_barcode","chr","pos","MHCII_BA_Rank")
MHCIBA_IMM <- left_join(MHC_I_BA,MHC_II_imm)
MHCIIBA_IMM <- left_join(MHC_II_BA,MHC_II_imm)
MHCIIBA_IMM$MHCII_BA_Rank <- MHCIIBA_IMM$MHCII_BA_Rank/100
saveRDS(MHCIBA_IMM,"MHCIBA_IMM.rds")
saveRDS(MHCIIBA_IMM,"MHCIIBA_IMM.rds")
  2. Calculate the Pearson correlation coefficient between binding affinity and immunogenicity:
MHCIBA_IMM <- readRDS("../data/MHCIBA_IMM.rds")
MHCIIBA_IMM <- readRDS("../data/MHCIIBA_IMM.rds")
MHCIBA_IMM <- na.omit(MHCIBA_IMM)
MHCIIBA_IMM <- na.omit(MHCIIBA_IMM)
MHCIBA_IMM <- rename(MHCIBA_IMM,BA_Rank = MHCI_BA_Rank,IMM_Rank = MHCII_IMM_Rank)
MHCIIBA_IMM <- rename(MHCIIBA_IMM,BA_Rank = MHCII_BA_Rank,IMM_Rank = MHCII_IMM_Rank)
split_sample <- function(data) {
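  # per-sample Pearson correlation between BA rank and immunogenicity rank;
  # samples with two or fewer mutations are skipped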
  r = c()
  p = c()
  for (i in unique(data$sample_barcode)) {
    sam <- data[data$sample_barcode == i,]
    if (nrow(sam)<=2) {
      next
    }
    
    cor_test = cor.test(sam$BA_Rank/100,sam$IMM_Rank)
    r <- c(r,cor_test$estimate)
    p <- c(p,cor_test$p.value)
  }
  r2 <- r*r
  
  padj <- p.adjust(p, method = "BH")
  x <- tibble(r,p,r2,padj)
  return(x)
}

MHCI <- split_sample(MHCIBA_IMM)
MHCI$Type <- "MHCI"
median(MHCI$r2)
#> [1] 0.09680865
MHCII <- split_sample(MHCIIBA_IMM)
MHCII$Type <- "MHCII"
median(MHCII$r2)
#> [1] 0.2057099
result <- rbind(MHCI,MHCII)
ggplot(result,aes(r,fill = Type)) + 
  geom_density(alpha = 0.6) + theme_bw() + theme(axis.line = element_line(colour = "black"),
                                             panel.grid.major = element_blank(),
                                             panel.grid.minor = element_blank(),
                                             panel.border = element_blank(),
                                             panel.background = element_blank())

ggsave("../figure/BA_IMM_r.pdf",dpi = 300,width = 6,height = 6)
ggplot(result,aes(r2,fill = Type)) + 
  geom_density(alpha = 0.6) + theme_bw()+ theme(axis.line = element_line(colour = "black"),
                                 panel.grid.major = element_blank(),
                                 panel.grid.minor = element_blank(),
                                 panel.border = element_blank(),
                                 panel.background = element_blank())+xlab(bquote(R^2))


ggsave("../figure/BA_IMM_r2.pdf",dpi = 300,width = 6,height = 6)
ggplot(result,aes(padj,fill = Type)) + 
  geom_density(alpha = 0.6) + theme_bw()+ theme(axis.line = element_line(colour = "black"),
                                                panel.grid.major = element_blank(),
                                                panel.grid.minor = element_blank(),
                                                panel.border = element_blank(),
                                                panel.background = element_blank())

ggsave("../figure/BA_IMM_padj.pdf",dpi = 300,width = 6,height = 6)

The correlations were positive and significant but low in magnitude: median Pearson r2 = 0.0968 between the MHC I binding affinity rank and the MHC II immunogenicity rank, and median Pearson r2 = 0.2057 between the MHC II binding affinity rank and the MHC II immunogenicity rank; that is, binding affinity rank explains at most about 21% of the variance in TLimmuno2 rank. Overall, this analysis suggests that TLimmuno2 is at least partially non-redundant with the binding affinity predictors.

Immunoediting signal detection

We analyzed all neoepitopes in TCGA using TLimmuno2 and applied a 2% rank cutoff to call neoantigens; for each mutation, the best (minimum) rank across the patient's HLA alleles is compared against this cutoff.

#PBS_batch
#python
import os
import pandas as pd
import re
file_path3 = "/public/slst/home/wanggsh/TCGA-process/Prediction/IMM_rank2"
def get_neoantigen(Data):
    DF = pd.DataFrame()
    # for every mutation (sample, chr, start), take the best (minimum) rank
    # across the patient's HLA alleles as the mutation-level score
    for i in Data["Sample_Barcode"].unique():
        Data_sample = Data[Data["Sample_Barcode"] == i]
        for x in Data_sample["chr"].unique():
            Data_chr = Data_sample[Data_sample["chr"] == x]
            for start in Data_chr["start"].unique():
                Data_str = Data_chr[Data_chr["start"] == start].copy()
                prediction = []
                for y in Data_str["hla"].unique():
                    Data_hla = Data_str[Data_str["hla"] == y]
                    prediction.append(Data_hla["Rank"].min())
                Data_str["Result"] = min(prediction)
                Data_final = Data_str[["tumor_sample_barcode","Sample_Barcode","chr","start","end","ref","alt","Result"]]
                Data_final = Data_final.drop_duplicates()
                DF = pd.concat([DF,Data_final])
    # 2% rank cutoff: label = 1 marks a predicted neoantigen
    DF["label"] = DF["Result"].map(lambda x: 1 if x < 0.02 else 0)
    return DF
file = os.listdir(file_path3)
for i in file:
    tumor = re.split(r"\.csv_rank\.csv", i)[0]
    Data = pd.read_csv("{}/{}".format(file_path3, i))
    result = get_neoantigen(Data)
    # the output path was elided in the source; this filename is illustrative
    result.to_csv("{}_neoantigen.csv".format(tumor))

Then we can calculate the actual CCF-based enrichment score (ESccf) and compare its median with the median of simulated scores, at both the pan-cancer and the cancer-type level:
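The score itself is computed by NeoEnrichment::cal_nes_new_test in the code below; the following is only a minimal conceptual sketch of the GSEA-style idea it builds on, with a toy sign convention, not the package's actual implementation:

#R
# Conceptual sketch: order a sample's mutations by cancer cell fraction (CCF)
# and test whether neoantigens shift toward subclonal (low-CCF) values.
toy_es <- function(ccf, neo) {           # neo: logical, TRUE = neoantigen
  ord  <- order(ccf, decreasing = TRUE)  # walk mutations from clonal to subclonal
  hit  <- neo[ord]
  step <- ifelse(hit, 1 / sum(hit), -1 / sum(!hit))
  run  <- cumsum(step)                   # GSEA-like running sum
  run[which.max(abs(run))]               # signed maximum deviation = ES
}
set.seed(1)
toy_es(ccf = runif(50), neo = runif(50) < 0.2)
# under this toy convention, a negative ES indicates neoantigens concentrated
# at subclonal CCF, the depletion pattern expected under immunoediting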

#R
ccf <- readRDS("all_mut_mis_ccf.rds")
files <- list.files("~/Project/MHCII/Data/TCGA/neoantigen_es/",full.names = TRUE)
res <- lapply(files,
              function(x){
                read.csv(x) %>% select(-X)
              })
mut_all <- bind_rows(res)
saveRDS(mut_all,file = "mut_all.rds")

mut_all <- mut_all %>%
  mutate(index=paste(substr(tumor_sample_barcode,1,16),
                     chr,start,ref,alt,sep = ":")) %>%
  select(index,Result,label)
all_mut_ccf <- inner_join(
  mut_all,
  ccf %>% select(-sample),
  by="index"
)
all_mut_ccf <- all_mut_ccf[!duplicated(all_mut_ccf$index),] 
all_mut_ccf$sample <- substr(all_mut_ccf$index,1,16)
all_mut_ccf <- all_mut_ccf %>% rename(gene = Hugo_Symbol)
TPM_log2 <- data.table::fread("tcga_RSEM_gene_tpm.gz")  # too large, not included with this report
mapping <- data.table::fread("mapping_probe")
mapping <- mapping[,1:2]
TPM_log2_mapping <- left_join(TPM_log2 %>% rename(id=sample),mapping) %>%
  select(id,gene,everything())###log2(tpm+0.001)
TPM_log2_mapping <- as.data.frame(TPM_log2_mapping)
tpm_trans <- TPM_log2_mapping
tpm_trans[,3:ncol(tpm_trans)] <- apply(TPM_log2_mapping[,3:ncol(TPM_log2_mapping)],2,
                                       function(x){(2^x) - 0.001})
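# the apply() above back-transforms the matrix: the input stores
# log2(TPM + 0.001), so (2^x) - 0.001 recovers linear TPM values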
saveRDS(tpm_trans,file = "tpm_trans.rds")
tpm <- readRDS("tpm_trans.rds")
tpm <- tpm[!duplicated(tpm$gene),]

all_mut_ccf$tpm_exp <- mapply(function(sample,gene){
  tpm[tpm$gene==gene,substr(sample,1,15)]
},all_mut_ccf$sample,all_mut_ccf$gene)

all_mut_ccf_tpm <- all_mut_ccf %>%
  filter(lengths(tpm_exp)!=0)

all_mut_ccf_tpm$tpm_exp <- as.numeric(all_mut_ccf_tpm$tpm_exp)

all_mut_ccf_tpm <- all_mut_ccf_tpm %>%
  mutate(neo=ifelse(label==1 & tpm_exp>1,"neo","not_neo"))   # neoantigen: passes rank cutoff and is expressed (TPM > 1)

saveRDS(all_mut_ccf_tpm,file = "mut_ccf_tpm_neo.rds")
cal_nes_warp <- function(dt){
  results_ccf <- vector("list",length = length(unique(dt$sample)))
  names(results_ccf) <- unique(dt$sample)
  
  cl <- makeCluster(getOption("cl.cores", 15),type="FORK")
  results_ccf <- parSapply(cl=cl,names(results_ccf),
                           function(x){
                             data <- dt %>% filter(sample == x)
                             a <- NeoEnrichment::cal_nes_new_test(dt = data,
                                                                  sample_counts = 1000,
                                                                  need_p = FALSE)
                             return(a)
                           },simplify = FALSE)
  stopCluster(cl)
  results_ccf <- Filter(function(x){length(x)>1},results_ccf)
  pancancer_nes_ccf <- bind_rows(results_ccf)
  return(pancancer_nes_ccf)
}  #ES score

all_mut_ccf <- readRDS("../data/30_mut_ccf_tpm_neo.rds")
all_mut_ccf <- all_mut_ccf %>%
  rename(ccf=ccf_hat) %>%
  mutate(neo=ifelse(neo=="neo","yes","no"))

samples_has_subclonal <- all_mut_ccf %>%
  filter(ccf<0.6) %>%
  select(sample) %>%
  distinct(sample)   

all_mut_ccf  %>%
  filter(sample %in% samples_has_subclonal$sample) %>%
  group_by(sample) %>%
  summarise(c_n=sum(neo=="yes"),c_m=sum(neo=="no")) %>% filter(c_n>=1 & c_m >=1) -> summ

driver_mutations <- readRDS("../data/driver_mutations.rds")
#reduce driver mutation 
all_mut_ccf <- all_mut_ccf %>%
  mutate(gene_protein_change=paste(gene,Protein_Change,sep = "-"))
driver_mutations <- driver_mutations %>%
  mutate(gene_protein_change=paste(gene,protein_change,sep = "-"))

all_mut_ccf <- all_mut_ccf %>%
  mutate(is_driver=ifelse(gene_protein_change %in% driver_mutations$gene_protein_change,"yes","no"))

sum(all_mut_ccf$is_driver=="yes" & all_mut_ccf$neo=="yes") / sum(all_mut_ccf$neo=="yes")##0.01618271

all_mut_ccf %>% group_by(sample) %>%
  summarise(inter_gene=intersect(gene[neo=="yes"],
                                 gene[is_driver=="yes"])) -> aaa##701 samples
all_mut_ccf <- all_mut_ccf %>%
  mutate(sample_neo_index=paste(sample,neo,gene,sep = ","))
aaa <- aaa %>% mutate(sample_neo_index=paste(sample,"yes",inter_gene,sep = ","))

all_mut_ccf %>%
  mutate(in_aaa = ifelse(sample_neo_index %in% aaa$sample_neo_index,"yes","no")) %>%
  group_by(sample) %>%
  summarise(need_sample=ifelse(any(in_aaa=="yes"),"no","yes")) %>%
  filter(need_sample=="yes") -> summ2
need_samples <- intersect(samples_has_subclonal$sample,summ2$sample)
all_mut_ccf %>% filter(!is.na(ccf)) %>%
  filter(sample %in% need_samples) %>%
  group_by(sample) %>%
  summarise(c_n=sum(neo=="yes"),c_m=sum(neo=="no")) %>% filter(c_n>=1 & c_m >=1) -> summ
neo_missense <- all_mut_ccf %>% filter(sample %in% summ$sample)
neo_missense <- neo_missense %>% select(sample,neo,ccf) %>% filter(!is.na(ccf))

library(parallel)
neo_nes <- cal_nes_warp(neo_missense)
saveRDS(neo_nes,file = "../data/30_neo_es.rds")
To build the null distribution, we shuffle the neoantigen labels within each sample and recompute ESccf, repeating this 2,000 times:

#R
neo_nes <- readRDS("../data/30_neo_es.rds")
all_mut_ccf <- readRDS("../data/30_mut_ccf_tpm_neo.rds")
all_mut_ccf <- all_mut_ccf %>%
  rename(ccf=ccf_hat) %>%
  mutate(neo=ifelse(neo=="neo","yes","no"))
neo_missense <- all_mut_ccf %>%
  filter(sample %in% neo_nes$sample)
neo_missense <- neo_missense %>%
  select(sample,neo,ccf) %>%
  filter(!is.na(ccf))

cal_nes_warp <- function(dt){
  results_ccf <- vector("list",length = length(unique(dt$sample)))
  names(results_ccf) <- unique(dt$sample)
  
  cl <- makeCluster(getOption("cl.cores", 15),type="FORK")
  results_ccf <- parSapply(cl=cl,names(results_ccf),
                           function(x){
                             data <- dt %>% filter(sample == x)
                             a <- NeoEnrichment::cal_nes_new_test(dt = data,
                                                                  sample_counts = 1000,
                                                                  need_p = FALSE)
                             return(a)
                           },simplify = FALSE)
  stopCluster(cl)
  results_ccf <- Filter(function(x){length(x)>1},results_ccf)
  pancancer_nes_ccf <- bind_rows(results_ccf)
  return(pancancer_nes_ccf)
}
res <- vector("list",2000)
for (i in 1:2000){
  neo_missense_sim <- neo_missense %>%
    group_by(sample) %>%
    mutate(neo_sim=sample(neo,length(neo))) %>%
    ungroup()
  neo_missense_sim <- neo_missense_sim %>%
    select(-neo) %>%
    rename(neo=neo_sim)
  neo_nes_sim <- cal_nes_warp(neo_missense_sim)
  neo_nes_sim$sim_num <- i
  res[[i]] <- neo_nes_sim
  print(paste0("Complete ",i," sim. "))
}
saveRDS(res,file = "30_sim_2000_not_filter_driver.rds")

neo_nes <- readRDS("30_neo_es.rds")
sim_all <- readRDS("30_sim_2000_not_filter_driver.rds")
sim_all <- bind_rows(sim_all)
sim_all$cancer <- get_cancer_type(sim_all$sample,parallel = TRUE,cores = 20)
saveRDS(sim_all,file = "30_sim_add_cancer.rds")
We then compare the observed median ESccf (2% rank cutoff) with the simulated null distribution at the pan-cancer and cancer-type level:

#R
neo_nes <- readRDS("../data/neo_es.rds")
sim_all <- readRDS("../data/sim_add_cancer.rds")
sim_all %>%
  group_by(cancer,sim_num) %>%
  summarise(median_es=median(es)) -> summ2
#> `summarise()` has grouped output by 'cancer'. You can override using the
#> `.groups` argument.
neo_nes$cancer <- get_cancer_type(neo_nes$sample)
neo_nes_summ <- neo_nes %>%
  group_by(cancer) %>% summarise(median_es=median(es))
sim_all %>%
  group_by(sim_num) %>%
  summarise(median_es=median(es)) -> summ
p <- WVPlots::ShadedDensity(frame = summ,
                            xvar = "median_es",
                            threshold = median(neo_nes$es),
                            title = "",
                            tail = "left")

p$layers[[1]]$aes_params$colour <- "red"
p$layers[[1]]$aes_params$size <- 1
p$layers[[2]]$aes_params$fill <- "blue"     #geom_ribbon
p$layers[[3]]$aes_params$colour <- "black"
p$layers[[3]]$aes_params$size <- 1
#p$layers[[4]]$aes_params$label <- "Actual median ES" #geom_text
p1 <- p + labs(x="Simulation median es")+
  theme_prism()
p1

ggsave("../figure/null_dis.pdf")
#> Saving 8 x 5 in image
##cancer type
neo_nes_summ <- neo_nes_summ %>%
  rowwise() %>%
  mutate(p=mean(summ2$median_es[summ2$cancer==cancer] <= median_es))
get_f1 <- function(dt,pancancer_p,dt2,median_es){
  p1 <- ggplot(data=dt,aes(x=1,y=es))+
    geom_violin(alpha=0.7,width=0.5)+
    geom_boxplot(width=0.2)+
    theme_prism()+
    labs(x=paste0("median es = ",round(median_es,digits = 3),"\n n = ",nrow(dt)),y = "ESccf",size=5)+
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_blank())+
    theme(text = element_text(size = 10))+
    annotate(geom="text", x=1, y=1.1, label=paste0("p = ",pancancer_p),
             color="red",size=4)
  dt$cancer <- get_cancer_type(dt$sample)
  dt %>% group_by(cancer) %>%
    summarise(median_est=median(es),c=n()) %>%
    arrange(median_est) %>%
    mutate(label=paste0(cancer,"\n(n=",c,")"))-> summ1
  summ1 <- left_join(summ1,dt2 %>% select(cancer,p))
  summ1$p <- signif(summ1$p,digits = 1)
  summ1 <- summ1 %>%
    mutate(sig=case_when(
      p <= 0.05 & p > 0.01 ~ "*",
      p < 0.01 ~ "**",
      TRUE ~ "ns"
    ))
  dt <- left_join(dt,summ1)
  dt$label <- factor(dt$label,levels = summ1$label)
  df2 <- data.frame(x = 1:nrow(summ1), y = 1.1, family = summ1$sig)
  p2 <- ggplot(data=dt,aes(x=label,y=es))+
    geom_boxplot()+
    theme_prism()+
    theme(axis.title.x = element_blank())+
    theme(axis.text.x = element_text(angle = 45,vjust = 1, hjust = 1))+
    theme(text = element_text(size = 7))+
    geom_text(data=df2,aes(x=x,y=y,label = family))+
    geom_hline(yintercept=0,
               color = "red", size=1)+labs(y = "")
  p3 <- p1 + p2 + plot_layout(widths = c(1, 8))
}

p2 <- get_f1(neo_nes,pancancer_p = "< 0.0005",
             dt2 = neo_nes_summ,median_es = median(neo_nes$es))
#> Joining, by = "cancer"
#> Joining, by = "cancer"
p2

ggsave("../figure/immuno_edit.pdf",width = 12,height = 5,dpi = 300)

We also used a 30% rank cutoff to define neoantigens and calculated the ESccf score:

#R
neo_nes <- readRDS("../data/30_neo_es.rds")
sim_all <- readRDS("../data/30_sim_add_cancer.rds")
sim_all %>%
  group_by(cancer,sim_num) %>%
  summarise(median_es=median(es)) -> summ2
#> `summarise()` has grouped output by 'cancer'. You can override using the
#> `.groups` argument.
neo_nes$cancer <- get_cancer_type(neo_nes$sample)
neo_nes_summ <- neo_nes %>%
  group_by(cancer) %>% summarise(median_es=median(es))
sim_all %>%
  group_by(sim_num) %>%
  summarise(median_es=median(es)) -> summ
p <- WVPlots::ShadedDensity(frame = summ,
                            xvar = "median_es",
                            threshold = median(neo_nes$es),
                            title = "",
                            tail = "left")

p$layers[[1]]$aes_params$colour <- "red"
p$layers[[1]]$aes_params$size <- 1
p$layers[[2]]$aes_params$fill <- "blue"     #geom_ribbon
p$layers[[3]]$aes_params$colour <- "black"
p$layers[[3]]$aes_params$size <- 1
#p$layers[[4]]$aes_params$label <- "Actual median ES" #geom_text
p1 <- p + labs(x="Simulation median es")+
  theme_prism()
p1

ggsave("../figure/null_dis_30.pdf")
#> Saving 8 x 5 in image
##cancer type
neo_nes_summ <- neo_nes_summ %>%
  rowwise() %>%
  mutate(p=mean(summ2$median_es[summ2$cancer==cancer] <= median_es))
get_f1 <- function(dt,pancancer_p,dt2,median_es){
  p1 <- ggplot(data=dt,aes(x=1,y=es))+
    geom_violin(alpha=0.7,width=0.5)+
    geom_boxplot(width=0.2)+
    theme_prism()+
    labs(x=paste0("median es = ",round(median_es,digits = 3),"\n n = ",nrow(dt)),y = "ESccf",size=5)+
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_blank())+
    theme(text = element_text(size = 10))+
    annotate(geom="text", x=1, y=1.1, label=paste0("p = ",pancancer_p),
             color="red",size=4)
  dt$cancer <- get_cancer_type(dt$sample)
  dt %>% group_by(cancer) %>%
    summarise(median_est=median(es),c=n()) %>%
    arrange(median_est) %>%
    mutate(label=paste0(cancer,"\n(n=",c,")"))-> summ1
  summ1 <- left_join(summ1,dt2 %>% select(cancer,p))
  summ1$p <- signif(summ1$p,digits = 1)
  summ1 <- summ1 %>%
    mutate(sig=case_when(
      p <= 0.05 & p > 0.01 ~ "*",
      p < 0.01 ~ "**",
      TRUE ~ "ns"
    ))
  dt <- left_join(dt,summ1)
  dt$label <- factor(dt$label,levels = summ1$label)
  df2 <- data.frame(x = 1:nrow(summ1), y = 1.1, family = summ1$sig)
  p2 <- ggplot(data=dt,aes(x=label,y=es))+
    geom_boxplot()+
    theme_prism()+
    theme(axis.title.x = element_blank())+
    theme(axis.text.x = element_text(angle = 45,vjust = 1, hjust = 1))+
    theme(text = element_text(size = 7))+
    geom_text(data=df2,aes(x=x,y=y,label = family))+
    geom_hline(yintercept=0,
               color = "red", size=1)+labs(y = "")
  p3 <- p1 + p2 + plot_layout(widths = c(1, 8))
}

p2 <- get_f1(neo_nes,pancancer_p = "0.14",
             dt2 = neo_nes_summ,median_es = median(neo_nes$es))
#> Joining, by = "cancer"
#> Joining, by = "cancer"
p2

ggsave("../figure/immuno_edit_30.pdf",width = 12,height = 5,dpi = 300)

In the TCGA pan-cancer cohort, after removing samples in which antigenic and driver mutations lie on the same gene, the observed median ESccf is -0.024 (n = 5,524, P-value < 0.0005). By comparison with the simulated median ES distributions, several cancer types, including adrenocortical carcinoma (ACC), thymoma (THYM), esophageal carcinoma (ESCA), cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC) and lung adenocarcinoma (LUAD), show significantly low ESccf values.
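For reference, the pan-cancer P-value quoted above is empirical: the fraction of the 2,000 simulated median ES values at least as low as the observed median. A minimal sketch reusing the `summ` and `neo_nes` objects from the plotting code above:

#R
p_pancancer <- mean(summ$median_es <= median(neo_nes$es))  # one-sided empirical p
# when no simulated median falls at or below the observed value, this is
# reported as p < 1/2000 = 0.0005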

For comparison, we repeated the same immunoediting analysis using neoantigens defined by netMHCIIpan binding affinity:

#R
sim_all <- readRDS("../data/net_sim_add_cancer.rds")
neo_nes <- readRDS("../data/net_neo_es.rds")
sim_all %>%
  group_by(cancer,sim_num) %>%
  summarise(median_es=median(es)) -> summ2
#> `summarise()` has grouped output by 'cancer'. You can override using the
#> `.groups` argument.
neo_nes$cancer <- get_cancer_type(neo_nes$sample)
neo_nes_summ <- neo_nes %>%
  group_by(cancer) %>% summarise(median_es=median(es))
sim_all %>%
  group_by(sim_num) %>%
  summarise(median_es=median(es)) -> summ
p <- WVPlots::ShadedDensity(frame = summ,
                            xvar = "median_es",
                            threshold = median(neo_nes$es),
                            title = "",
                            tail = "left")

p$layers[[1]]$aes_params$colour <- "red"
p$layers[[1]]$aes_params$size <- 1
p$layers[[2]]$aes_params$fill <- "blue"     #geom_ribbon
p$layers[[3]]$aes_params$colour <- "black"
p$layers[[3]]$aes_params$size <- 1
#p$layers[[4]]$aes_params$label <- "Actual median ES" #geom_text
p1 <- p + labs(x="Simulation median es")+
  theme_prism()
p1

ggsave("../figure/net_null_dis.pdf")
#> Saving 8 x 5 in image
##cancer type
neo_nes_summ <- neo_nes_summ %>%
  rowwise() %>%
  mutate(p=mean(summ2$median_es[summ2$cancer==cancer] <= median_es))
get_f1 <- function(dt,pancancer_p,dt2,median_es){
  p1 <- ggplot(data=dt,aes(x=1,y=es))+
    geom_violin(alpha=0.7,width=0.5)+
    geom_boxplot(width=0.2)+
    theme_prism()+
    labs(x=paste0("median es = ",round(median_es,digits = 3),"\n n = ",nrow(dt)),size=7)+
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_blank())+
    theme(text = element_text(size = 10))+
    annotate(geom="text", x=1, y=1.1, label=paste0("p = ",pancancer_p),
             color="red",size=4)+labs(y = "ESccf",size = 8)
  dt$cancer <- get_cancer_type(dt$sample)
  dt %>% group_by(cancer) %>%
    summarise(median_est=median(es),c=n()) %>%
    arrange(median_est) %>%
    mutate(label=paste0(cancer,"\n(n=",c,")"))-> summ1
  summ1 <- left_join(summ1,dt2 %>% select(cancer,p))
  summ1$p <- signif(summ1$p,digits = 1)
  summ1 <- summ1 %>%
    mutate(sig=case_when(
      p <= 0.05 & p > 0.01 ~ "*",
      p < 0.01 ~ "**",
      TRUE ~ "ns"
    ))
  dt <- left_join(dt,summ1)
  dt$label <- factor(dt$label,levels = summ1$label)
  df2 <- data.frame(x = 1:nrow(summ1), y = 1.1, family = summ1$sig)
  p2 <- ggplot(data=dt,aes(x=label,y=es))+
    geom_boxplot()+
    theme_prism()+
    theme(axis.title.x = element_blank())+
    theme(axis.text.x = element_text(angle = 45,vjust = 1, hjust = 1))+
    theme(text = element_text(size = 7))+
    geom_text(data=df2,aes(x=x,y=y,label = family))+
    geom_hline(yintercept=0,
               color = "red", size=1)+labs(y = "")
  p3 <- p1 + p2 + plot_layout(widths = c(1, 8))
  
}

p2 <- get_f1(neo_nes,pancancer_p = 0.0005,
             dt2 = neo_nes_summ,median_es = median(neo_nes$es))
#> Joining, by = "cancer"
#> Joining, by = "cancer"

p2

ggsave("../figure/net_immunoediting.pdf",width = 12,height = 5,dpi = 300)