Prostate Cancer Variation Signature Analysis Report
Contact Prof. Xue-Song Liu: liuxs@shanghaitech.edu.cn
This report is written to help readers understand what we did and how we did it in this project. Please read the formal manuscript (TBD) for more details.
This document is compiled from an Rmarkdown file which contains all code and descriptions necessary to (auto-)reproduce the analysis for the accompanying project. Each section below describes a different component of the analysis, and all numbers and figures are generated directly from the underlying data on compilation.
Data Preprocessing
This part focuses on data preprocessing. It contains the following sections:
- Data downloading
- Data preprocessing
- Pipeline
- Copy number calling
- Mutation calling
- Data cleaning
Data downloading
This section describes where and how we downloaded the data.
WES data
Whole exome sequencing raw data were downloaded from the following 6 dbGap studies:
- phs000178: the TCGA prostate cancer cohort; the raw data are bam files stored in the GDC portal, which we accessed with dbGap permission
- phs000447: sra data
- phs000554: sra data
- phs000909: sra data
- phs000915: sra data
- phs001141: sra data
A recent study (Armenia et al. 2018) used 5 dbGap studies:
- phs000178
- phs000447
- phs000909
- phs000915
- phs000945
Our study contains two datasets, phs000554 and phs001141, that are not in Armenia et al. (2018); however, phs000945 was excluded from our study because it is unavailable in the dbGap database.
Therefore, to our knowledge, we included all available raw WES data in our study.
After selecting the raw WES data, we downloaded them with either sratools (for dbGap data) or gdc-client (for TCGA data).
Phenotype data
Phenotype data of phs000447, phs000554, phs000909, phs000915 and phs001141 were downloaded along with the raw WES data. Phenotype data of the TCGA prostate cancer cohort were downloaded from UCSC Xena with UCSCXenaTools (Wang and Liu 2019). The cleaning process is described in the ‘Data cleaning’ section.
Survival data
Survival data of phs000554 were included in the phenotype data described above. Survival data of the TCGA prostate cancer cohort were downloaded from UCSC Xena with UCSCXenaTools (Wang and Liu 2019). The cleaning process is described in the ‘Data cleaning’ section.
Data preprocessing
This section describes how we preprocessed the data.
Pipeline
The following diagram shows our upstream workflow. Steps before signature identification belong to preprocessing. The preprocessing pipeline generates the variation records of the samples.
Sequence alignment
For the cohorts phs000447, phs000554, phs000909, phs000915 and phs001141, the raw data are in sra format, so we performed sequence alignment ourselves. To keep in line with the TCGA bam data, we used the same reference genome (downloaded from the GDC portal) and operations (see the GDC docs).
Of note, the preprocessing was done on a Linux server provided by the High Performance Computing Service of ShanghaiTech University, so the work is presented as PBS scripts. If readers want to reproduce this work, please focus on the key code lines.
The details are described below.
1. from sra to fastq
The raw WES data downloaded from the dbGap database are in sra format, so we first converted them into fastq format.
#PBS -N fastq_<sample>
#PBS -l nodes=1:ppn=1
#PBS -l walltime=20:00:00
#PBS -l mem=10gb
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate wes
workdir=/public/home/liuxs/ncbi/dbGaP-16533
cd $workdir
fastq-dump -outdir fastq --split-3 --skip-technical --clip --gzip $workdir/sra/<sample>.sra
2. remove adapters
This step detected and removed adapters with trimgalore (Krueger 2015).
#PBS -N <sample>_clean
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate wes
workdir=/public/home/liuxs/ncbi/dbGaP-16533
trim_galore -q 20 --phred33 --stringency 3 --length 20 -e 0.1 --paired --gzip $workdir/fastq/<sample>_1.fastq.gz \
$workdir/fastq/<sample>_2.fastq.gz -o $workdir/dnaseq/trimclean
3. bwa
This step aligned sequence reads to the reference genome with the recommended BWA MEM algorithm (Li 2013).
#PBS -N mem_<sample>
#PBS -l nodes=1:ppn=4
#PBS -l walltime=35:00:00
#PBS -l mem=20gb
#PBS -S /bin/bash
#PBS -q normal_3
#PBS -j oe
source activate wes
# align reads, then move the SAM into the bwa output directory
bwa mem -M -R "@RG\tID:<sample>\tLM:<sample>\tSM:<sample>\tPL:illumina\tPU:<sample>" \
  /public/home/liuxs/biodata/reference/genome/gdc_hg38/GRCh38.d1.vd1.fa \
  /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/trimclean/<sample>_1_val_1.fq.gz \
  /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/trimclean/<sample>_2_val_2.fq.gz \
  > <sample>.sam && \
mv <sample>.sam /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/bwa
4. sam to bam
This step converted sam files to the smaller bam files with samtools (Li et al. 2009).
5. sort bam
Next we sorted the bam files with Picard (“Picard Toolkit” 2019).
#PBS -N sort_<sample>
#PBS -l nodes=1:ppn=4
#PBS -l walltime=15:00:00
#PBS -l mem=20gb
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate wes
java -jar -Xmx12G -Djava.io.tmpdir=/public/home/liuxs/lhm/tmp \
  /public/home/liuxs/anaconda3/envs/wes/share/picard-2.20.6-0/picard.jar SortSam \
  I=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/bam/<sample>.bam \
  O=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/bamsort/<sample>.sort.bam \
  SORT_ORDER=coordinate
6. mark duplicates
This step marked duplicates with Picard.
#PBS -N rmdump_<sample>
#PBS -l nodes=1:ppn=4
#PBS -l walltime=10:00:00
#PBS -l mem=10gb
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate wes
java -jar /public/home/liuxs/anaconda3/envs/wes/share/picard-2.20.6-0/picard.jar MarkDuplicates \
  I=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/bamsort/<sample>.sort.bam \
  O=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/markqump/<sample>.rdup.bam \
  VALIDATION_STRINGENCY=LENIENT \
  MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
  M=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/markqump/matric/<sample>.sort.addhead.rmdup.metric
We then built the index with samtools.
7. BQSR
A base quality score recalibration (BQSR) step was then performed.
#PBS -N bqsr_<sample1>
#PBS -l nodes=1:ppn=4
#PBS -l walltime=30:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate wes
cd /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/markqump
sample=<sample1>
dbsnp=/public/home/liuxs/biodata/reference/genome/hg38/ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz
dbsnp1000G=/public/home/liuxs/biodata/reference/genome/hg38/ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz
dbindel100G=/public/home/liuxs/biodata/reference/genome/hg38/ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
if [ -e $sample.rdup.bam ]; then
java -jar -Xmx12G -Djava.io.tmpdir=/public/home/liuxs/lhm/tmp /public/home/liuxs/anaconda3/envs/wes/share/gatk4-4.1.3.0-0/gatk-package-4.1.3.0-local.jar BaseRecalibrator \
-R /public/home/liuxs/biodata/reference/genome/gdc_hg38/GRCh38.d1.vd1.fa \
-I ${sample}.rdup.bam \
-O /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/table/${sample}.recal_data.table \
--known-sites $dbsnp --known-sites $dbsnp1000G --known-sites $dbindel100G && \
java -jar -Xmx12G -Djava.io.tmpdir=/public/home/liuxs/lhm/tmp /public/home/liuxs/anaconda3/envs/wes/share/gatk4-4.1.3.0-0/gatk-package-4.1.3.0-local.jar ApplyBQSR \
-R /public/home/liuxs/biodata/reference/genome/gdc_hg38/GRCh38.d1.vd1.fa \
-I ${sample}.rdup.bam \
--bqsr-recal-file /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/table/${sample}.recal_data.table \
-O /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/bqsrbam/${sample}.sorted.marked.BQSR.bam
echo gatk-BQSR `date`
fi
The resulting bam files were then used for copy number calling and mutation calling.
Copy number calling
The absolute copy number profile of each sample was detected with two well-known tools: FACETS (Shen and Seshan 2016) and Sequenza (Favero et al. 2014).
The code to generate tumor-normal pairs for FACETS is here; the code to generate tumor-normal pairs for Sequenza is here.
FACETS
We followed the standard FACETS pipeline described in its vignette to call absolute copy number.
1. SNP pileup
This step followed https://github.com/mskcc/facets/blob/master/inst/extcode/README.txt.
For dbGap studies:
#PBS -N <sample>_snp
#PBS -l nodes=1:ppn=1
#PBS -l walltime=10:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
cd /public/home/liuxs/anaconda3/envs/R/lib/R/library/facets/extcode
source activate R
# <sample1>=tumorbam
# <sample2>=normalbam
path1="/public/home/liuxs/ncbi/dbGaP-16533/copy/out"
path2="/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/bqsrbam"
path3="/public/home/liuxs/ncbi/dbGaP-21926/dnaseq/BQSR/bqsrbam"
name=<sample1>
if [ -e /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/bqsrbam/$name.sorted.marked.BQSR.bam ];
then
./snp-pileup -g -q15 -Q20 -P100 -r25,0 /public/home/liuxs/biodata/reference/genome/hg38/snpvcf/common_all_20180418.vcf.gz \
  ${path1}/<sample>.out.gz \
  ${path2}/<sample2>.sorted.marked.BQSR.bam \
  ${path2}/<sample1>.sorted.marked.BQSR.bam
else
./snp-pileup -g -q15 -Q20 -P100 -r25,0 /public/home/liuxs/biodata/reference/genome/hg38/snpvcf/common_all_20180418.vcf.gz \
  ${path1}/<sample>.out.gz \
  ${path3}/<sample2>.sorted.marked.BQSR.bam \
  ${path3}/<sample1>.sorted.marked.BQSR.bam
fi
For TCGA:
#PBS -N <sample>_snp
#PBS -l nodes=1:ppn=1
#PBS -l walltime=10:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
cd /public/home/liuxs/anaconda3/envs/R/lib/R/library/facets/extcode
source activate R
# <sample1>=tumorbam
# <sample2>=normalbam
path1="/public/home/liuxs/ncbi/dbGaP-16533/copy/csv"
path2="/public/home/liuxs/biodata/gdc/links/TCGA_PRAD"
./snp-pileup -g -q15 -Q20 -P100 -r25,0 /public/home/liuxs/biodata/reference/genome/hg38/snpvcf/common_all_20180418.vcf.gz ${path1}/<sample>.out.gz ${path2}/<sample2>.bam ${path2}/<sample1>.bam
2. run the facets package
This is a template R script used for copy number calling.
#! /usr/bin/env Rscript
library("pctGCdata")
library("facets")
set.seed(1234)
rcmat = readSnpMatrix("/public/home/liuxs/ncbi/dbGaP-16533/copy/out/<sample>.out.gz")
xx = preProcSample(rcmat,gbuild = "hg38")
oo=procSample(xx,cval=150)
fit=emcncf(oo)
#plot
pdf("/public/home/liuxs/ncbi/dbGaP-16533/copy/facetdata_150/<sample>.pdf")
plotSample(x=oo,emfit=fit)
logRlogORspider(oo$out, oo$dipLogR)
while (!is.null(dev.list())) dev.off()
#
save(fit,file = "/public/home/liuxs/ncbi/dbGaP-16533/copy/facetdata_150/<sample>.Rdata")
# output purity and ploidy -----
purity=fit$purity
purity=round(purity,2)
ploidy=fit$ploidy
ploidy=round(ploidy,1)
output <- paste("<sample>", purity, ploidy, sep = "\t")
write(output, "/public/home/liuxs/ncbi/dbGaP-16533/copy/facetdata_150/<sample>.txt", append = TRUE)
This is the PBS script used for calling the R script above.
Sequenza
We followed the standard Sequenza pipeline described in its vignette to call absolute copy number. A modified copynumber package was used to work with the hg38 genome build.
1. process a fasta file to produce a GC wiggle track file
2. process bam and wiggle files to produce a small seqz file:
#PBS -N PBS_<sample>_seqz
#PBS -l nodes=1:ppn=1
#PBS -l walltime=70:00:00
#PBS -S /bin/bash
#PBS -j oe
#PBS -q normal_8
source activate python3
ref_file=/public/home/liuxs/biodata/reference/genome/gdc_hg38/GRCh38.d1.vd1.fa
gc_file=/public/home/wangshx/wangshx/PRAD_Sig/hg38.gc50Base.wig.gz
out_dir=/public/home/wangshx/wangshx/PRAD_Sig
seqz_dir=$out_dir"/seqz"
sseqz_dir=$out_dir"/small-seqz"
mkdir -p $seqz_dir
mkdir -p $sseqz_dir
#tumor=<tumor>
#normal=<normal>
sequenza-utils bam2seqz -n <normal> -t <tumor> \
--fasta $ref_file -gc $gc_file \
-o $seqz_dir/"<sample>.seqz.gz"
sequenza-utils seqz_binning --seqz $seqz_dir/"<sample>.seqz.gz" \
-w 50 -o $sseqz_dir/"<sample>.small.seqz.gz"
# Only keep chr1-22,X,Y
zcat $sseqz_dir/"<sample>.small.seqz.gz" | \
awk '/^chromosome|chr[12]?[0-9XY]\t/{print}' | \
gzip > $sseqz_dir/"<sample>.small_filter.seqz.gz"
3. call copy number in R
This is a template R script used for copy number calling.
#! /usr/bin/env Rscript
library("sequenza")
args <- commandArgs(TRUE)
sample_id <- args[1]
input_file <- args[2]
out_dir <- args[3]
# https://github.com/ShixiangWang/copynumber
test <- sequenza.extract(input_file, assembly = "hg38")
CP <- sequenza.fit(test, female = FALSE)
sequenza.results(
sequenza.extract = test,
cp.table = CP,
sample.id = sample_id,
out.dir = out_dir,
female = FALSE
)
This is a template PBS script used for calling the R script above.
#PBS -N PBS_<sample>_seqz
#PBS -l nodes=1:ppn=1
#PBS -l walltime=70:00:00
#PBS -S /bin/bash
#PBS -j oe
#PBS -q normal_8
source activate R_36
out_dir=/public/home/wangshx/wangshx/PRAD_Sig
sseqz_dir=$out_dir"/small-seqz"
res_dir=$out_dir"/seqz_wes_result"
Rscript $out_dir"/3-sequenza.R" <sample> $sseqz_dir/"<sample>.small_filter.seqz.gz" $res_dir
Mutation calling
Somatic mutations were detected by following the GATK best practices with Mutect2 (Van der Auwera et al. 2013).
1. call somatic SNVs and short INDELs
#PBS -N call-<sample>
#PBS -l nodes=1:ppn=7
#PBS -l walltime=30:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate wes
gene=/public/home/liuxs/biodata/reference/genome/gdc_hg38
normal=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/bqsrbam
tumor=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/bqsrbam
path=/public/home/liuxs/ncbi/dbGaP-21926/dnaseq/BQSR/bqsrbam
ref=/public/home/liuxs/biodata/reference/genome/hg38/test/ftp.broadinstitute.org/bundle/Mutect2
out=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/vcf
inter=/public/home/liuxs/ncbi/dbGaP-16533/run/gatk/mutect
out2=/public/home/liuxs/ncbi/dbGaP-21926/dnaseq/BQSR/vcf
name=<sample1>
if [ -e /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/bqsrbam/$name.sorted.marked.BQSR.bam ];
then
gatk --java-options "-Xmx20G -Djava.io.tmpdir=/public/home/liuxs/lhm/tmp" Mutect2 \
-R $gene/GRCh38.d1.vd1.fa \
-L $inter/Homo_sapiens_assembly38.targets.interval_list \
-I $tumor/<sample1>.sorted.marked.BQSR.bam \
-tumor <sample1> \
-I $normal/<sample2>.sorted.marked.BQSR.bam \
-normal <sample2> \
--germline-resource $ref/af-only-gnomad.hg38.vcf.gz \
--panel-of-normals /public/home/liuxs/biodata/reference/genome/hg38/test/mu_call_ref/ref/ref/test/1000g_pon.hg38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
--bam-output $out/bam/<sample>.bam \
--f1r2-tar-gz $out/fir2/<sample>.tar.gz \
-O $out/<sample>.vcf
else
gatk --java-options "-Xmx20G -Djava.io.tmpdir=/public/home/liuxs/lhm/tmp" Mutect2 \
-R $gene/GRCh38.d1.vd1.fa \
-L $inter/Homo_sapiens_assembly38.targets.interval_list \
-I $path/<sample1>.sorted.marked.BQSR.bam \
-tumor <sample1> \
-I $path/<sample2>.sorted.marked.BQSR.bam \
-normal <sample2> \
--germline-resource $ref/af-only-gnomad.hg38.vcf.gz \
--panel-of-normals /public/home/liuxs/biodata/reference/genome/hg38/test/mu_call_ref/ref/ref/test/1000g_pon.hg38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
--bam-output $out2/bam/<sample>.bam \
--f1r2-tar-gz $out2/fir2/<sample>.tar.gz \
-O $out2/<sample>.vcf
fi
2. estimate cross-sample contamination using GetPileupSummaries and CalculateContamination.
#PBS -N getpile_normal_<sample2>
#PBS -l nodes=1:ppn=7
#PBS -l walltime=30:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate wes
bam=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/bqsrbam
SNP=/public/home/liuxs/biodata/reference/genome/hg38/test/ftp.broadinstitute.org/bundle/Mutect2
out=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/vcf/piptable
path=/public/home/liuxs/ncbi/dbGaP-21926/dnaseq/BQSR/bqsrbam
out2=/public/home/liuxs/ncbi/dbGaP-21926/dnaseq/BQSR/vcf/piptable
gatk --java-options "-Xmx20G -Djava.io.tmpdir=/public/home/wangshx/wx/tmp" GetPileupSummaries \
-I $bam/<sample2>.sorted.marked.BQSR.bam \
-L /public/home/liuxs/ncbi/dbGaP-16533/run/gatk/mutect/Homo_sapiens_assembly38.targets.interval_list \
-V $SNP/af-only-gnomad.hg38.vcf.gz \
-O $out/<sample2>-normal.pileups.table
#PBS -N seg_and_cont
#PBS -l nodes=1:ppn=7
#PBS -l walltime=30:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate wes
input=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/vcf/piptable
input2=/public/home/liuxs/ncbi/dbGaP-21926/dnaseq/BQSR/vcf/piptable
name=<sample1>
if [ -e /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/vcf/piptable/$name-tumor.pileups.table ];
then
gatk --java-options "-Xmx20G -Djava.io.tmpdir=/public/home/wangshx/wx/tmp" CalculateContamination \
-I $input/<sample1>-tumor.pileups.table \
-matched $input/<sample2>-normal.pileups.table \
-O $input/<sample>-contamination.table \
-tumor-segmentation $input/<sample>-segments.table
else
gatk --java-options "-Xmx20G -Djava.io.tmpdir=/public/home/wangshx/wx/tmp" CalculateContamination \
-I $input2/<sample1>-tumor.pileups.table \
-matched $input2/<sample2>-normal.pileups.table \
-O $input2/<sample>-contamination.table \
-tumor-segmentation $input2/<sample>-segments.table
fi
3. get tumor artifact priors using LearnReadOrientationModel.
#PBS -N flr_<sample>
#PBS -l nodes=1:ppn=7
#PBS -l walltime=20:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate wes
out=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/vcf/fir2
java -jar /public/home/liuxs/anaconda3/envs/wes/share/gatk4-4.1.3.0-0/gatk-package-4.1.3.0-local.jar LearnReadOrientationModel \
-I /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/vcf/fir2/<sample>.tar.gz \
-O $out/<sample>-tumor-artifact-prior.tar.gz
4. keep confident somatic calls using FilterMutectCalls.
#PBS -N filter-<sample>
#PBS -l nodes=1:ppn=4
#PBS -l walltime=20:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate wes
gene=/public/home/liuxs/biodata/reference/genome/gdc_hg38
out=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/vcf
input=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/vcf/piptable
fir2=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/vcf/fir2
gatk --java-options "-Xmx20G -Djava.io.tmpdir=/public/home/liuxs/lhm/tmp" FilterMutectCalls \
-R $gene/GRCh38.d1.vd1.fa \
-V $out/<sample>.vcf \
-O /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/filter/<sample>.filter.vcf \
--tumor-segmentation $input/<sample>-segments.table \
--contamination-table $input/<sample>-contamination.table \
--ob-priors $fir2/<sample>-tumor-artifact-prior.tar.gz \
--min-allele-fraction 0.05
5. get PASS-labeled mutations
#PBS -N call-<sample>
#PBS -l nodes=1:ppn=7
#PBS -l walltime=30:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate wes
vcftools --vcf /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/filter/<sample>.filter.vcf \
  --remove-filtered-all --recode --recode-INFO-all --out /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/PASS/<sample>-SNvs_only
6. annotate mutations
This step annotated somatic mutations with VEP (McLaren et al. 2016).
#PBS -N annotate-<sample>
#PBS -l nodes=1:ppn=4
#PBS -l walltime=20:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate neo
vep --input_file /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/PASS/<sample>-SNvs_only.recode.vcf \
--output_file /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/annonate/<sample>.anno.vcf \
--symbol --term SO --format vcf --vcf --cache --assembly GRCh38 --tsl \
--hgvs --fasta ~/.vep/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
--offline --cache --dir_cache ~/.vep/ \
--transcript_version --cache_version 98 \
--plugin Downstream --plugin Wildtype \
--dir_plugins ~/.vep/Plugins
7. convert VCF to MAF
#PBS -N call-<sample>
#PBS -l nodes=1:ppn=7
#PBS -l walltime=30:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe
source activate neo
vcf2maf.pl \
--input-vcf /public/home/liuxs/ncbi/dbGaP-16533/mutation/annonatevcf/<sample>.anno.vcf \
--output-maf /public/home/liuxs/ncbi/dbGaP-16533/mutation/vcf2maf/<sample>.maf \
--ref-fasta /public/home/liuxs/biodata/reference/genome/gdc_hg38/GRCh38.d1.vd1.fa \
--tumor-id <sample1> --normal-id <sample2> --vep-path /public/home/liuxs/anaconda3/envs/neo/bin \
--ncbi-build GRCh38
Data cleaning
This section describes how we cleaned the data for downstream analysis.
Phenotype data
Two info tables of SRA runs downloaded from dbGap are stored here. Phenotype data of the dbGap studies are stored here. Phenotype data of some cohorts were also downloaded from the original articles and cleaned by hand. We then carefully checked and compared all available data, extracted key info, and generated tidy datasets, including the sample-pair data used for variation calling and the phenotype data used for downstream analysis.
The whole procedure is recorded in 01-tidy_mapping.R, 02-tidy_clinical_data.R and 03-combine_and_clean_IDs.R.
Survival data
We merged the two available clinical datasets, from TCGA and phs000554, with an R script.
Variation profile
Copy number calling and mutation calling both generated per-sample results. We further merged the per-sample results of each caller into a single file.
For results from Sequenza, we merged them with R script 00-generate-samplefile-for-seqz-cnv-calling.R.
For results from FACETS, we merged them with R script 00-generate-samplefile-for-facets-cnv-calling.R.
For results from Mutect2, we merged the VCFs into a MAF with the tool vcf2maf. TODO: more detailed code.
Signature Identification
This part has been divided into the following 4 sections. It describes how to derive the genomic variation (or instability) signatures from the mutation profiles (including the copy number profile and the single base substitution (SBS) profile).
- Tally variation components
- Estimate signature number
- Extract signatures
- Show signature profile and aetiology
All the processes have been implemented in the R package sigminer.
Tally variation components
Methods
For the SBS profile, as previously reported (Alexandrov et al. 2013), for each sample we first classified mutation records into six substitution subtypes: C>A, C>G, C>T, T>A, T>C, and T>G (all substitutions are referred to by the pyrimidine of the mutated Watson–Crick base pair). Further, each substitution was examined by incorporating information on the bases immediately 5’ and 3’ of each mutated base, generating 96 possible mutation types (6 types of substitution ∗ 4 types of 5’ base ∗ 4 types of 3’ base). Each of the 96 mutation types is called a component here.
For the copy number profile, we first computed the genome-wide distributions of 8 fundamental copy number features for each sample:
- the breakpoint count per 10 Mb (named BP10MB)
- the breakpoint count per chromosome arm (named BPArm)
- the copy number of the segments (named CN)
- the difference in copy number between adjacent segments (named CNCP)
- the lengths of oscillating copy number segment chains (named OsCN)
- the log10 based size of segments (named SS)
- the minimal number of chromosomes with 50% copy number variation (named NC50)
- the burden of chromosome (named BoChr)
These features were selected as hallmarks of previously reported genomic aberrations like chromothripsis or to denote the distribution pattern of copy number events. The first 6 features have been used in Macintyre et al. (2018) to uncover the mutational processes in ovarian carcinoma.
Next, unlike Macintyre et al. (2018), who applied mixture modeling to separate the distributions of the first 6 copy number features into mixtures of Poisson or Gaussian distributions, we directly classified the distributions of all 8 copy number features into components according to a comprehensive consideration of value range, abundance and biological significance. Most of the resulting components are discrete values; the others are range values.
The settings of the 8 features with 80 components are shown below.
When a component is a discrete value (label ‘point’), min = max.
When a component is a range value (label ‘range’), the range is left open and right closed.
Of note, a blank in the min column means -Inf and a blank in the max column means Inf. [23] in BoChr represents chromosome X (chromosome Y is excluded).
Compared to the method of Macintyre et al. (2018), our method has several advantages:
- the computation is more efficient.
- the meaning of component is easier to read and understand.
- the new features NC50 and BoChr can be used to determine the distribution pattern of copy number events (global or local) and the contribution of each chromosome.
- most importantly, the classification is fixed, so it is much easier to compare signatures within/across tumor types, to compare results across different studies, and to compare the resulting signatures to reference signatures. In short, our method establishes a standard to extract, study and compare copy number signatures.
- our method is more extensible.
I discuss points 1-4 in a separate section (“Method comparison between Macintyre et al and our study”); point 5 will be described in the sigminer vignette (TBD).
We generate a sample-by-component matrix using the methods above and treat the result as the input of the nonnegative matrix factorization (NMF) algorithm for extracting signatures.
Tally components
Copy number
Read the absolute copy number profile as a CopyNumber object.
# Load packages -----------------------------------------------------------
suppressPackageStartupMessages(library(sigminer))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(NMF))
# Set this per R session
options(sigminer.sex = "male", sigminer.copynumber.max = 20L)
# Generate CopyNumber object ----------------------------------------------
CNV.seqz <- read_copynumber("../data/CNV_from_sequenza.tsv",
genome_build = "hg38",
complement = FALSE, verbose = TRUE
)
# remove WCMC160-SRR3146971 with only one CNV
CNV.seqz <- subset(CNV.seqz, subset = !sample %in% "WCMC160-SRR3146971")
save(CNV.seqz, file = "../output/CNV.seqz.RData")
Tally the alteration components.
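The tally call itself is not shown here; a minimal sketch is given below (the method label "W" and the core count are assumptions; the object name CNV.seqz.tally.W and the output path match their later use in this report). See ?sigminer::sig_tally for details.
# Tally copy number components with the fixed classification described in the Methods above
# (a sketch; "W" refers to this study's component tally method, cores is an assumed value)
CNV.seqz.tally.W <- sig_tally(CNV.seqz, method = "W", cores = 20)
save(CNV.seqz.tally.W, file = "../output/CNV.seqz.tally.W.RData")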
SBS
Read the SBS profile as a MAF object.
# Reading data ------------------------------------------------------------
Maf <- data.table::fread("/public/data/maf/all.maf")
# Remove all NA columns
Maf <- Maf[, which(unlist(lapply(Maf, function(x) !all(is.na(x))))), with = F]
Maf <- read_maf(Maf)
save(Maf, file = "../output/PRAD_TCGA_plus_dbGap_Maf.RData")
The resulting MAF object can be used in any analysis provided by the R package maftools (Mayakonda and Koeffler 2016).
Tally the SBS components.
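Again, the tally call is not shown; a minimal sketch (the hg38 BSgenome package name and the output path are assumptions; Maf.tally matches its later use in this report):
# Tally the 96 SBS components per sample (a sketch; arguments are assumptions)
Maf.tally <- sig_tally(
  Maf,
  ref_genome = "BSgenome.Hsapiens.UCSC.hg38",  # assumed hg38 reference, consistent with the alignment above
  mode = "SBS"
)
save(Maf.tally, file = "../output/Maf.tally.RData")  # hypothetical output path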
Estimate signature number
Here, the R package NMF (Gaujoux and Seoighe 2010) is used to run the NMF algorithm. The factorization rank r in NMF defines the number of signatures used to approximate the target sample-by-component matrix. A common way of deciding on r is to try different values, compute some quality measure of the results, and choose the best value according to this quality criterion. As suggested, performing 30-50 runs is considered sufficient to get a robust estimation of the r value. Here we perform 50 runs.
Copy number
Estimate the number of copy number signatures from 2 to 12.
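The call mirrors the SBS one below; a sketch (EST.seqz.W.all and its output path match the file loaded later in this report):
EST.seqz.W.all <- sig_estimate(CNV.seqz.tally.W$nmf_matrix,
  range = 2:12, nrun = 50, cores = ncores, use_random = TRUE,
  save_plots = FALSE,
  verbose = TRUE
)
save(EST.seqz.W.all, file = "../output/EST.seqz.W.all.RData")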
SBS
Estimate the number of SBS signatures from 2 to 10.
# Remove the effect of hyper-mutated samples (without removing the samples themselves)
# The idea is adopted from the SignatureAnalyzer package
nmf_matrix <- handle_hyper_mutation(Maf.tally$nmf_matrix)
EST.Maf <- sig_estimate(nmf_matrix,
range = 2:10, nrun = 50, cores = ncores, use_random = TRUE,
save_plots = FALSE,
verbose = TRUE
)
save(EST.Maf, file = "../output/EST.PRAD_TCGA_plus_dbGap_Maf.RData")
save(nmf_matrix, file = "../output/Maf_matrix.RData")
Result
Let’s load the estimated results.
load(file = "../output/EST.seqz.W.all.RData")
load(file = "../output/EST.PRAD_TCGA_plus_dbGap_Maf.RData")
The most common approach is to use the cophenetic correlation coefficient. I show the measure vs. signature number below.
The best signature number is the one at which the cophenetic correlation value on the y-axis drops significantly (Gaujoux and Seoighe 2010). Here we take 5 as the optimal signature number. A value of 3 is too small to produce results with biological significance given the huge contribution of CNV in prostate cancer; a value of 8 is a little too big, so the result would be harder to interpret considering how little we currently know about the mechanisms and consequences of copy number events.
p = show_sig_number_survey(EST.seqz.W.all, right_y = NULL)
add_h_arrow(p, x = 5.2, y = 0.99, seg_len = 0.5, space = 0.2)
Next we take a look at SBS.
p = show_sig_number_survey(EST.Maf, right_y = NULL)
add_h_arrow(p, x = 3.2, y = 0.982, seg_len = 0.5, space = 0.2)
For SBS, we select 3. We know that 3 mutational signatures from TCGA prostate cancers are recorded in the COSMIC database (COSMIC 1, 5, 6), and the cophenetic correlation decreases sharply after 3, indicating that substantially less stability is achieved using more than 3 clusters.
Extract signatures
Now that we have determined the signature numbers, we extract the signatures with 50 runs.
Copy number:
# Extract signatures ------------------------------------------------------
Sig.CNV.seqz.W <- sig_extract(CNV.seqz.tally.W$nmf_matrix, n_sig = 5, nrun = 50, cores = ncores)
save(Sig.CNV.seqz.W, file = "../output/Sig.CNV.seqz.W.RData")
SBS:
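The SBS extraction is analogous; a sketch (the input matrix is the hyper-mutation-handled one from above, and the output path matches the file loaded later in this report):
Sig.SNV <- sig_extract(nmf_matrix, n_sig = 3, nrun = 50, cores = ncores)
save(Sig.SNV, file = "../output/Sig.PRAD_TCGA_plus_dbGap_Maf.RData")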
Show signature profile and aetiology
Now that we have the extracted signatures, we can show the signature profile.
SBS signature profile
# I provide two styles, 'default' and 'cosmic', in sigminer
show_sig_profile(Sig.SNV, mode = "SBS", style = "cosmic", x_label_angle = 90, x_label_vjust = 0.5)
SBS signatures are displayed based on the observed component frequency of the human genome, i.e., representing the relative proportions of mutations generated by each signature based on the actual trinucleotide frequencies of the reference human genome.
The aetiology of SBS signatures is clearer than that of copy number signatures. We can find the aetiologies of the 3 signatures above by computing their cosine similarity to COSMIC reference signatures.
We compare the 3 signatures to the COSMIC signature database v2 and v3.
sim_v2 = get_sig_similarity(Sig.SNV, sig_db = "legacy")
#> -Comparing against COSMIC signatures
#> ------------------------------------
#> --Found Sig1 most similar to COSMIC_3
#> Aetiology: defects in DNA-DSB repair by HR [similarity: 0.844]
#> --Found Sig2 most similar to COSMIC_15
#> Aetiology: defective DNA mismatch repair [similarity: 0.943]
#> --Found Sig3 most similar to COSMIC_1
#> Aetiology: spontaneous deamination of 5-methylcytosine [similarity: 0.935]
#> ------------------------------------
#> Return result invisiblely.
sim_v3 = get_sig_similarity(Sig.SNV, sig_db = "SBS")
#> -Comparing against COSMIC signatures
#> ------------------------------------
#> --Found Sig1 most similar to SBS40
#> Aetiology: Unknown [similarity: 0.883]
#> --Found Sig2 most similar to SBS15
#> Aetiology: Defective DNA mismatch repair [similarity: 0.967]
#> --Found Sig3 most similar to SBS1
#> Aetiology: spontaneous or enzymatic deamination of 5-methylcytosine [similarity: 0.966]
#> ------------------------------------
#> Return result invisiblely.
From the results above, we can basically confirm that the aetiology of Sig2 is dMMR (defective DNA mismatch repair) and the aetiology of Sig3 is aging. The aetiology of Sig1 may be HRD (defective homologous recombination-based DNA damage repair) or an unknown cause.
p = show_sig_profile(Sig.SNV, mode = "SBS", style = "cosmic", x_label_angle = 90, x_label_vjust = 0.5)
add_labels(p, x = 0.92, y = 0.3, y_end = 0.85, n_label = 3,
labels = rev(c("HRD or unknown", "dMMR", "Aging")), hjust = 1)
The exposure of signatures in each sample is plotted below.
We set a cutoff of 2000 to remove some outliers.
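A sketch of the plotting call (the exact arguments used for this figure are assumptions; a grouped version appears later in this report):
show_sig_exposure(Sig.SNV, cutoff = 2000, rm_space = TRUE, style = "cosmic")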
The Est_Counts panel shows the estimated SBS count.
We can clearly see that:
- most samples have very few SBS signature exposures.
- Sig1 and Sig3 are the major sources.
Copy number signature profile
show_sig_profile(Sig.CNV.seqz.W,
mode = "copynumber",
style = "cosmic", method = "W", normalize = "feature")
Similar to SBS signatures, copy number signatures are displayed based on the observed component frequency of the human genome. Of note, considering that the counting process of each feature is relatively independent, the profile is row-normalized by each feature, unlike Macintyre et al. (2018), who used column normalization (which can easily mislead readers); thus the bar heights can be compared within/between features.
There is no reference database for copy number signatures, so we cannot get the aetiologies of the 5 signatures by similarity computation. By reading the descriptions in previous reports (Macintyre et al. 2018; Yi and Ju 2018), studying the signature profiles carefully, and drawing support from the downstream analyses, we determine or narrow down the aetiologies of the 5 signatures. Please see our manuscript (TODO) for details.
The exposure of signatures in each sample is plotted below.
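As above, a sketch of the plotting call (arguments are assumptions):
show_sig_exposure(Sig.CNV.seqz.W, rm_space = TRUE, style = "cosmic")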
The Est_Counts panel shows the estimated copy number segment count.
Show similarity matrix
Check the signature exposure
Signature exposure is an estimate derived from the mutation calling records. Here we check it against the observed data by correlation analysis.
Mutation
Observe total SBS mutations from data.
load(file = "../output/PRAD_TCGA_plus_dbGap_Maf.RData")
Maf_dt = rbind(Maf@data, Maf@maf.silent)
Mut_dt = Maf_dt[Variant_Type == "SNP", .(Total = .N), by = Tumor_Sample_Barcode]
Estimate SBS mutations from signature exposure.
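A sketch of this step (Mut_expo with columns sample and Est matches the merge below; summing the absolute exposures of the three signatures to get the estimated total is an assumption):
Mut_expo <- get_sig_exposure(Sig.SNV, type = "absolute")
# Estimated total SBS count per sample = sum of the three signature exposures (assumption)
Mut_expo <- Mut_expo[, .(sample, Est = Sig1 + Sig2 + Sig3)]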
Merge the two data.tables above.
Mut_merged_dt = merge(Mut_dt, Mut_expo, by.x = "Tumor_Sample_Barcode", by.y = "sample")
# log10 the data
Mut_merged_dt[, `:=`(Total = log10(Total), Est = log10(Est))]
head(Mut_merged_dt)
#> Tumor_Sample_Barcode Total Est
#> 1: SRR2404024 1.6989700 1.6989700
#> 2: SRR2404067 0.9542425 0.9542425
#> 3: SRR2404095 1.9344985 1.9344984
#> 4: SRR2404185 1.7708520 1.7708520
#> 5: SRR2404230 1.8325089 1.8325089
#> 6: SRR2404278 1.9777236 1.9777236
Now we can see their correlation.
ggpubr::ggscatter(Mut_merged_dt, x = "Total", y = "Est", add = "reg.line",
xlab = "Log10 based total SBS counts (observed)",
ylab = "Log10 based total SBS counts (estimated)") +
ggpubr::stat_cor(label.x.npc = 0.5, label.y.npc = 0.8)
We can see that the NMF fits the SBS data almost perfectly.
Copy number
Observe total segment counts from copy number data.
Estimate segment counts from signature exposure.
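A sketch of these two steps (CN_dt and CN_expo match the merge below; counting rows of the segment table for the observed total and summing the absolute exposures for the estimate are assumptions):
# Observed total segment count per sample (assumption: count rows of the CopyNumber data slot)
CN_dt <- CNV.seqz@data[, .(Total = .N), by = sample]
# Estimated segment count per sample from absolute signature exposures (assumption)
CN_expo <- get_sig_exposure(Sig.CNV.seqz.W, type = "absolute")
CN_expo <- CN_expo[, .(sample, Est = Sig1 + Sig2 + Sig3 + Sig4 + Sig5)]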
Merge the two data.tables above.
CN_merged_dt = merge(CN_dt, CN_expo, by = "sample")
head(CN_merged_dt)
#> sample Total Est
#> 1: 00-000450-SRR474658 58 61.71629
#> 2: 00-1165-SRR461844 51 51.71139
#> 3: 00-160-SRR471373 65 68.69707
#> 4: 00-1823-SRR471613 75 79.23348
#> 5: 01-1934-SRR471793 80 84.30801
#> 6: 01-2382-SRR471433 51 54.44634
Now we can see their correlation.
ggpubr::ggscatter(CN_merged_dt, x = "Total", y = "Est", add = "reg.line",
                  xlab = "Total segment counts (observed)",
                  ylab = "Total segment counts (estimated)") +
  ggpubr::stat_cor(label.x.npc = 0.5, label.y.npc = 0.8, hjust = 1)
We can see that the estimated copy number segment count fits the observed data well. The residuals can be further visualized with a histogram.
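A sketch of such a histogram, assuming the merged table above:
ggplot2::ggplot(CN_merged_dt, ggplot2::aes(x = Est - Total)) +
  ggplot2::geom_histogram(bins = 30) +
  ggplot2::labs(x = "Estimated - observed segment count", y = "Number of samples")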
Association Analysis
This part has been divided into 2 sections: data integration and association analysis.
Data integration
In this section, I integrate all genotype/phenotype data as a tidy data table used for downstream analyses.
Load packages and prepared data.
# Integrate all information to sample level
library(tidyverse)
library(sigminer)
library(maftools)
# Loading clinical related data -------------------------------------------
Info <- readRDS("../data/PRAD_CLINICAL.rds")
# Purity and ploidy info from sequenza
PurityInfo <- read_tsv("../data/PRAD_Purity_and_Ploidy_Sequenza.tsv")
# Processing CNV data -----------------------------------------------------
load("../output/CNV.seqz.RData")
load("../output/Sig.CNV.seqz.W.RData")
CNV <- CNV.seqz
Sig.CNV <- Sig.CNV.seqz.W
rm(Sig.CNV.seqz.W, CNV.seqz)
After running NMF, we can get robust clusters from the consensus matrix; for more, see ?NMF::predict.
I also created the functions get_sig_exposure() to get the absolute/relative exposure of signatures and scoring() to get scores of typical copy number features. For more, run ?sigminer::get_sig_exposure and ?sigminer::scoring in your R console or see the package manual.
Relative exposure is more useful for clustering and absolute exposure is more useful for association analysis.
# CNV
CNVGroupInfo <- get_groups(Sig.CNV, method = "consensus", match_consensus = TRUE)
#> ℹ [2020-06-04 14:43:34]: Started.
#> ✓ [2020-06-04 14:43:34]: 'Signature' object detected.
#> ℹ [2020-06-04 14:43:34]: Obtaining clusters from the hierarchical clustering of the consensus matrix...
#> ℹ [2020-06-04 14:43:35]: Finding the dominant signature of each group...
#> => Generating a table of group and dominant signature:
#>
#> Sig1 Sig2 Sig3 Sig4 Sig5
#> 1 9 2 14 10 0
#> 2 0 60 0 3 0
#> 3 0 3 219 2 0
#> 4 0 0 0 170 0
#> 5 0 3 35 200 207
#> => Assigning a group to a signature with the maxium fraction (stored in 'map_table' attr)...
#> ℹ [2020-06-04 14:43:36]: Summarizing...
#> group #1: 35 samples with Sig3 enriched.
#> group #2: 63 samples with Sig2 enriched.
#> group #3: 224 samples with Sig3 enriched.
#> group #4: 170 samples with Sig4 enriched.
#> group #5: 445 samples with Sig5 enriched.
#> ! [2020-06-04 14:43:36]: The 'enrich_sig' column is set to dominant signature in one group, please check and make it consistent with biological meaning (correct it by hand if necessary).
#> ℹ [2020-06-04 14:43:36]: 1.517 secs elapsed.
CNVInfo <- CNV@summary.per.sample
CNVExposureInfo <- get_sig_exposure(Sig.CNV)
CNVscores <- scoring(CNV)
We find that group 1 is assigned to Sig3 because samples dominated by Sig3 have the maximum fraction in group 1. However, group 1 is the only group with Sig1 enriched, so we modify this result.
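The correction is a one-line change by hand, consistent with the same fix applied later in the 'Group Analysis' part:
# Group 1 is the only group containing Sig1-dominant samples, so set its enriched signature to Sig1 by hand
CNVGroupInfo$enrich_sig[CNVGroupInfo$group == "1"] <- "Sig1"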
Next, we process the mutation data to get TMB, driver info, clusters, etc. (details of some steps are described in the R scripts of the repo).
# Processing mutation data ------------------------------------------------
load(file = "../output/PRAD_TCGA_plus_dbGap_Maf.RData")
load(file = "../output/Sig.PRAD_TCGA_plus_dbGap_Maf.RData")
Maf_samp = data.table::data.table(
Tumor_Sample_Barcode = unique(rbind(Maf@data, Maf@maf.silent)$Tumor_Sample_Barcode)
)
TMBInfo <- Maf@variant.type.summary
TMBInfo$n_INDEL = TMBInfo$DEL + TMBInfo$INS
TMBInfo = TMBInfo[, .(Tumor_Sample_Barcode, n_INDEL, SNP)]
colnames(TMBInfo)[3] = "n_SNV"
TMBInfo = merge(Maf_samp, TMBInfo, by = "Tumor_Sample_Barcode", all = TRUE)
# Fill NAs
TMBInfo = TMBInfo %>%
dtplyr::lazy_dt() %>%
dplyr::mutate_at(vars(dplyr::starts_with("n_")), ~ifelse(is.na(.), 0, .)) %>%
data.table::as.data.table()
load(file = "../output/PRAD_driver_info.RData")
load(file = "../output/PRAD_heter_info.RData")
SNVGroupInfo <- get_groups(Sig.SNV, method = "consensus", match_consensus = TRUE)
#> ℹ [2020-06-04 14:43:40]: Started.
#> ✓ [2020-06-04 14:43:40]: 'Signature' object detected.
#> ℹ [2020-06-04 14:43:40]: Obtaining clusters from the hierarchical clustering of the consensus matrix...
#> ℹ [2020-06-04 14:43:43]: Finding the dominant signature of each group...
#> => Generating a table of group and dominant signature:
#>
#> Sig1 Sig2 Sig3
#> 1 16 34 21
#> 2 30 0 193
#> 3 677 0 0
#> => Assigning a group to a signature with the maxium fraction (stored in 'map_table' attr)...
#> ℹ [2020-06-04 14:43:44]: Summarizing...
#> group #1: 71 samples with Sig2 enriched.
#> group #2: 223 samples with Sig3 enriched.
#> group #3: 677 samples with Sig1 enriched.
#> ! [2020-06-04 14:43:44]: The 'enrich_sig' column is set to dominant signature in one group, please check and make it consistent with biological meaning (correct it by hand if necessary).
#> ℹ [2020-06-04 14:43:44]: 3.223 secs elapsed.
SNVExposureInfo <- get_sig_exposure(Sig.SNV)
# Processing gene and pathway mutation ------------------------------------
load(file = "../output/PRAD_gene_and_pathway_mutation.RData")
Next, we make sure the keys of all the data.frames are the same and merge them into one.
# Merge data --------------------------------------------------------------
Info <- Info %>%
dplyr::mutate(
CNV_ID = dplyr::case_when(
!startsWith(tumor_Run, "TCGA") & !is.na(tumor_Run) ~ paste(subject_id, tumor_Run, sep = "-"),
startsWith(tumor_Run, "TCGA") & !is.na(tumor_Run) ~ tumor_Run,
TRUE ~ NA_character_
)
)
PurityInfo
#> # A tibble: 938 x 3
#> purity ploidy sample
#> <dbl> <dbl> <chr>
#> 1 0.59 1.9 00-000450-SRR474658
#> 2 0.55 1.9 00-1165-SRR461844
#> 3 0.75 1.9 00-160-SRR471373
#> 4 0.72 2 00-1823-SRR471613
#> 5 0.73 2 01-1934-SRR471793
#> 6 0.4 1.9 01-2382-SRR471433
#> 7 0.45 4.1 01-2492-SRR471733
#> 8 0.73 2 01-2554-SRR471673
#> 9 0.22 3.4 01-28-SRR471073
#> 10 0.35 2 02-1082-SRR472616
#> # … with 928 more rows
colnames(CNVGroupInfo) <- c("sample", "cnv_group", "cnv_weight", "cnv_enrich_sig")
CNVInfo
#> sample n_of_cnv n_of_amp n_of_del n_of_vchr cna_burden
#> 1: TCGA-HI-7169-01 2 2 0 0 0.000
#> 2: TCGA-CH-5743-01 3 2 1 1 0.000
#> 3: TCGA-EJ-7791-01 3 3 0 1 0.000
#> 4: TCGA-FC-A66V-01 3 3 0 1 0.000
#> 5: TCGA-J4-AATV-01 3 3 0 0 0.000
#> ---
#> 933: 5115609-SRR8311697 303 291 12 22 0.856
#> 934: 5115412-SRR8311627 356 354 2 22 0.948
#> 935: 5115406-SRR8312068 367 337 30 22 0.297
#> 936: 36-SRR5876749 393 392 1 22 0.926
#> 937: 5115615-SRR8311749 420 406 14 22 0.360
colnames(CNVExposureInfo) <- c("sample", paste0("CN-", colnames(CNVExposureInfo)[-1]))
colnames(TMBInfo)[1] <- "sample"
TMBInfo[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
substr(sample, 1, 15),
sample
)]
colnames(DriverDF)[1] = "sample"
DriverDF[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
substr(sample, 1, 15),
sample
)]
colnames(SNVGroupInfo) <- c("sample", "snv_group", "snv_weight", "snv_enrich_sig")
SNVGroupInfo[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
substr(sample, 1, 15),
sample
)]
colnames(SNVExposureInfo) <- c("sample", paste0("SBS-", colnames(SNVExposureInfo)[-1]))
SNVExposureInfo[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
substr(sample, 1, 15),
sample
)]
colnames(TitvInfo) <- c("sample", "Ti_fraction", "Tv_fraction")
TitvInfo[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
substr(sample, 1, 15),
sample
)]
colnames(MathDF) <- c("sample", "MATH")
MathDF[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
substr(sample, 1, 15),
sample
)]
colnames(ClusterDF) <- c("sample", "n_mutation_cluster")
ClusterDF[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
substr(sample, 1, 15),
sample
)]
# data.table::setDT(summary_mutation)
# data.table::setDT(summary_pathway)
colnames(summary_mutation)[1] <- "sample"
colnames(summary_pathway)[1] <- "sample"
summary_mutation[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
substr(sample, 1, 15),
sample
)]
summary_pathway[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
substr(sample, 1, 15),
sample
)]
colnames(CNVInfo)[2:4] = c("n_CNV", "n_Amp", "n_Del")
MergeInfo <- Info[which(!is.na(Info$tumor_Run)), ] %>%
left_join(CNVInfo[, .(sample, n_CNV, n_Amp, n_Del)], by = c("CNV_ID" = "sample")) %>%
left_join(CNVscores, by = c("CNV_ID" = "sample")) %>%
left_join(CNVGroupInfo, by = c("CNV_ID" = "sample")) %>%
left_join(CNVExposureInfo, by = c("CNV_ID" = "sample")) %>%
left_join(PurityInfo, by = c("CNV_ID" = "sample")) %>%
left_join(summary_mutation, by = c("tumor_Run" = "sample")) %>%
left_join(summary_pathway, by = c("tumor_Run" = "sample")) %>%
left_join(TMBInfo, by = c("tumor_Run" = "sample")) %>%
left_join(DriverDF, by = c("tumor_Run" = "sample")) %>%
dplyr::mutate(
n_driver = ifelse(!is.na(n_driver), n_driver, 0)
) %>%
left_join(TitvInfo, by = c("tumor_Run" = "sample")) %>%
left_join(MathDF, by = c("tumor_Run" = "sample")) %>%
left_join(ClusterDF, by = c("tumor_Run" = "sample")) %>%
left_join(SNVGroupInfo, by = c("tumor_Run" = "sample")) %>%
left_join(SNVExposureInfo, by = c("tumor_Run" = "sample")) %>%
mutate(
Stage = factor(Stage, ordered = TRUE),
Fusion = ifelse(Fusion == "Negative", "No", "Yes"),
sample_type = ifelse(sample_type == "Unknown", "Metastatic", sample_type), # phs000909 are metastatic samples, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000909.v1.p1
HasFusion = Fusion,
HasFusion = ifelse(HasFusion == "Yes", TRUE, FALSE),
IsMetastatic = ifelse(sample_type == "Metastatic", TRUE, FALSE)
) %>%
select(-sTDP, -lTDP, -sTDP_size, -lTDP_size, -TDP_pnas)
MergeInfo$TDP = MergeInfo$TDP * MergeInfo$TDP_size
saveRDS(MergeInfo, file = "../output/PRAD_Merge_Info_CNV_from_sequenza_update.RDS")
Association analysis
Now we do association analysis between signatures (via exposure) and genotypes/phenotypes.
Prepare
Load packages.
Load the data.
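A minimal sketch of these two steps (the packages and the RDS path are taken from earlier parts of this report):
library(sigminer)
library(tidyverse)
df.seqz <- readRDS("../output/PRAD_Merge_Info_CNV_from_sequenza_update.RDS")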
Set the columns used for analysis.
colnames(df.seqz)[c(22:23)] = c("TD count", "TDP score")
colnames(df.seqz)[68] = "n_SBS"
# Just use ploidy from sequenza
df.seqz$Ploidy = NULL
# Merge BRCA1 and BRCA2
df.seqz[["BRCA1/2"]] = df.seqz$BRCA1 | df.seqz$BRCA2
# Merge PIK3CA and PIK3CB
df.seqz[["PIK3CA/B"]] = df.seqz$PIK3CA | df.seqz$PIK3CB
# Output for reporting as supp table
openxlsx::write.xlsx(df.seqz, file = "../output/supp_sample_table.xlsx")
# Save for further use
saveRDS(df.seqz, file = "../output/df.seqz.RDS")
cols_to_sigs.seqz <- c(paste0("CN-Sig", 1:5), paste0("SBS-Sig", 1:3))
# genes
cols_to_mutated_genes <- c(colnames(df.seqz)[c(37:40, 43:53)], "BRCA1/2", "PIK3CA/B")
# pathways
cols_to_mutated_pathways <- colnames(df.seqz)[54:65]
# Exclude PSA
cols_to_features <- c(
"IsMetastatic", "HasFusion",
sort(c("Age", "Stage", "GleasonScore",
"n_SBS", "MATH", "n_INDEL", "Ti_fraction", "Tv_fraction",
"cnaBurden", "n_CNV", "n_Amp", "n_Del", "ploidy",
"TDP score",
"Chromoth_state",
"purity"))
)
feature_type <- c(rep("ca", 2L), rep("co", length(cols_to_features) - 2L))
Signatures & features
Analyze the association between signatures and features.
tidy_data.seqz.feature <- get_sig_feature_association(df.seqz,
cols_to_sigs = cols_to_sigs.seqz,
cols_to_features = cols_to_features,
method_co = "pearson",
type = feature_type,
min_n = 2,
verbose = TRUE) %>%
get_tidy_association(p_adjust = TRUE) # adjust p value with FDR
#> -> Detecting and transforming possibly ordinal variables...
#> -> Calculating correlations for continuous variables using pearson...
#> -> Calculating correlations for categorical variables using wilcox.test...
#> --> building test models, be patient...
#> --> done
#> --> obtaining model results...
#> --> done
#> --> collecting data...
#> --> done
#> -> Done
Association of signature exposures with other features was performed using one of two procedures: for a continuous association variable (including ordinal variables), Pearson correlation was performed; for a binary association variable, samples were divided into two groups and the Mann-Whitney U-test was performed to test for differences in signature exposure medians between the two groups.
Show the result.
All circles shown are statistically significant results (FDR < 0.05). A feature is dropped from the plot if it has no correlation with any signature. The same applies to the similar plots below.
The result table:
Signatures & mutated genes
Analyze the association between signatures and mutated genes (only driver genes identified by MutSig).
tidy_data.seqz.gene <- get_sig_feature_association(df.seqz,
cols_to_sigs = cols_to_sigs.seqz,
cols_to_features = cols_to_mutated_genes,
min_n = 2,
type = "ca",
verbose = TRUE) %>%
get_tidy_association(p_adjust = TRUE)
#> -> Detecting and transforming possibly ordinal variables...
#> -> Calculating correlations for categorical variables using wilcox.test...
#> --> building test models, be patient...
#> --> done
#> --> obtaining model results...
#> --> done
#> --> collecting data...
#> --> done
#> -> Done
sum_genes = tidy_data.seqz.gene %>%
dplyr::filter(p <= 0.05) %>%
dplyr::summarise(min = min(measure, na.rm = T),
max = max(measure, na.rm = T),
sum = list(summary(measure)))
sum_genes$sum
#> [[1]]
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -17.300 -3.830 6.675 17.086 20.550 286.000
# myPalette <- colorRampPalette(rev(brewer.pal(11, "RdYlBu")))
# sc <- scale_colour_gradientn(colours = myPalette(100), limits=c(-20, 600))
sc <- scale_colour_gradientn(
colors = c("lightblue", "white", "orange", "red"),
values = scales::rescale(c(
sum_genes$min,
0,
20,
sum_genes$max
)))
show_sig_feature_corrplot(tidy_data.seqz.gene,
ylab = "Mutated genes",
ca_gradient_colors = sc,
p_val = 0.05) # p_val needs to be kept in line with the sum_genes filter p <= 0.05
The result table:
Signatures & mutated pathways
Analyze the association between signatures and mutated pathways.
tidy_data.seqz.pathways <- get_sig_feature_association(df.seqz,
cols_to_sigs = cols_to_sigs.seqz,
cols_to_features = cols_to_mutated_pathways,
type = "ca",
min_n = 2,
verbose = TRUE) %>%
get_tidy_association(p_adjust = TRUE)
#> -> Detecting and transforming possibly ordinal variables...
#> -> Calculating correlations for categorical variables using wilcox.test...
#> --> building test models, be patient...
#> --> done
#> --> obtaining model results...
#> --> done
#> --> collecting data...
#> --> done
#> -> Done
sum_pathways = tidy_data.seqz.pathways %>%
dplyr::filter(p <= 0.05) %>%
dplyr::summarise(min = min(measure, na.rm = T),
max = max(measure, na.rm = T),
sum = list(summary(measure)))
sum_pathways$sum
#> [[1]]
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -4.130 7.393 16.700 61.573 75.125 768.000
sc <- scale_colour_gradientn(
colors = c("lightblue", "white", "orange", "red"),
values = scales::rescale(c(
sum_pathways$min,
0,
20,
sum_pathways$max
)))
show_sig_feature_corrplot(tidy_data.seqz.pathways,
ylab = "Mutated pathways",
ca_gradient_colors = sc,
p_val = 0.05)
The result table:
Correlation network
To determine the structure of the relationships between signatures and features, we perform correlation network analysis on the continuous variables.
f_dt = df.seqz[, c(cols_to_sigs.seqz, cols_to_features[c(-1, -2)])] %>%
dplyr::mutate_if(is.ordered, as.numeric)
library(corrr)
res.cor <- correlate(f_dt,
use = "pairwise.complete.obs",
method = "pearson"
) %>%
dplyr::mutate_if(is.numeric, dplyr::coalesce, 0)
#>
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'
## Avoid corrr error
## https://github.com/tidymodels/corrr/issues/86
## https://github.com/tidymodels/corrr/issues/34
## https://github.com/tidymodels/corrr/issues/78
set.seed(1234)
res.cor[,-1] = res.cor[, -1] + matrix(
rnorm(nrow(res.cor[, -1]) * ncol(res.cor[, -1]), mean = 0, sd = 0.01),
nrow = nrow(res.cor[, -1]),
byrow = TRUE
)
p = res.cor %>%
network_plot(min_cor = 0.2,
colours = rev(c("indianred2", "white", "skyblue1")))
#png("corrr_network.png", width = 7, height = 7, res = 300)
p + scale_size(range=c(0.1, 2))
#> Scale for 'size' is already present. Adding another scale for 'size', which
#> will replace the existing scale.
Variables that are more highly correlated appear closer together and are joined by stronger paths. Paths are also colored by their sign. The proximity of the points is determined using multidimensional clustering.
Signatures & PSA
Of all the datasets we collected, 3 have PSA info. However, the range of PSA values differs across the 3 datasets, so I excluded PSA from the features above and analyze it independently.
df.seqz.psa = df.seqz %>%
dplyr::filter(!is.na(PSA)) %>%
dplyr::select(contains("Sig", ignore.case = F), c("PSA", "Study", "CNV_ID"))
df.seqz.psa <- df.seqz.psa %>%
dplyr:::group_by(Study) %>%
tidyr::nest() %>%
dplyr::summarise(
data = purrr:::map(data, function(x) {
get_sig_feature_association(x,
cols_to_sigs = cols_to_sigs.seqz,
cols_to_features = "PSA",
method_co = "pearson",
type = "co", verbose = TRUE) %>%
get_tidy_association(p_adjust = TRUE)
})
) %>%
tidyr::unnest("data") %>%
dplyr::filter(feature == "PSA") %>%
dplyr::mutate(feature = Study)
#> -> Detecting and transforming possibly ordinal variables...
#> -> Calculating correlations for continuous variables using pearson...
#> -> Done
#> -> Detecting and transforming possibly ordinal variables...
#> -> Calculating correlations for continuous variables using pearson...
#> -> Done
#> -> Detecting and transforming possibly ordinal variables...
#> -> Calculating correlations for continuous variables using pearson...
#> -> Done
Although we can observe statistically significant correlations between signatures and PSA, the results are inconsistent across studies. This may be explained by the dynamic property of PSA rather than by sample type, since most of the samples are primary.
Group Analysis
This part has been divided into four sections to show the differences across sample groups defined by signature exposure. It provides insight into the heterogeneity of prostate cancers and the relative importance of the two types of genomic variation signatures.
Load packages.
Load the data.
Group comparison
cols_to_features <- c(
"IsMetastatic", "Stage", "GleasonScore",
sort(c("Age",
"n_SBS",
"n_INDEL",
"Ti_fraction", "Tv_fraction",
"cnaBurden", "n_CNV", "n_Amp", "n_Del", "ploidy",
"TDP score",
"Chromoth_state",
"MATH", "purity"))
)
feature_type <- c(rep("ca", 3L), rep("co", length(cols_to_features) - 3L))
df.seqz$GleasonScore = as.character(df.seqz$GleasonScore)
df.seqz$GleasonScore = factor(df.seqz$GleasonScore, levels = as.character(6:10))
df.seqz2 = df.seqz
df.seqz2$cnv_enrich_sig = ifelse(
!is.na(df.seqz2$cnv_enrich_sig),
paste(df.seqz2$cnv_enrich_sig, "enriched"),
NA
)
grp.cnv <- get_group_comparison(
df.seqz2,
col_group = "cnv_enrich_sig", # or "cnv_group"
cols_to_compare = cols_to_features,
type = feature_type
)
plot.cnv <- show_group_comparison(
grp.cnv,
xlab = NA,
method = "anova",
text_angle_x = 60,
text_hjust_x = 1,
legend_position_ca = "right",
label.x.npc = "center",
label.y.npc = "top",
label = "p.format",
hjust = 0.5
)
df.seqz2$snv_enrich_sig = ifelse(
!is.na(df.seqz2$snv_enrich_sig),
paste(df.seqz2$snv_enrich_sig, "enriched"),
NA
)
grp.snv <- get_group_comparison(
df.seqz2,
col_group = "snv_enrich_sig", # or "snv_group"
cols_to_compare = cols_to_features,
type = feature_type
)
plot.snv <- show_group_comparison(
grp.snv,
xlab = NA,
method = "anova",
text_angle_x = 60,
text_hjust_x = 1,
legend_position_ca = "right",
label.x.npc = "center",
label.y.npc = "top",
label = "p.format",
hjust = 0.5
)
What if there are 5 SBS signatures and samples are divided into 5 groups?
load("../output/Sig5.Maf.RData")
sbs_grp = get_groups(Sig.SNV5, method = "consensus", match_consensus = TRUE)
#>
#> Sig1 Sig2 Sig3 Sig4 Sig5
#> 1 0 0 3 33 1
#> 2 0 0 0 0 198
#> 3 69 21 7 3 42
#> 4 1 14 216 13 44
#> 5 0 272 1 4 29
sbs_grp$sample = ifelse(startsWith(sbs_grp$sample, "TCGA"),
substr(sbs_grp$sample, 1, 15),
sbs_grp$sample)
df.seqz3 = dplyr::left_join(
sbs_grp,
df.seqz2,
by = c("sample" = "tumor_Run")
)
df.seqz3$enrich_sig = ifelse(
!is.na(df.seqz3$enrich_sig),
paste(df.seqz3$enrich_sig, "enriched"),
NA
)
grp.snv2 <- get_group_comparison(
df.seqz3,
col_group = "enrich_sig", # or "snv_group"
cols_to_compare = cols_to_features,
type = feature_type
)
Group mapping
Grouped signature exposure profile
We have shown the exposure profiles; here we add group annotations to them.
Copy number signatures
map_df = df.seqz %>%
dplyr::select(cnv_enrich_sig, CNV_ID) %>%
dplyr::filter(!is.na(CNV_ID))
groups = map_df$cnv_enrich_sig
names(groups) = map_df$CNV_ID
p = show_sig_exposure(Sig.CNV.seqz.W,
groups = groups,
#grp_order = ,
rm_space = TRUE,
style = "cosmic",
rm_grid_line = FALSE,
rm_panel_border = TRUE,
grp_size = 7)
p
SBS signatures
map_df = df.seqz %>%
dplyr::select(tumor_Run, snv_enrich_sig) %>%
dplyr::filter(!is.na(tumor_Run))
groups = map_df$snv_enrich_sig
names(groups) = map_df$tumor_Run
Sig.SNV2 = Sig.SNV
TCGA_INDEX = startsWith(colnames(Sig.SNV2$Exposure), "TCGA")
colnames(Sig.SNV2$Exposure)[TCGA_INDEX] = substr(colnames(Sig.SNV2$Exposure)[TCGA_INDEX], 1, 15)
p = show_sig_exposure(Sig.SNV2,
groups = groups,
cutoff = 2000,
rm_space = TRUE,
style = "cosmic",
rm_grid_line = FALSE,
rm_panel_border = TRUE)
p
Copy number profile of different groups
There are many differences across the copy number groups. Here we go further and show some classic cases by copy number profile.
load("../output/CNV.seqz.RData")
load("../output/Sig.CNV.seqz.W.RData")
load("../output/CNV.seqz.tally.W.RData")
There are 5 copy number groups dominated by 4 copy number signatures.
table(df.seqz$cnv_group, df.seqz$cnv_enrich_sig)
#>
#> Sig1 Sig2 Sig3 Sig4 Sig5
#> 1 35 0 0 0 0
#> 2 0 63 0 0 0
#> 3 0 0 224 0 0
#> 4 0 0 0 170 0
#> 5 0 0 0 0 445
Get the groups and relative exposures and merge them.
cn_group <- get_groups(Sig.CNV.seqz.W, method = "consensus", match_consensus = TRUE)
#>
#> Sig1 Sig2 Sig3 Sig4 Sig5
#> 1 9 2 14 10 0
#> 2 0 60 0 3 0
#> 3 0 3 219 2 0
#> 4 0 0 0 170 0
#> 5 0 3 35 200 207
cn_group$enrich_sig[cn_group$group == "1"] = "Sig1"
cn_expo <- get_sig_exposure(Sig.CNV.seqz.W, type = "relative")
df <- dplyr::left_join(cn_group, cn_expo)
Now we can use the data to find cases with a specific signature enriched.
Sig1 cases
In the raw consensus result, the group assigned to Sig1 is actually dominated by Sig3, so we need to take care of it (hence the manual correction above).
samps_to_show = df %>%
dplyr::filter(enrich_sig == "Sig1") %>%
dplyr::arrange(dplyr::desc(Sig1)) %>%
dplyr::slice(1:6) %>% dplyr::pull(sample)
samps_to_show
#> [1] "5115056-SRR8311885" "TCGA-CH-5748-01" "18-SRR5876729"
#> [4] "5115276-SRR8312040" "3-SRR4044023" "5115198-SRR8311993"
show_cn_profile(
data = CNV.seqz, nrow = 3, ncol = 2, show_title = T,
samples = samps_to_show
)
Of note, only the segments with high copy number values are attributed to Sig1.
Sig2 cases
samps_to_show = df %>%
dplyr::filter(enrich_sig == "Sig2") %>%
dplyr::arrange(dplyr::desc(Sig2)) %>%
dplyr::slice(1:6) %>% dplyr::pull(sample)
samps_to_show
#> [1] "5115615-SRR8311749" "5115406-SRR8312068" "45-SRR5876761"
#> [4] "36-SRR5876749" "1115153-SRR2404067" "5115152-SRR8311933"
show_cn_profile(
data = CNV.seqz, nrow = 3, ncol = 2, show_title = T,
samples = samps_to_show
)
Sig3 cases
samps_to_show = df %>%
dplyr::filter(enrich_sig == "Sig3") %>%
dplyr::arrange(dplyr::desc(Sig3)) %>%
dplyr::slice(1:6) %>% dplyr::pull(sample)
samps_to_show
#> [1] "TCGA-HC-7745-01" "1115082-SRR8304741" "TCGA-KK-A7B4-01"
#> [4] "TCGA-ZG-A9L0-01" "6115317-SRR8307505" "1115091-SRR8304759"
show_cn_profile(
data = CNV.seqz, nrow = 3, ncol = 2, show_title = T,
samples = samps_to_show
)
Sig4 cases
samps_to_show = df %>%
dplyr::filter(enrich_sig == "Sig4") %>%
dplyr::arrange(dplyr::desc(Sig4)) %>%
dplyr::slice(1:6) %>% dplyr::pull(sample)
samps_to_show
#> [1] "TCGA-KK-A59X-01" "06-1749-SRR472297" "TCGA-XK-AAJA-01"
#> [4] "05-3852-SRR472039" "03-1426-SRR465855" "TCGA-J4-A67N-01"
show_cn_profile(
data = CNV.seqz, nrow = 3, ncol = 2, show_title = T,
samples = samps_to_show
)
Sig5 cases
samps_to_show = df %>%
dplyr::filter(enrich_sig == "Sig5") %>%
dplyr::arrange(dplyr::desc(Sig5)) %>%
dplyr::slice(1:6) %>% dplyr::pull(sample)
samps_to_show
#> [1] "TCGA-KC-A7FE-01" "TCGA-J4-AATV-01" "TCGA-HI-7169-01" "TCGA-FC-A66V-01"
#> [5] "TCGA-EJ-7791-01" "09-120-SRR460104"
show_cn_profile(
data = CNV.seqz, nrow = 3, ncol = 2, show_title = T,
samples = samps_to_show
)
Sig5-enriched cases with the most breakpoints (arm-level alterations).
samps_to_show = df %>%
dplyr::filter(enrich_sig == "Sig5") %>%
dplyr::left_join(as.data.frame(CNV.seqz.tally.W$nmf_matrix) %>%
tibble::rownames_to_column("sample") %>%
dplyr::select(sample, `BPArm[2]`, `BPArm[1]`)) %>%
dplyr::mutate(BK = `BPArm[2]` + `BPArm[1]`) %>%
dplyr::arrange(desc(BK)) %>%
dplyr::slice(1:6) %>% dplyr::pull(sample)
samps_to_show
#> [1] "TCGA-EJ-5515-01" "32-SRR4043905" "TCGA-EJ-5509-01"
#> [4] "6115124-SRR8307108" "1115265-SRR8304838" "TCGA-EJ-5532-01"
show_cn_profile(
data = CNV.seqz, nrow = 3, ncol = 2, show_title = T,
samples = samps_to_show
)
Selected cases
Select one representative sample from each case above.
p = show_cn_profile(
data = CNV.seqz, nrow = 5, ncol = 1, show_title = T,
samples = c("5115056-SRR8311885", "5115615-SRR8311749", "TCGA-KK-A7B4-01", "TCGA-KK-A59X-01", "TCGA-KC-A7FE-01")
)
p
pdf("Selected_cases_circos_heatmap.pdf", width = 2.5, height = 13)
samps = c("5115056-SRR8311885", "5115615-SRR8311749", "TCGA-KK-A7B4-01", "TCGA-KK-A59X-01", "TCGA-KC-A7FE-01")
opar = par(no.readonly = TRUE)
print(par("mar"))
par(mar = rep(0, 4))
layout(matrix(1:5, ncol = 1))
for (samp in samps) {
circlize::circos.par(track.margin=c(0,0))
show_cn_circos(CNV.seqz, samples = samp, show_title = FALSE,
col = circlize::colorRamp2(c(1, 2, 4), c("blue", "white", "red")),
line_lwd = par("lwd") / 4)
}
layout(1)
par(opar)
dev.off()
# p_gg = show_cn_profile(
# data = CNV.seqz, nrow = 6, ncol = 1, show_title = T,
# samples = samps
# )
#
# p_base = function() {
# opar = par(no.readonly = TRUE)
# par(mar = c(0, 0, 0, 0))
# on.exit(par(opar))
#
# layout(matrix(1:5, ncol = 1))
# show_cn_circos(CNV.seqz, samples = samps, show_title = FALSE, line_lwd = par("lwd") / 4)
# layout(1)
# }
# cowplot::plot_grid(
# p_gg,
# p_base,
# rel_widths = c(3, 1),
# scale = c(1.0, 0.9)
# )
Survival Analysis
To check the clinical relevance of the signatures, in this part we perform survival analysis with the Cox model.
Prepare
Load package ezcox for batch Cox analysis.
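The package-loading chunk is not echoed above; a minimal equivalent, assuming ezcox is installed:
library(ezcox)  # batch Cox analysis; show_forest() is used below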
Load the data.
df.seqz = readRDS(file = "../output/df.seqz.RDS")
surv_df <- readRDS(file = "../data/PRAD_Survival.rds")
Clean the data to generate a data table for survival analysis.
range01 <- function(x, ...) {
(x - min(x, ...)) / (max(x, ...) - min(x, ...))
}
cols_to_sigs.seqz <- c(paste0("CN-Sig", 1:5), paste0("SBS-Sig", 1:3))
surv_dt <- dplyr::inner_join(df.seqz %>%
dplyr::select(-c("Study", "subject_id", "tumor_body_site",
"tumor_Run", "normal_Run", "CNV_ID",
"Fusion")),
surv_df, by = c("PatientID" = "sample"))
dup_ids = which(duplicated(surv_dt$PatientID))
# Remove duplicated records (guard against the case of no duplicates,
# where surv_dt[-integer(0), ] would drop all rows)
if (length(dup_ids) > 0) surv_dt = surv_dt[-dup_ids, ]
# Scale the signature exposure by dividing by 10
# Scale some variables to 0-20
surv_dt = surv_dt %>%
dplyr::select(-Tv_fraction) %>%
dplyr::mutate(cnaBurden = 20 * cnaBurden,
Stage = as.character(Stage) %>% factor(levels = c("T2", "T3", "T4"))) %>%
dplyr::mutate_at(cols_to_sigs.seqz, ~./10) %>%
dplyr::mutate_at(c("Ti_fraction", "purity"), ~ 20 * range01(., na.rm = TRUE)) %>%
dplyr::mutate(OS.time = OS.time / 30.5,
PFI.time = PFI.time / 30.5)
# Scale the signature exposure to 0-20.
# To better understand the effect of signature exposure on clinical events,
# here we scale them into range 0-20. 1 increase means 5% increase of signature exposure.
#
# Scale the CNA burden to 0-20.
#
# keep only ti_fraction, tv result will be 1/ti_result
# also scale it to 0-20
# also scale purity to 0-20
# surv_dt = surv_dt %>%
# dplyr::select(-Tv_fraction) %>%
# dplyr::mutate(cnaBurden = 20 * cnaBurden,
# Stage = as.character(Stage) %>% factor(levels = c("T2", "T3", "T4"))) %>%
# dplyr::mutate_at(c(cols_to_sigs.seqz, "Ti_fraction", "purity"), ~ 20 * range01(., na.rm = TRUE))
saveRDS(surv_dt, file = "../output/PRAD_merged_survival_dt_seqz.rds")
OS analysis for signatures
############
surv_dt <- readRDS(file = "../output/PRAD_merged_survival_dt_seqz.rds")
# surv_dt <- surv_dt %>% dplyr::filter(startsWith(PatientID, "TCGA"))
cols_to_sigs.seqz <- c(paste0("CN-Sig", 1:5), paste0("SBS-Sig", 1:3))
# Univariable analysis
p = show_forest(surv_dt,
covariates = cols_to_sigs.seqz,
time = "OS.time", status = "OS", merge_models = TRUE)
p
# Multivariable analysis
# p = show_forest(surv_dt, covariates = cols_to_sigs.seqz, controls = c("GleasonScore", "purity", "Stage"),
# time = "OS.time", status = "OS", merge_models = TRUE, drop_controls = TRUE)
p = show_forest(surv_dt,
covariates = cols_to_sigs.seqz, controls = c("GleasonScore", "purity", "Stage"),
time = "OS.time", status = "OS")
p
PFS analysis for signatures
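The PFS chunk is not echoed here. A minimal sketch mirroring the OS analysis above, assuming the PFI and PFI.time columns prepared earlier are used as the event and time variables:
# Univariable analysis
p = show_forest(surv_dt,
                covariates = cols_to_sigs.seqz,
                time = "PFI.time", status = "PFI", merge_models = TRUE)
p
# Multivariable analysis adjusting for clinical covariates
p = show_forest(surv_dt,
                covariates = cols_to_sigs.seqz, controls = c("GleasonScore", "purity", "Stage"),
                time = "PFI.time", status = "PFI")
p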
OS analysis for features
cols_to_features <- c(
"Age", "Stage", "GleasonScore",
"n_SBS",
"n_INDEL",
"n_CNV", "n_Amp", "n_Del",
"cnaBurden",
"Ti_fraction",
"TDP score",
"Chromoth_state",
"MATH",
"purity",
"ploidy"
)
p = show_forest(surv_dt,
covariates = cols_to_features, controls = NULL,
time = "OS.time", status = "OS", merge_models = TRUE, limits = log(c(0.5, 10)))
p
PFS analysis for features
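Likewise, a minimal sketch for the features, reusing cols_to_features from the OS analysis and swapping in the PFI columns:
p = show_forest(surv_dt,
                covariates = cols_to_features, controls = NULL,
                time = "PFI.time", status = "PFI", merge_models = TRUE, limits = log(c(0.5, 10)))
p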
K-M curves
library(survival)
library(survminer)
legend_2 <- ggplot2::guides(color = ggplot2::guide_legend(nrow = 2, byrow = TRUE))
fit <- survfit(Surv(OS.time, OS) ~ 1, data = subset(surv_dt, sample_type == "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)")$plot
fit <- survfit(Surv(OS.time, OS) ~ 1, data = subset(surv_dt, sample_type != "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)")$plot
fit <- survfit(Surv(PFI.time, PFI) ~ 1, data = subset(surv_dt, sample_type == "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)", ylab = "Disease progression probability")$plot
fit <- survfit(Surv(OS.time, OS) ~ cnv_enrich_sig, data = subset(surv_dt, sample_type == "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)", pval = TRUE)$plot + legend_2
fit <- survfit(Surv(OS.time, OS) ~ cnv_enrich_sig, data = subset(surv_dt, sample_type != "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)", pval = TRUE)$plot + legend_2
fit <- survfit(Surv(OS.time, OS) ~ snv_enrich_sig, data = subset(surv_dt, sample_type == "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)", pval = TRUE)$plot + legend_2
fit <- survfit(Surv(OS.time, OS) ~ snv_enrich_sig, data = subset(surv_dt, sample_type != "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)", pval = TRUE)$plot + legend_2
fit <- survfit(Surv(PFI.time, PFI) ~ cnv_enrich_sig, data = subset(surv_dt, sample_type == "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)", ylab = "Disease progression probability",
pval = TRUE)$plot + legend_2
fit <- survfit(Surv(PFI.time, PFI) ~ snv_enrich_sig, data = subset(surv_dt, sample_type == "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)", ylab = "Disease progression probability",
pval = TRUE)$plot + legend_2
No PFI data for metastatic tumors.
Variation Summary
This part describes the overall landscape of genomic variations in prostate cancer. These summaries are not directly related to the core analyses reported above.
library(sigminer)
library(maftools)
load("../output/CNV.seqz.RData")
load("../output/CNV.seqz.tally.W.RData")
load("../output/PRAD_TCGA_plus_dbGap_Maf.RData")
Oncoplot
A simple oncoplot.
Reference: maftools vignette: http://bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/oncoplots.html
To get a comprehensive and meaningful oncoplot (not affected by passengers), here we include GISTIC2, MutSig and MAF information together.
Before plotting, we modify a result file from GISTIC2: this file uses a different sample ID system than the MAF, so we must make them consistent.
all.lesions <- "../data/all_lesions.conf_99.txt"
dt = data.table::fread(all.lesions)
dt_names = colnames(dt)
dt_names2 = dt_names
dt_names = ifelse(
stringr::str_detect(dt_names, "SRR"),
stringr::str_extract(dt_names, "SRR[0-9]+$"),
dt_names
)
names(dt_names) = dt_names2
maf_names = maftools::getSampleSummary(Maf)$Tumor_Sample_Barcode %>% as.character()
tcga_names = maf_names[startsWith(maf_names, "TCGA")]
names(tcga_names) = substr(tcga_names, 1, 15)
dt_names[startsWith(dt_names, "TCGA")] = tcga_names[dt_names[startsWith(dt_names, "TCGA")]]
dt_names[is.na(dt_names)] = names(dt_names[is.na(dt_names)])
colnames(dt) = dt_names %>% as.character()
data.table::fwrite(dt, file = "../data/all_lesions.conf_99_v2.txt", sep = "\t")
# GISTIC2 data
all.lesions <- "../data/all_lesions.conf_99_v2.txt"
amp.genes <- "../data/amp_genes.conf_99.txt"
del.genes <- "../data/del_genes.conf_99.txt"
scores.gis <- "../data/scores.gistic"
# MutSig2CV data
mutsig <- "../data/PRAD.sig_genes.txt"
maf_plus_gistic <- read.maf(
maf = rbind(Maf@data, Maf@maf.silent),
gisticAllLesionsFile = all.lesions,
gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes,
gisticScoresFile = scores.gis, isTCGA = FALSE, verbose = FALSE
)
oncoplot(maf_plus_gistic,
mutsig = mutsig, mutsigQval = 0.05, sortByMutation = TRUE,
logColBar = TRUE, sepwd_samples = 0, draw_titv = TRUE
)
This plot does not look right; we need to discuss it with the author of maftools…
Remove GISTIC2 data from oncoplot.
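The MAF-only chunk is not echoed here; a minimal sketch that reuses the same arguments as above but drops the GISTIC2 input:
oncoplot(maf = Maf,
         mutsig = mutsig, mutsigQval = 0.05, sortByMutation = TRUE,
         logColBar = TRUE, sepwd_samples = 0, draw_titv = TRUE)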
Sample summary
Here we summarize variations for each sample with table.
Copy number data
Summary plots
MAF
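The summary-plot chunk is not echoed here; a minimal sketch using the maftools dashboard summary:
plotmafSummary(maf = Maf, rmOutlier = TRUE, addStat = "median", dashboard = TRUE)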
Transition and Transversions
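The TiTv chunk is not echoed here; a minimal sketch with maftools:
titv_res = titv(maf = Maf, plot = FALSE, useSyn = TRUE)
plotTiTv(res = titv_res)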
Somatic Interactions
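The call that produced the table below is not echoed; a minimal sketch with maftools (the top-25 genes and p-value cutoffs are illustrative choices):
somaticInteractions(maf = Maf, top = 25, pvalue = c(0.05, 0.1))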
#> gene1 gene2 pValue oddsRatio 00 11 01 10 Event
#> 1: ABCA13 KMT2C 5.107880e-08 10.2128843 896 13 42 27 Co_Occurence
#> 2: RYR3 ABCA13 1.510688e-05 9.8852463 915 8 32 23 Co_Occurence
#> 3: OBSCN TTN 2.087377e-05 4.7803808 841 15 94 28 Co_Occurence
#> 4: KMT2C KMT2D 3.687480e-05 5.3043820 877 12 46 43 Co_Occurence
#> 5: FOXA1 CDK12 4.320195e-05 5.8603172 873 11 23 71 Co_Occurence
#> ---
#> pair event_ratio
#> 1: ABCA13, KMT2C 13/69
#> 2: ABCA13, RYR3 8/55
#> 3: OBSCN, TTN 15/122
#> 4: KMT2C, KMT2D 12/89
#> 5: CDK12, FOXA1 11/94
#> ---
#> [ reached getOption("max.print") -- omitted 5 rows ]
Copy number
Copy number segment distribution
The distribution can be viewed as a histogram of segment length.
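The histogram chunk is not echoed here; a minimal ggplot2 sketch on the per-segment length fraction stored in the annotation slot (the bin number is arbitrary):
library(ggplot2)
ggplot(data.frame(fraction = CNV.seqz@annotation$fraction), aes(x = fraction)) +
  geom_histogram(bins = 50) +
  labs(x = "Segment length fraction", y = "Count")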
Length summary is:
summary(CNV.seqz@annotation$fraction)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0000001 0.0175022 0.1578263 0.4650056 0.8146213 1.9997271
If we treat SCNAs with length fraction > 0.7 as chromosome-arm-level or whole-chromosome-level alterations, the fraction of such alterations is:
print(nrow(CNV.seqz@annotation[fraction > 0.7 & segVal != 2]) / nrow(CNV.seqz@annotation[segVal != 2]))
#> [1] 0.1489081
The distribution can also be viewed from the alteration load across chromosomes.
Feature distribution
Component distribution
Data details can be viewed as a table:
The n_obs column indicates the number of events for the corresponding copy number component.
Supplementary Analysis and Visualization
This part shows some supplementary analyses and descriptions.
FACETS results
We used 2 methods (Sequenza and FACETS) for copy number calling. However, only Sequenza was used for the downstream analyses above because FACETS generated unreasonable copy number profiles for a few samples. We discuss this here.
Let’s load the packages and signature data from FACETS and Sequenza.
library(sigminer)
library(ggplot2)
load("../output/Sig.CNV.facets.W.RData")
load("../output/Sig.CNV.seqz.W.RData")
load("../output/EST.facets.W.all.RData")
According to the estimation of signature number, 6 signatures were extracted.
p <- show_sig_number_survey(EST.facets.W.all, right_y = NULL)
add_h_arrow(p, x = 6.2, y = 0.986, seg_len = 0.5, space = 0.2)
The resulting signatures are shown below.
show_sig_profile(Sig.CNV.facets.W,
mode = "copynumber",
style = "cosmic", method = "W", normalize = "feature") +
labs(caption = "Signatures from FACETS")
The FACETS result has one extra signature (Sig1) compared with the Sequenza result. Sig1 is characterized by copy number segments with a high percentage of the component “copy number value = 0”.
The other signatures are not identical but very similar in most features between the FACETS and Sequenza results.
We check some samples with Sig1 enriched.
cn_groups = get_groups(Sig.CNV.facets.W, method = "consensus", match_consensus = TRUE)
#> ℹ [2020-06-04 14:49:04]: Started.
#> ✓ [2020-06-04 14:49:04]: 'Signature' object detected.
#> ℹ [2020-06-04 14:49:04]: Obtaining clusters from the hierarchical clustering of the consensus matrix...
#> ℹ [2020-06-04 14:49:06]: Finding the dominant signature of each group...
#> => Generating a table of group and dominant signature:
#>
#> Sig1 Sig2 Sig3 Sig4 Sig5 Sig6
#> 1 0 0 27 1 0 0
#> 2 19 2 0 7 0 0
#> 3 0 92 1 2 0 0
#> 4 7 52 8 59 9 5
#> 5 0 0 0 188 0 0
#> 6 5 2 0 155 0 292
#> => Assigning a group to a signature with the maxium fraction (stored in 'map_table' attr)...
#> ℹ [2020-06-04 14:49:06]: Summarizing...
#> group #1: 28 samples with Sig3 enriched.
#> group #2: 28 samples with Sig1 enriched.
#> group #3: 95 samples with Sig2 enriched.
#> group #4: 140 samples with Sig4 enriched.
#> group #5: 188 samples with Sig4 enriched.
#> group #6: 454 samples with Sig6 enriched.
#> ! [2020-06-04 14:49:06]: The 'enrich_sig' column is set to dominant signature in one group, please check and make it consistent with biological meaning (correct it by hand if necessary).
#> ℹ [2020-06-04 14:49:06]: 1.518 secs elapsed.
cn_expo = get_sig_exposure(Sig.CNV.facets.W, type = "relative")
# Focus on Sig1
df = dplyr::left_join(cn_groups, cn_expo)
#> Joining, by = "sample"
load("../output/CNV.facets.RData")
samps_to_show <- df %>%
dplyr::filter(enrich_sig == "Sig1") %>%
dplyr::arrange(dplyr::desc(Sig1)) %>%
dplyr::slice(1:6) %>%
dplyr::pull(sample)
samps_to_show
#> [1] "915-1115081" "141-6" "447-STID0000003036"
#> [4] "915-5115031" "141-25" "915-1115090"
show_cn_profile(
data = CNV.facets, chrs = paste0("chr", c(1:22, "X")), nrow = 3, ncol = 2, show_title = T,
samples = samps_to_show
)
We can observe that the samples with Sig1 enriched have many homozygous deletions, which is unreasonable. We communicated with the author of FACETS (issue link), who replied:
“So only logical explanation is homozygous deletion. However they are too many of them and so I believe this is assay artifact rather than real.”
This problem seems to come from the raw data. However, we checked the data, and this signature cannot be found in the Sequenza result.
To summarize, we used the Sequenza result instead of FACETS for downstream analyses because there may be some unknown methodological issues in FACETS (in general, the method is still reliable).
Method comparison between Macintyre et al and our study
In the section “Tally variation records”, we stated that our method (defined as “Wang” or “W”) has some advantages over the method used in Macintyre et al. (defined as “Macintyre” or “M”).
We discuss the details here.
Firstly, let's load the results from the “M” method (the code to generate the data can be found here).
load("../output/Sig.CNV.seqz.M.RData")
load("../output/EST.seqz.M.RData")
load("../output/CNV.seqz.tally.M.RData")
Sample-by-component matrix generation
The key difference between method “W” and method “M” is the way to generate the sample-by-component matrix.
- Method “W” defines the components and counts them (details see section “Tally variation records”).
- Method “M” applies mixture modeling to decompose the distributions of the 6 copy number features into mixtures of Poisson or Gaussian distributions.
Here we found that method “M” has the following drawbacks:
- the computation requires substantial computing resources. Generating the matrix used by the NMF algorithm takes about 5100 s (~1.5 h) with 20 cores for method “M”, but only 86 s (~1.5 min) with 20 cores for method “W”.
- the number of components depends on the raw data. If we change the sample size or apply method “M” to another cohort, the number of components may change. Method “W” is not affected by this because it uses a fixed number of components.
- the fitted parameters of each distribution are not identical across runs. Even for the same data, the resulting distributions and the sample-by-component matrix may change.
At the start of this study, we used method “M” for all analyses. Eventually, we created method “W” based on our experience with and rethinking of method “M”, together with the SBS signature extraction methods from previous reports.
In the next section, we discuss more problems raised by method “M” from a visualization point of view.
Copy number signatures from “M” method
We extracted 5 signatures, as with method “W”. Let's examine their signature profiles.
show_sig_profile(Sig.CNV.seqz.M,
mode = "copynumber",
method = "M",
normalize = "column", style = "cosmic",
paint_axis_text = FALSE)
This is a bar plot similar to the one reported by Macintyre et al.
This plot raises some issues.
Issue 1
The matrix used to generate the plot is column-normalized, so the bar heights within a signature cannot be compared with each other.
For example, looking at Sig3 we might conclude that component “bp10MB2” contributes more than component “bp10MB1”. But if we look at the raw matrix, we can see that this is wrong!
We cannot compare two or more components in a row when the data are normalized by column! The plot above may mislead readers. For the plot above, we can only compare the same component across all signatures. For example, we can conclude that Sig5 captures the majority of components “segsize12” and “segsize11”.
show_sig_profile(Sig.CNV.seqz.M,
mode = "copynumber",
method = "M",
normalize = "raw", style = "cosmic",
paint_axis_text = FALSE)
To avoid misleading readers, gradient colors may work. Also, we must tell readers in the text that one should compare a single component within a column instead of two or more components within a row.
Issue 2
From the plot above, we can only see the different components, but we cannot see the meaning of each component. For method “M”, each component represents a distribution.
For example, what is the meaning of the component “copynumber2”? To fix this issue, we tried adding the key parameter values to the plot.
show_sig_profile(Sig.CNV.seqz.M,
mode = "copynumber", method = "M",
normalize = "column", style = "cosmic",
paint_axis_text = FALSE, set_gradient_color = TRUE,
params = CNV.seqz.tally.M$parameters, y_expand = 1.5)
According to the text just added, we can see that component “copynumber2” represents a Gaussian distribution with mean 2, and component “copynumber1” represents a Gaussian distribution with mean 0.84.
Issue 3
Even with the plot above, we cannot know the dispersion of each component. All distributions with their weights for each copy number feature can be viewed below.
Why we created method “W”
Here, we want to say something more.
If we look at all available values for each feature, we notice that there are only limited choices. Let's visualize them.
Naturally, we can divide them into categories and obtain a fixed number of components. This is easy to read, understand and compare. Looking back at this part: why would we choose a complicated method with many drawbacks instead of a simple one?
Some validation analysis in other datasets
We selected 1 prostate cancer dataset with copy number segmentation data from cBioPortal:
- http://www.cbioportal.org/study/summary?id=prad_mskcc_2020 (Primary and metastatic tumors)
The segmentation data were converted into absolute copy number data by the ABSOLUTE method (Carter et al. 2012) with DoAbsolute. Signatures were extracted with the “W” method.
library(sigminer)
library(RColorBrewer)
library(ggplot2)
load("../output/Sig.CNV.mskcc.RData")
# Switch Sig1 and Sig2
Sig.CNV.mskcc = sig_modify_names(Sig.CNV.mskcc, new_names = paste0("Sig", c(2,1,3:5)))
Profile
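The profile chunk is not echoed here; a minimal sketch that plots the MSKCC signatures with the same settings used for the main cohort:
show_sig_profile(Sig.CNV.mskcc,
                 mode = "copynumber",
                 style = "cosmic", method = "W", normalize = "feature") +
  labs(caption = "Signatures from the MSKCC 2020 cohort")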
Similarity matrix
Here we show the results of the similarity calculation between signatures.
## Use the main result as reference
## The BoChr feature is removed because it is variable across profiles
## and affects the judgement
sim_mskcc = get_sig_similarity(Sig.CNV.mskcc, Sig.CNV.seqz.W, normalize = "feature", pattern_to_rm = "BoChr")
rownames(sim_mskcc$similarity) = paste(rownames(sim_mskcc$similarity), "(MSKCC)")
sig.seqz = sig_signature(Sig.CNV.seqz.W, normalize = "feature") %>% as.data.frame()
sig.mskcc = sig_signature(Sig.CNV.mskcc, normalize = "feature")[, c(2,1,3:5)] %>% as.data.frame()
cor.test(sig.seqz$Sig1, sig.mskcc$Sig1)
#>
#> Pearson's product-moment correlation
#>
#> data: sig.seqz$Sig1 and sig.mskcc$Sig1
#> t = 8.6516, df = 78, p-value = 5.122e-13
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.5672875 0.7969625
#> sample estimates:
#> cor
#> 0.6997817
cor.test(sig.seqz$Sig2, sig.mskcc$Sig2)
#>
#> Pearson's product-moment correlation
#>
#> data: sig.seqz$Sig2 and sig.mskcc$Sig2
#> t = 3.3769, df = 78, p-value = 0.001146
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.1491269 0.5348861
#> sample estimates:
#> cor
#> 0.3571418
cor.test(sig.seqz$Sig3, sig.mskcc$Sig3)
#>
#> Pearson's product-moment correlation
#>
#> data: sig.seqz$Sig3 and sig.mskcc$Sig3
#> t = 15.821, df = 78, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.8085723 0.9169635
#> sample estimates:
#> cor
#> 0.8731653
cor.test(sig.seqz$Sig4, sig.mskcc$Sig4)
#>
#> Pearson's product-moment correlation
#>
#> data: sig.seqz$Sig4 and sig.mskcc$Sig4
#> t = 16.486, df = 78, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.8207251 0.9225298
#> sample estimates:
#> cor
#> 0.8814861
cor.test(sig.seqz$Sig5, sig.mskcc$Sig5)
#>
#> Pearson's product-moment correlation
#>
#> data: sig.seqz$Sig5 and sig.mskcc$Sig5
#> t = 30.086, df = 78, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.9374232 0.9739083
#> sample estimates:
#> cor
#> 0.9595117
We further validate this with a heatmap of the signature component weights.
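The heatmap chunk is not echoed; a minimal sketch, assuming pheatmap is installed, that puts the feature-normalized component weights of both results side by side (rows are components, columns are signatures):
# sig.seqz and sig.mskcc were computed above with sig_signature(..., normalize = "feature")
weight_mat = cbind(as.matrix(sig.seqz), as.matrix(sig.mskcc))
colnames(weight_mat) = c(paste0(colnames(sig.seqz), " (Our)"),
                         paste0(colnames(sig.mskcc), " (MSKCC)"))
pheatmap::pheatmap(weight_mat, cluster_rows = FALSE, cluster_cols = FALSE)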
Exposure comparison
expo_ours = get_sig_exposure(Sig.CNV.seqz.W, type = "relative")
expo_mskcc = get_sig_exposure(Sig.CNV.mskcc, type = "relative")
expo = rbind(expo_ours[, Cohort:="Our"],
expo_mskcc[, Cohort:="MSKCC2020"])
expo_df = tidyr::pivot_longer(expo,
cols = dplyr::starts_with("Sig"),
names_prefix = "Sig",
names_to = "Signature", values_to = "Exposure")
library(ggpubr)
p = ggboxplot(expo_df, x = "Signature", y = "Exposure", fill = "Cohort") +
stat_compare_means(aes(group=Cohort), label = "p.format")
p
expo_ours2 = get_sig_exposure(Sig.CNV.seqz.W)
expo_mskcc2 = get_sig_exposure(Sig.CNV.mskcc)
expo2 = rbind(expo_ours2[, Cohort:="Our"],
             expo_mskcc2[, Cohort:="MSKCC2020"])  # use the absolute exposures obtained above
expo_df2 = tidyr::pivot_longer(expo2,
cols = dplyr::starts_with("Sig"),
names_prefix = "Sig",
names_to = "Signature", values_to = "Exposure")
Survival analysis
To better understand the signatures, here we evaluate the association between signature exposure and prognosis.
Firstly, get the sample groups by consensus clustering (the signature exposures were obtained above).
groups = get_groups(Sig.CNV.mskcc, method = "consensus", match_consensus = TRUE)
#>
#> Sig1 Sig2 Sig3 Sig4 Sig5
#> 1 29 7 0 3 1
#> 2 0 76 0 0 0
#> 3 6 20 132 3 1
#> 4 0 15 0 318 0
#> 5 0 24 0 60 607
Then we get survival status, patient IDs, sample IDs etc.
mskcc_patient = readr::read_tsv("../data/mskcc_2020/data_clinical_patient.txt", comment = "#")
mskcc_sample = readr::read_tsv("../data/mskcc_2020/data_clinical_sample.txt", comment = "#")
Merge data before doing survival analysis with ezcox package.
dat_mskcc = expo_mskcc %>%
dplyr::as_tibble() %>%
dplyr::left_join(
mskcc_sample %>% dplyr::select(SAMPLE_ID, PATIENT_ID, SAMPLE_TYPE),
by = c("sample" = "SAMPLE_ID")
) %>%
dplyr::left_join(
mskcc_patient,
by = "PATIENT_ID"
)
Clean the data. We still scale exposures by dividing by 10 (consistent with the scaling used for the previous dataset).
cols_to_sig = paste0("Sig", 1:5)
# range01 <- function(x, ...) {
# (x - min(x, ...)) / (max(x, ...) - min(x, ...))
# }
#
# dat_mskcc2 = dat_mskcc %>%
# dplyr::select(sample:SAMPLE_TYPE, OS_STATUS, OS_MONTHS) %>%
# dplyr::mutate(
# OS_STATUS = ifelse(OS_STATUS == "LIVING", 0, 1)
# ) %>%
# na.omit() %>%
# dplyr::mutate_at(
# cols_to_sig,
# ~ 20 * range01(., na.rm = TRUE)
# )
dat_mskcc2 = dat_mskcc %>%
dplyr::select(sample:SAMPLE_TYPE, OS_STATUS, OS_MONTHS) %>%
dplyr::mutate(
OS_STATUS = ifelse(OS_STATUS == "LIVING", 0, 1)
) %>%
na.omit() %>%
dplyr::mutate_at(
cols_to_sig,
~ ./10
)
Now we do survival analysis.
# Univariable analysis
show_forest(dat_mskcc2, covariates = cols_to_sig,
time = "OS_MONTHS", status = "OS_STATUS", merge_models = TRUE)
Check primary and metastatic samples independently.
show_forest(dat_mskcc2 %>% dplyr::filter(SAMPLE_TYPE == "Primary"), covariates = cols_to_sig,
time = "OS_MONTHS", status = "OS_STATUS", merge_models = TRUE)
show_forest(dat_mskcc2 %>% dplyr::filter(SAMPLE_TYPE != "Primary"), covariates = cols_to_sig,
time = "OS_MONTHS", status = "OS_STATUS", merge_models = TRUE)
library(survival)
library(survminer)
legend_2 <- ggplot2::guides(color = ggplot2::guide_legend(nrow = 2, byrow = TRUE))
fit <- survfit(Surv(OS_MONTHS, OS_STATUS) ~ 1, data = dat_mskcc2)
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)")$plot
fit <- survfit(Surv(OS_MONTHS, OS_STATUS) ~ 1, subset(dat_mskcc2, SAMPLE_TYPE == "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)")$plot
fit <- survfit(Surv(OS_MONTHS, OS_STATUS) ~ 1, subset(dat_mskcc2, SAMPLE_TYPE != "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)")$plot
fit <- survfit(Surv(OS_MONTHS, OS_STATUS) ~ 1, data = dat_mskcc2)
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)")$plot
fit <- survfit(Surv(OS_MONTHS, OS_STATUS) ~ enrich_sig, data = dat_mskcc2)
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)",
pval = TRUE)$plot + legend_2
fit <- survfit(Surv(OS_MONTHS, OS_STATUS) ~ enrich_sig, data = subset(dat_mskcc2, SAMPLE_TYPE == "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)",
pval = TRUE)$plot + legend_2
fit <- survfit(Surv(OS_MONTHS, OS_STATUS) ~ enrich_sig, data = subset(dat_mskcc2, SAMPLE_TYPE != "Primary"))
ggsurvplot(fit, palette = "aaas", xlab = "Time (months)",
pval = TRUE)$plot + legend_2
df.map = dplyr::left_join(
dat_mskcc, groups,
by = "sample"
) %>%
dplyr::select("SAMPLE_TYPE", "enrich_sig") %>%
na.omit()
names(df.map) = c("Sample type", "Enriched signature")
show_group_mapping(df.map,
col_to_flow = "Sample type",
cols_to_map = setdiff(colnames(df.map), "Sample type"),
fill_na = "NA", include_sig = TRUE)
Test the association with Fisher's exact test.
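The test chunk is not echoed; a minimal sketch with base R's Fisher's exact test on the mapping table built above (simulation is used because the contingency table is larger than 2x2):
tab = table(df.map$`Sample type`, df.map$`Enriched signature`)
tab
fisher.test(tab, simulate.p.value = TRUE)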
Reference
Alexandrov, Ludmil B, Serena Nik-Zainal, David C Wedge, Samuel AJR Aparicio, Sam Behjati, Andrew V Biankin, Graham R Bignell, et al. 2013. “Signatures of Mutational Processes in Human Cancer.” Nature 500 (7463). Nature Publishing Group: 415.
Armenia, Joshua, Stephanie AM Wankowicz, David Liu, Jianjiong Gao, Ritika Kundra, Ed Reznik, Walid K Chatila, et al. 2018. “The Long Tail of Oncogenic Drivers in Prostate Cancer.” Nature Genetics 50 (5). Nature Publishing Group: 645.
Carter, Scott L, Kristian Cibulskis, Elena Helman, Aaron McKenna, Hui Shen, Travis Zack, Peter W Laird, et al. 2012. “Absolute Quantification of Somatic DNA Alterations in Human Cancer.” Nature Biotechnology 30 (5). Nature Publishing Group: 413.
Favero, Francesco, Tejal Joshi, Andrea Marion Marquard, Nicolai Juul Birkbak, Marcin Krzystanek, Qiyuan Li, Z Szallasi, and Aron Charles Eklund. 2014. “Sequenza: Allele-Specific Copy Number and Mutation Profiles from Tumor Sequencing Data.” Annals of Oncology 26 (1). European Society for Medical Oncology: 64–70.
Gaujoux, Renaud, and Cathal Seoighe. 2010. “A Flexible R Package for Nonnegative Matrix Factorization.” BMC Bioinformatics 11 (1). BioMed Central: 367.
Krueger, Felix. 2015. “Trim Galore.” A Wrapper Tool Around Cutadapt and FastQC to Consistently Apply Quality and Adapter Trimming to FastQ Files.
Li, Heng. 2013. “Aligning Sequence Reads, Clone Sequences and Assembly Contigs with BWA-MEM.” arXiv Preprint arXiv:1303.3997.
Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. 2009. “The Sequence Alignment/Map Format and SAMtools.” Bioinformatics 25 (16). Oxford University Press: 2078–9.
Macintyre, Geoff, Teodora E Goranova, Dilrini De Silva, Darren Ennis, Anna M Piskorz, Matthew Eldridge, Daoud Sie, et al. 2018. “Copy Number Signatures and Mutational Processes in Ovarian Carcinoma.” Nature Genetics 50 (9). Nature Publishing Group: 1262.
Mayakonda, Anand, and H Phillip Koeffler. 2016. “Maftools: Efficient Analysis, Visualization and Summarization of MAF Files from Large-Scale Cohort Based Cancer Studies.” bioRxiv. Cold Spring Harbor Laboratory, 052662.
McLaren, William, Laurent Gil, Sarah E Hunt, Harpreet Singh Riat, Graham RS Ritchie, Anja Thormann, Paul Flicek, and Fiona Cunningham. 2016. “The Ensembl Variant Effect Predictor.” Genome Biology 17 (1). BioMed Central: 122.
“Picard Toolkit.” 2019. Broad Institute, GitHub Repository. http://broadinstitute.github.io/picard/; Broad Institute.
Shen, Ronglai, and Venkatraman E. Seshan. 2016. “FACETS: Allele-Specific Copy Number and Clonal Heterogeneity Analysis Tool for High-Throughput DNA Sequencing.” Nucleic Acids Research 44 (16): e131–e131. https://doi.org/10.1093/nar/gkw520.
Van der Auwera, Geraldine A, Mauricio O Carneiro, Christopher Hartl, Ryan Poplin, Guillermo Del Angel, Ami Levy-Moonshine, Tadeusz Jordan, et al. 2013. “From FASTQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline.” Current Protocols in Bioinformatics 43 (1). Wiley Online Library: 11–10.
Wang, Shixiang, and Xuesong Liu. 2019. “The UCSCXenaTools R Package: A Toolkit for Accessing Genomics Data from UCSC Xena Platform, from Cancer Multi-Omics to Single-Cell RNA-Seq.” Journal of Open Source Software 4 (40): 1627.
Yi, Kijong, and Young Seok Ju. 2018. “Patterns and Mechanisms of Structural Variations in Human Cancer.” Experimental & Molecular Medicine 50 (8). Nature Publishing Group: 98.