Contact Prof. Xue-Song Liu:

This report is written to help readers understand what and how we did in this project. Please read the formal manuscript TBD for more details.

This document is compiled from an Rmarkdown file which contains all code or description necessary to (auto-)reproduce the analysis for the accompanying project. Each section below describes a different component of the analysis and all numbers and figures are generated directly from the underlying data on compilation.

Data Preprocessing

This part focuses on data preprocessing, it contains the following sections:

Data downloading

This section describes where and how we downloaded the data.

WES data

Whole exome sequencing raw data were downloaded from the following 6 dbGap studies:

  • phs000178: this is TCGA prostate cancer cohort, the raw data is bam format stored in GDC portal, we access them by dbGap permission
  • phs000447: sra data
  • phs000554: sra data
  • phs000909: sra data
  • phs000915: sra data
  • phs001141: sra data

A recent study (Armenia et al. 2018) used 5 dbGap studies:

  • phs000178
  • phs000447
  • phs000909
  • phs000915
  • phs000945

Our study contains two new datasets phs000554 and phs001141 than Armenia et al. (2018), however, phs000945 was excluded from our study due to its unavailable status in dbGap database.

Therefore, to our knowledge, we included all available raw WES data in our study.

After selecting the raw WES data, we downloaded them by either sratools for dbGap data or gdc-client for TCGA data.

Phenotype data

Phenotype data of phs000447, phs000554, phs000909, phs000915 and phs001141 were downloaded along with the raw WES data. Phenotype data of TCGA prostate cancer cohort were downloaded from UCSC Xena by UCSCXenaTools (Wang and Liu 2019). The cleaning process is described in ‘Data cleaning’ section.

Survival data

Survival data of phs000554 were included in phenotype data described above. Survival data of TCGA prostate cancer cohort were downloaded from UCSC Xena by UCSCXenaTools (Wang and Liu 2019). The cleaning process is described in ‘Data cleaning’ section.

Data preprocessing

This section describes how we preprocessed the data.

Pipeline

The following diagram shows our upstream workflow. Steps before signature identification are belong to preprocessing. The preprocessing pipeline generates the variation records of samples.

Genomic variation signature identification pipeline of prostate cancer

Sequence alignment

For cohorts including phs000447, phs000554, phs000909, phs000915 and phs001141, the raw data are in sra format. We did sequence alignment to them. To keep in line with TCGA bam data, we used the same reference genome (downloaded from GDC portal) and operations (see GDC docs).

Of note, the preprocessing step was done on Linux server provided by High Performance Computing Service of ShanghaiTech University. So the work is presented in PBS scripts. If readers want to reproduce this work, please focus on the key code lines.

The details are described as below.

7. BQSR

A base quality score recalibration (BQSR) step was then performed.

#PBS -N bqsr_<sample1>
#PBS -l nodes=1:ppn=4
#PBS -l walltime=30:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe

source activate wes
cd /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/markqump

sample=<sample1>
dbsnp=/public/home/liuxs/biodata/reference/genome/hg38/ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz
dbsnp1000G=/public/home/liuxs/biodata/reference/genome/hg38/ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz 
dbindel100G=/public/home/liuxs/biodata/reference/genome/hg38/ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

if [ -e $sample.rdup.bam ]; then
   java -jar -Xmx12G -Djava.io.tmpdir=/public/home/liuxs/lhm/tmp  /public/home/liuxs/anaconda3/envs/wes/share/gatk4-4.1.3.0-0/gatk-package-4.1.3.0-local.jar BaseRecalibrator \
    -R /public/home/liuxs/biodata/reference/genome/gdc_hg38/GRCh38.d1.vd1.fa \
    -I ${sample}.rdup.bam \
    -O /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/table/${sample}.recal_data.table \
    --known-sites $dbsnp --known-sites $dbsnp1000G --known-sites $dbindel100G && \
   java -jar -Xmx12G -Djava.io.tmpdir=/public/home/liuxs/lhm/tmp   /public/home/liuxs/anaconda3/envs/wes/share/gatk4-4.1.3.0-0/gatk-package-4.1.3.0-local.jar  ApplyBQSR  \
    -R /public/home/liuxs/biodata/reference/genome/gdc_hg38/GRCh38.d1.vd1.fa \
    -I ${sample}.rdup.bam \
    --bqsr-recal-file /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/table/${sample}.recal_data.table \
    -O /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/bqsrbam/${sample}.sorted.marked.BQSR.bam 
    echo gatk-BQSR `date`
fi

The result bam files were then used for copy number calling and mutation calling.

Copy number calling

The absolute copy number profile for each sample was detected by two well known softwares FACETS (Shen and Seshan 2016) and Sequenza (Favero et al. 2014).

The code to generate tumor-normal pair for FACETS is here, the code to generate tumor-normal pair for Sequenza is here.

FACETS

We followed the standard pipeline of FACETS described in vignette to call absolute copy number.

1. SNP pileup

This step followed the https://github.com/mskcc/facets/blob/master/inst/extcode/README.txt.

For dbGap studies:

For TCGA:

Sequenza

We followed the standard pipeline of Sequenza described in vignette to call absolute copy number. A modified copynumber package was used to work with hg38 genome build.

Mutation calling

Somatic mutations were detected by following GATK best practice with Mutect2 (Van der Auwera et al. 2013).

1. call somatic SNVs and short INDELs

#PBS -N call-<sample>
#PBS -l nodes=1:ppn=7
#PBS -l walltime=30:00:00
#PBS -S /bin/bash
#PBS -q normal_8
#PBS -j oe

source activate wes
gene=/public/home/liuxs/biodata/reference/genome/gdc_hg38
normal=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/bqsrbam
tumor=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/bqsrbam
path=/public/home/liuxs/ncbi/dbGaP-21926/dnaseq/BQSR/bqsrbam
ref=/public/home/liuxs/biodata/reference/genome/hg38/test/ftp.broadinstitute.org/bundle/Mutect2
out=/public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/vcf
inter=/public/home/liuxs/ncbi/dbGaP-16533/run/gatk/mutect
out2=/public/home/liuxs/ncbi/dbGaP-21926/dnaseq/BQSR/vcf
name=<sample1>
if [ -e /public/home/liuxs/ncbi/dbGaP-16533/dnaseq/BQSR/bqsrbam/$name.sorted.marked.BQSR.bam ]; 
then 
gatk --java-options "-Xmx20G -Djava.io.tmpdir=/public/home/liuxs/lhm/tmp"  Mutect2 \
 -R $gene/GRCh38.d1.vd1.fa \
 -L $inter/Homo_sapiens_assembly38.targets.interval_list \
 -I $tumor/<sample1>.sorted.marked.BQSR.bam \
 -tumor <sample1> \
 -I $normal/<sample2>.sorted.marked.BQSR.bam  \
 -normal <sample2>\
 --germline-resource $ref/af-only-gnomad.hg38.vcf.gz \
 --panel-of-normals /public/home/liuxs/biodata/reference/genome/hg38/test/mu_call_ref/ref/ref/test/1000g_pon.hg38.vcf.gz \
 --af-of-alleles-not-in-resource 0.0000025 \
 --disable-read-filter  MateOnSameContigOrNoMappedMateReadFilter \
 --bam-output $out/bam/<sample>.bam \
 --f1r2-tar-gz $out/fir2/<sample>.tar.gz \
 -O $out/<sample>.vcf
else
gatk --java-options "-Xmx20G -Djava.io.tmpdir=/public/home/liuxs/lhm/tmp"  Mutect2 \
   -R $gene/GRCh38.d1.vd1.fa \
   -L $inter/Homo_sapiens_assembly38.targets.interval_list \
   -I $path/<sample1>.sorted.marked.BQSR.bam \
   -tumor <sample1> \
   -I $path/<sample2>.sorted.marked.BQSR.bam  \
   -normal <sample2>\
   --germline-resource $ref/af-only-gnomad.hg38.vcf.gz \
   --panel-of-normals /public/home/liuxs/biodata/reference/genome/hg38/test/mu_call_ref/ref/ref/test/1000g_pon.hg38.vcf.gz \
   --af-of-alleles-not-in-resource 0.0000025 \
   --disable-read-filter  MateOnSameContigOrNoMappedMateReadFilter \
   --bam-output $out2/bam/<sample>.bam \
   --f1r2-tar-gz $out2/fir2/<sample>.tar.gz \
   -O $out2/<sample>.vcf
fi

2. estimate cross-sample contamination using GetPileupSummaries and CalculateContamination.

Data cleaning

This section describes how we cleaned the data for downstream analysis.

Phenotype data

2 info tables of SRA runs downloaded from dbGap are stored in here. Phenotype data of dbGap studies are stored in here. Phenotype data of some cohorts were also downloaded from original articles and cleaned by hand. We then carefully checked and compared all available data, extracted key info and generated tidy datasets including sample-pair data used for variation calling and phenotype data used for downstream analysis.

The whole procedure is recorded in 01-tidy_mapping.R, 02-tidy_clinical_data.R and 03-combine_and_clean_IDs.R.

Survival data

We merged two available clinical datasets from TCGA and phs000554 with R script.

Variation profile

Copy number calling and mutation calling all generated per-sample results. We go further merged them into one file, respectively.

For results from Sequenza, we merged them with R script 00-generate-samplefile-for-seqz-cnv-calling.R.

For results from FACETS, we merged them with R script 00-generate-samplefile-for-facets-cnv-calling.R.

For results from Mutect2, we merged VCFs into MAF with tool vcf2maf. TODO: more detail code.

Signature Identification

This part has been divided into 4 sections as the following, it describes how to get the genomic variation (or instability) signatures from mutation profile (including copy number profile and single base substitution (SBS) profile).

All the processes have been implemented in R package sigminer.

Tally variation components

Methods

For SBS profile, same as previously reported (Alexandrov et al. 2013), for each sample, we firstly classified mutation records into six substitution subtypes: C>A, C>G, C>T, T>A, T>C, and T>G (all substitutions are referred to by the pyrimidine of the mutated Watson–Crick base pair). Further, each of the substitutions was examined by incorporating information on the bases immediately 5’ and 3’ to each mutated base generating 96 possible mutation types (6 types of substitution ∗ 4 types of 5’ base ∗ 4 types of 3’ base). Each of 96 mutation types is called component here.

For copy number profile, we firstly computed the genome-wide distributions of 8 fundamental copy number features for each sample:

  • the breakpoint count per 10 Mb (named BP10MB)
  • the breakpoint count per chromosome arm (named BPArm)
  • the copy number of the segments (named CN)
  • the difference in copy number between adjacent segments (named CNCP)
  • the lengths of oscillating copy number segment chains (named OsCN)
  • the log10 based size of segments (named SS)
  • the minimal number of chromosome with 50% copy number variation (named NC50)
  • the burden of chromosome (named BoChr)

These features were selected as hallmarks of previously reported genomic aberations like chromothripsis or to denote the distribution pattern of copy number events. The former 6 features have been used in Macintyre et al. (2018) to uncover the mutational processes in ovarian carcinoma.

Next, unlike Macintyre et al. (2018) applied mixture modeling to separate the first 6 copy number features distributions into mixtures of Poisson or Gaussian distributions, we directly classified 8 copy number features distributions into different components according to the comprehensive consideration of value range, abundance and biological significance. Most of the result are discrete values, and the others are range values.

The setting of 8 features with 80 components are shown as below.

When component is a discrete value (label is ‘point’), min=max.

When component is a range value (label is ‘range’), the range is left open and right closed.

Of note, the blank in min column is -Inf and the blank in max column is Inf. [23] in BoChr represents chromosome X (chromosome Y is excluded).

Compare to the method from Macintyre et al. (2018), our method has several advantages:

  1. the computation is more efficient.
  2. the meaning of component is easier to read and understand.
  3. the new features NC50 and BoChr can be used to determine the distribution pattern of copy number events (global or local). and the contribution of each chromosome.
  4. most importantly, the classification is fixed, so it is much easier to compare the signatures within/across tumor types, the results across different studies and the result signatures to reference signatures. In a word, our method constructs the standard to extract, study and compare copy number signatures.
  5. our method is more extensible.

I talk more about No.1-4 in an individual section (“Method comparison between Macintyre et al and our study”), the No.5 will be described in sigminer vignette(TBD).

We will generate sample-by-component matrix using the methods above and treate the result as the input of nonnegative matrix decomposition (NMF) algorithm for extracting signatures.

Tally components

Estimate signature number

Here, r package NMF (Gaujoux and Seoighe 2010) is used for running NMF algorithm Factorization rank r in NMF defines the number of signatures used to approximate the target sample-by-component matrix. A common way for deciding on r is to try different values, compute some quality measure of the results, and choose the best value according to this quality criteria.

As suggested, performing 30-50 runs is considered sufficient to get a robust estimation of the r value. Here we perform 50 runs.

Result

Let’s load the estimated results.

The most common approach is to use the cophenetic correlation coefficient. I show the measure vs. signature number as below.

The best possible signature number value is the one at which the cophenetic correlation value on the y-axis drops significantly (Gaujoux and Seoighe 2010). Here we can take that the value 5 is the optimal signature number. For 3, it is small to produce the result with biological significance due to the hugo contribution of CNV in prostate cancer; for 8, it is a little bigger, so it may be not easy to understand the result considering currently we know few about the mechanism and consequence of copy number events.

Next we take a look at SBS.

For SBS, we select 3. We have the knowledge that 3 mutational signatures from TCGA prostate cancers are recorded in COSMIC database (COSMIC 1, 5, 6), and the cophenetic has a sharp decrease after 3 indicating that substantially less stability is achieved using more than 3 clusters.

Show signature profile and aetiology

Now that we have the extracted signatures, we can show the signature profile.

SBS signature profile

SBS signatures are displayed based on the observed component frequency of the human genome, i.e., representing the relative proportions of mutations generated by each signature based on the actual trinucleotide frequencies of the reference human genome.

The aetiology of SBS signatures is more clear than the aetiology of copy number signatures. We can find aetiologies of the 3 signatures above by computing their cosine similarity to COSMIC reference signatures.

We compare the 3 signature to COSMIC signature database v2 and v3.

From the results above, basically we can confirm that the aetiology of Sig2 is dMMR (defects in mismatch repair) and the aetiology of Sig3 is aging. The aetiology of Sig1 may be HRD (Defective homologous recombination-based DNA damage repair) or an unknown reason.

The exposure of signatures in each sample is plotted as the below.

Set a cutoff 2000 to remove some outliers.

The Est_Counts panel shows the estimated SBS count.

We can clearly see that

  • most of samples have very few SBS signature exposures.
  • Sig1 and Sig3 are the major resources.

Copy number signature profile

Similar to SBS signatures, copy number signatures are displayed based on the observed component frequency of the human genome. Of note, considering the count process of each feature is relatively independent, the profile is row normalized by each feature, unlike Macintyre et al. (2018) did column normalization (this method is easy to mislead readers), so the bar height can be compared within/between features.

There is no reference database for copy number signatures, we cannot get aetiologies of the 5 signatures by similarity computation. By reading the description of previous reports (Macintyre et al. 2018; Yi and Ju 2018) and studying the signature profile carefully, we determine or limite aetiologies of the 5 signatures with also the downstream analyses supported. Please see our manuscript (TODO) for details.

The exposure of signatures in each sample is plotted as the below.

The Est_Counts panel shows the estimated copy number segment count.

Check the signature exposure

Signature exposure is the estimation of mutation calling records. Here we check this by correlation analysis.

Association Analysis

This part has been divided into 2 sections: data integration and association analysis.

Data integration

In this section, I integrate all genotype/phenotype data as a tidy data table used for downstream analyses.

Load packages and prepared data.

After running NMF, we can get robust clusters from consensus matrix, more see ?NMF::predict.

I also create functions get_sig_exposure() to get the absolute/relative exposure of signatures and scoring() to get scores of typical copy number features. More run ?sigminer::get_sig_exposure and ?sigminer::scoring in your R console or see the package manual.

Relative exposure is more useful for clustering and absolute exposure is more useful for association analysis.

We find that the group 1 is assigned to Sig3 due to sample with Sig3 dominant has the maximum fraction in group 1. However, group 1 is the only group with Sig1 enriched. So we modify this result.

Next, we process mutation data to get TMB, driver info, clusters, and etc (detail of some step has been described in R script of the repo).

# Processing mutation data ------------------------------------------------

load(file = "../output/PRAD_TCGA_plus_dbGap_Maf.RData")
load(file = "../output/Sig.PRAD_TCGA_plus_dbGap_Maf.RData")

Maf_samp = data.table::data.table(
  Tumor_Sample_Barcode = unique(rbind(Maf@data, Maf@maf.silent)$Tumor_Sample_Barcode)
)

TMBInfo <- Maf@variant.type.summary
TMBInfo$n_INDEL = TMBInfo$DEL + TMBInfo$INS
TMBInfo = TMBInfo[, .(Tumor_Sample_Barcode, n_INDEL, SNP)]
colnames(TMBInfo)[3] = "n_SNV"
TMBInfo = merge(Maf_samp, TMBInfo, by = "Tumor_Sample_Barcode", all = TRUE)
# Fill NAs
TMBInfo = TMBInfo %>%
  dtplyr::lazy_dt() %>%
  dplyr::mutate_at(vars(dplyr::starts_with("n_")), ~ifelse(is.na(.), 0, .)) %>%
  data.table::as.data.table()

load(file = "../output/PRAD_driver_info.RData")
load(file = "../output/PRAD_heter_info.RData")

SNVGroupInfo <- get_groups(Sig.SNV, method = "consensus", match_consensus = TRUE)
#> ℹ [2020-06-04 14:43:40]: Started.
#> ✓ [2020-06-04 14:43:40]: 'Signature' object detected.
#> ℹ [2020-06-04 14:43:40]: Obtaining clusters from the hierarchical clustering of the consensus matrix...
#> ℹ [2020-06-04 14:43:43]: Finding the dominant signature of each group...
#> => Generating a table of group and dominant signature:
#>    
#>     Sig1 Sig2 Sig3
#>   1   16   34   21
#>   2   30    0  193
#>   3  677    0    0
#> => Assigning a group to a signature with the maxium fraction (stored in 'map_table' attr)...
#> ℹ [2020-06-04 14:43:44]: Summarizing...
#>  group #1: 71 samples with Sig2 enriched.
#>  group #2: 223 samples with Sig3 enriched.
#>  group #3: 677 samples with Sig1 enriched.
#> ! [2020-06-04 14:43:44]: The 'enrich_sig' column is set to dominant signature in one group, please check and make it consistent with biological meaning (correct it by hand if necessary).
#> ℹ [2020-06-04 14:43:44]: 3.223 secs elapsed.
SNVExposureInfo <- get_sig_exposure(Sig.SNV)

# Processing gene and pathway mutation ------------------------------------

load(file = "../output/PRAD_gene_and_pathway_mutation.RData")

Next, we keep keys of all data.frames are same and merge them into one.

# Merge data --------------------------------------------------------------
Info <- Info %>%
  dplyr::mutate(
    CNV_ID = dplyr::case_when(
      !startsWith(tumor_Run, "TCGA") & !is.na(tumor_Run) ~ paste(subject_id, tumor_Run, sep = "-"),
      startsWith(tumor_Run, "TCGA") & !is.na(tumor_Run) ~ tumor_Run,
      TRUE ~ NA_character_
    )
  )

PurityInfo
#> # A tibble: 938 x 3
#>    purity ploidy sample             
#>     <dbl>  <dbl> <chr>              
#>  1   0.59    1.9 00-000450-SRR474658
#>  2   0.55    1.9 00-1165-SRR461844  
#>  3   0.75    1.9 00-160-SRR471373   
#>  4   0.72    2   00-1823-SRR471613  
#>  5   0.73    2   01-1934-SRR471793  
#>  6   0.4     1.9 01-2382-SRR471433  
#>  7   0.45    4.1 01-2492-SRR471733  
#>  8   0.73    2   01-2554-SRR471673  
#>  9   0.22    3.4 01-28-SRR471073    
#> 10   0.35    2   02-1082-SRR472616  
#> # … with 928 more rows
colnames(CNVGroupInfo) <- c("sample", "cnv_group", "cnv_weight", "cnv_enrich_sig")
CNVInfo
#>                  sample n_of_cnv n_of_amp n_of_del n_of_vchr cna_burden
#>   1:    TCGA-HI-7169-01        2        2        0         0      0.000
#>   2:    TCGA-CH-5743-01        3        2        1         1      0.000
#>   3:    TCGA-EJ-7791-01        3        3        0         1      0.000
#>   4:    TCGA-FC-A66V-01        3        3        0         1      0.000
#>   5:    TCGA-J4-AATV-01        3        3        0         0      0.000
#>  ---                                                                   
#> 933: 5115609-SRR8311697      303      291       12        22      0.856
#> 934: 5115412-SRR8311627      356      354        2        22      0.948
#> 935: 5115406-SRR8312068      367      337       30        22      0.297
#> 936:      36-SRR5876749      393      392        1        22      0.926
#> 937: 5115615-SRR8311749      420      406       14        22      0.360

colnames(CNVExposureInfo) <- c("sample", paste0("CN-", colnames(CNVExposureInfo)[-1]))
colnames(TMBInfo)[1] <- "sample"
TMBInfo[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
                                                             substr(sample, 1, 15),
                                                             sample
)]

colnames(DriverDF)[1] = "sample"
DriverDF[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
                                                              substr(sample, 1, 15),
                                                              sample
)]
colnames(SNVGroupInfo) <- c("sample", "snv_group", "snv_weight", "snv_enrich_sig")
SNVGroupInfo[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
                                                                  substr(sample, 1, 15),
                                                                  sample
)]
colnames(SNVExposureInfo) <- c("sample", paste0("SBS-", colnames(SNVExposureInfo)[-1]))
SNVExposureInfo[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
                                                                     substr(sample, 1, 15),
                                                                     sample
)]
colnames(TitvInfo) <- c("sample", "Ti_fraction", "Tv_fraction")
TitvInfo[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
                                                              substr(sample, 1, 15),
                                                              sample
)]
colnames(MathDF) <- c("sample", "MATH")
MathDF[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
                                                            substr(sample, 1, 15),
                                                            sample
)]
colnames(ClusterDF) <- c("sample", "n_mutation_cluster")
ClusterDF[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
                                                               substr(sample, 1, 15),
                                                               sample
)]

# data.table::setDT(summary_mutation)
# data.table::setDT(summary_pathway)

colnames(summary_mutation)[1] <- "sample"
colnames(summary_pathway)[1] <- "sample"
summary_mutation[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
                                                                      substr(sample, 1, 15),
                                                                      sample
)]
summary_pathway[, sample := as.character(sample)][, sample := ifelse(startsWith(sample, "TCGA"),
                                                                     substr(sample, 1, 15),
                                                                     sample
)]


colnames(CNVInfo)[2:4] = c("n_CNV", "n_Amp", "n_Del")

MergeInfo <- Info[which(!is.na(Info$tumor_Run)), ] %>%
  left_join(CNVInfo[, .(sample, n_CNV, n_Amp, n_Del)], by = c("CNV_ID" = "sample")) %>%
  left_join(CNVscores, by = c("CNV_ID" = "sample")) %>%
  left_join(CNVGroupInfo, by = c("CNV_ID" = "sample")) %>%
  left_join(CNVExposureInfo, by = c("CNV_ID" = "sample")) %>%
  left_join(PurityInfo, by = c("CNV_ID" = "sample")) %>%
  left_join(summary_mutation, by = c("tumor_Run" = "sample")) %>%
  left_join(summary_pathway, by = c("tumor_Run" = "sample")) %>%
  left_join(TMBInfo, by = c("tumor_Run" = "sample")) %>%
  left_join(DriverDF, by = c("tumor_Run" = "sample")) %>%
  dplyr::mutate(
    n_driver = ifelse(!is.na(n_driver), n_driver, 0)
  ) %>%
  left_join(TitvInfo, by = c("tumor_Run" = "sample")) %>%
  left_join(MathDF, by = c("tumor_Run" = "sample")) %>%
  left_join(ClusterDF, by = c("tumor_Run" = "sample")) %>%
  left_join(SNVGroupInfo, by = c("tumor_Run" = "sample")) %>%
  left_join(SNVExposureInfo, by = c("tumor_Run" = "sample")) %>%
  mutate(
    Stage = factor(Stage, ordered = TRUE),
    Fusion = ifelse(Fusion == "Negative", "No", "Yes"),
    sample_type = ifelse(sample_type == "Unknown", "Metastatic", sample_type), # phs000909 are metastatic samples, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000909.v1.p1
    HasFusion = Fusion,
    HasFusion = ifelse(HasFusion == "Yes", TRUE, FALSE),
    IsMetastatic = ifelse(sample_type == "Metastatic", TRUE, FALSE)
  ) %>% 
  select(-sTDP, -lTDP, -sTDP_size, -lTDP_size, -TDP_pnas)


MergeInfo$TDP = MergeInfo$TDP * MergeInfo$TDP_size

saveRDS(MergeInfo, file = "../output/PRAD_Merge_Info_CNV_from_sequenza_update.RDS")

Association analysis

Now we do association analysis between signatures (via exposure) and genotypes/phenotypes.

Signatures & features

Analyze the association between signatures and features.

Association of signature exposures with other features was performed using one of two procedures: for a continuous association variable (including ordinal variable), pearson correaltion was performed; for a binary association variable, samples were divided into two groups and Mann-Whitney U-test was performed to test for differences in signature exposure medians between the two groups.

Show the result.

All shown circles are statistically significant results (FDR < 0.05). A feature will be chopped off from the plot if has no correlation with any signatures. Same for the similar plots below.

The result table:

Signatures & mutated genes

Analyze the association between signatures and mutated genes (only driver genes identified by MutSig).

The result table:

Signatures & mutated pathways

Analyze the association between signatures and mutated pathways.

The result table:

Signatures & PSA

In all datasets we collected, 3 of them have PSA info. However, the range of PSA value in 3 datasets differs, so I excluded PSA from features above and analyze it independently.

Alought we can observe the statistically significant correlation between signatures and PSA, the results are inconsistent across different studies. This may be explained by the dynamic property of PSA while not the sample type due to most of them are primary.

Group Analysis

This part has been divided into four parts to show the differences across sample groups defined by signature exposure. It provides insight into the heterogeneity of prostate cancers and relative importance of two types of genomic variation signatures.

Load packages.

Load the data.

Group comparison

What if there are 5 SBS signatures and samples are divided into 5 groups?

Copy number profile of different groups

There are many differences across copy number groups. Here we go further showing some classic cases by copy number profile.

There are 5 copy number groups dominated by 4 copy number signatures.

Get the groups and relative exposures and merge them.

Now we can use data to find the cases with a specific signature enriched.

Survival Analysis

To check the clinical relavance of signatures, this part we do survival analysis with Cox model.

Prepare

Load package ezcox for batch Cox analysis.

Load the data.

Clean the data to generate a data table for survival analysis.

range01 <- function(x, ...) {
  (x - min(x, ...)) / (max(x, ...) - min(x, ...))
}

cols_to_sigs.seqz <- c(paste0("CN-Sig", 1:5), paste0("SBS-Sig", 1:3))


surv_dt <- dplyr::inner_join(df.seqz %>%
                               dplyr::select(-c("Study", "subject_id", "tumor_body_site",
                                                "tumor_Run", "normal_Run", "CNV_ID",
                                                "Fusion")),
                             surv_df, by = c("PatientID" = "sample"))

dup_ids = which(duplicated(surv_dt$PatientID))
# Remove duplicated records
surv_dt = surv_dt[-dup_ids, ]

# Scale the signature exposure by dividing 10
# Scale some variables to 0-20.
surv_dt = surv_dt %>%
  dplyr::select(-Tv_fraction) %>% 
  dplyr::mutate(cnaBurden = 20 * cnaBurden,
                Stage = as.character(Stage) %>% factor(levels = c("T2", "T3", "T4"))) %>%
  dplyr::mutate_at(cols_to_sigs.seqz, ~./10) %>% 
  dplyr::mutate_at(c("Ti_fraction", "purity"), ~ 20 * range01(., na.rm = TRUE)) %>% 
  dplyr::mutate(OS.time = OS.time / 30.5,
                PFI.time = PFI.time / 30.5)


# Scale the signature exposure to 0-20.
# To better understand the effect of signature exposure on clinical events,
# here we scale them into range 0-20. 1 increase means 5% increase of signature exposure.
#
# Scale the CNA burden to 0-20.
#
# keep only ti_fraction, tv result will be 1/ti_result
# also scale it to 0-20
# also scale purity to 0-20
# surv_dt = surv_dt %>%
#   dplyr::select(-Tv_fraction) %>% 
#   dplyr::mutate(cnaBurden = 20 * cnaBurden,
#                 Stage = as.character(Stage) %>% factor(levels = c("T2", "T3", "T4"))) %>%
#   dplyr::mutate_at(c(cols_to_sigs.seqz, "Ti_fraction", "purity"), ~ 20 * range01(., na.rm = TRUE))

saveRDS(surv_dt, file = "../output/PRAD_merged_survival_dt_seqz.rds")

K-M curves

No PFI data for metastatic tumors.

Variation Summary

This part discribes some big pictures of genomic variations in prostate cancers. They are not directly related to the core analyses previously reported.

Oncoplot

A simple oncoplot.

Refernce Maftools vignette: http://bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/oncoplots.html

To get a comprehensive and meaningful oncoplot (not affected by passagers), here we include GISTIC2, MutSig and MAF information together.

Before plotting, I will go to modify a result file from GISTIC2, this file uses a different sample ID system as MAF, so we must keep them same.

This plot seems not right, talk with author of maftools…

Remove GISTIC2 data from oncoplot.

Sample summary

Here we summarize variations for each sample with table.

Copy number data

MAF data

Summary plots

MAF

Somatic Interactions

#>       gene1  gene2       pValue  oddsRatio  00 11 01 10              Event
#>   1: ABCA13  KMT2C 5.107880e-08 10.2128843 896 13 42 27       Co_Occurence
#>   2:   RYR3 ABCA13 1.510688e-05  9.8852463 915  8 32 23       Co_Occurence
#>   3:  OBSCN    TTN 2.087377e-05  4.7803808 841 15 94 28       Co_Occurence
#>   4:  KMT2C  KMT2D 3.687480e-05  5.3043820 877 12 46 43       Co_Occurence
#>   5:  FOXA1  CDK12 4.320195e-05  5.8603172 873 11 23 71       Co_Occurence
#>  ---                                                                      
#>               pair event_ratio
#>   1: ABCA13, KMT2C       13/69
#>   2:  ABCA13, RYR3        8/55
#>   3:    OBSCN, TTN      15/122
#>   4:  KMT2C, KMT2D       12/89
#>   5:  CDK12, FOXA1       11/94
#>  ---                          
#>  [ reached getOption("max.print") -- omitted 5 rows ]

Copy number

Copy number segment distribution

The distribution can be viewed as a histgram of segment length.

Length summary is:

If we treat SCNA with length > 0.7 as chromosome arm level or whole chromosome level alterations.

The fraction is:

The distribution can also be viewed from the alteration load across chromosomes.

Component distribution

Data details can be viewed as a table:

n_obs column indicates the number of event of corresponding copy number component.

Supplementary Analysis and Visualization

This part shows some supplementary analyses and descriptions.

FACETS results

We used 2 methods (Sequenza and FACETS) for copy number calling. However, only Sequenza was used for previous downstream analyses due to the fact that FACETS generated some unreasonable copy number profile for a few samples. We will talk it here.

Let’s load the packages and signature data from FACETS and Sequenza.

According to the estimation of signature number, 6 signatures were extracted.

The result signatures are shown as the below.

FACETS result has one special and extra signature (Sig1) when compared to Sequenza result. The Sig1 shows copy number segments with high percentage of component “copy number value = 0”.

Other signatures are not identical but very similar in most features beteween FACETS and Sequenza results.

We check some samples with Sig1 enriched.

We can observe that the samples with Sig1 enriched have many homozygous deletions, this is unreasonable. I made a communication with the author of FACETS (issue link),

So only logical explanation is homozygous deletion. However they are too many of them and so I believe this is assay artifact rather than real.

This problem seems to come from the raw data. However, we checked the data and this signature cannot be found from the result of Sequenza.

To summary, we used result from Sequenza instead of FACETS for downstream analyses because we found there may be some unknown method problems in FACETS (in general, this method is still reliable).

Method comparison between Macintyre et al and our study

In section “Tally variation records”, I said that our method (defined as “Wang” or “W”) has some advantages over the method used in Macintyre et al (defined as “Macintyre” or “M”).

I will talk the details here.

Firsty let’s load the results from “M” method (the code to generate the data can be found here).

Sample-by-component matrix generation

The key difference between method “W” and method “M” is the way to generate the sample-by-component matrix.

  • Method “W” defines the components and counts them (details see section “Tally variation records”).
  • Method “M” applies mixture modeling to separate the 6 copy number features distributions into mixtures of Poisson or Gaussian distributions.

Here we found that method “M” has the following drawbacks:

  1. the computation needs huge computing resources. For generating the matrix used for NMF algorithm, it takes about 5100s (~1.5h) with 20 cores for method “M” while only 86s (~1.5m) with 20 cores for method “W”.
  2. the number of component is dependent on the raw data. If we change the sample size or apply the method “M” to another cohort, the number of component may change. Method “W” will not be affected by this because it uses a fixed number of component.
  3. the distribution parameters for each distribution are not identical for different runs. Even for the same data, the result distributions and sample-by-component matrix may change.

At the start of this study, I used the method “M” for all analyses. Finally, we created method “W” based on the use and rethinking of method “M” and SBS signature extraction method from previous reports.

Next section, I will talk more problems raised by method “M” from the visualization view.

Copy number signatures from “M” method

We extracted 5 signatures like method “W”. Let’s see their signature profile.

This is a bar plot similar to the one reported by Macintyre et al.

This plot raises some issues.

Issue 1

The matrix used to generate the plot is column normalized, so the heights of bar cannot be compared with each other.

For example, firstly we see the Sig3, we may conclude that component “bp10MB2” has more contributions than component “bp10MB1”. But if we see the raw matrix, we can see that this is wrong!

We cannot compare two or more componets in a row when we normalize the data by column! The plot above may misguide the readers. For the plot above, we can only compare the same component across all signatures. For example, we can conclude that Sig5 captures the majority of component “segsize12” and “segsize11”.

To avoid misguiding readers, gradient colors may work. Also, we must tell readers in text that we should compare one component in a column insteads two or more componets in a row.

Issue 2

From plot above, we can only see different components, but we cannot see the meaning of each component. For method “M”, each component represents a distribution.

For example, what’s the meaning of the component “copynumber2”? To fix this issue, we tried adding the key parameter value in the plot.

According to the text just added, we can see that component “copynumber2” represents a Gaussian distribution with mean is 2; component “copynumber1” represents a Gaussian distribution with mean is 0.84.

Issue 3

Even with the plot above, we cannot know the dispersion of each component. All distributions with weight for each copy number feature can be viewed as the below.

Why we created method “W”

Here, I want to talk something more.

If we take case of all available values for each features, we would notice that there are only limited choices. Let’s visualize them.

Naturally we can divide them into categories and get a fixed number of components. It would be easy to read, understand and compare. Look back this part, why we choose a complicated method with many drawbacks instead of a simple one?

Some validation analysis in other datasets

We selected 1 PC datasets with copy number segmentation data from cBioPortal:

The segmentation data were converted into absolute copy number data by ABSOLUTE method (Carter et al. 2012) with DoAbsolute. Signatures were extracted with “W” method.

Similarity matrix

Here show results of similarity calculation signatures.

sig.seqz = sig_signature(Sig.CNV.seqz.W, normalize = "feature") %>% as.data.frame()
sig.mskcc = sig_signature(Sig.CNV.mskcc, normalize = "feature")[, c(2,1,3:5)] %>% as.data.frame()


cor.test(sig.seqz$Sig1, sig.mskcc$Sig1)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  sig.seqz$Sig1 and sig.mskcc$Sig1
#> t = 8.6516, df = 78, p-value = 5.122e-13
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.5672875 0.7969625
#> sample estimates:
#>       cor 
#> 0.6997817
cor.test(sig.seqz$Sig2, sig.mskcc$Sig2)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  sig.seqz$Sig2 and sig.mskcc$Sig2
#> t = 3.3769, df = 78, p-value = 0.001146
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.1491269 0.5348861
#> sample estimates:
#>       cor 
#> 0.3571418
cor.test(sig.seqz$Sig3, sig.mskcc$Sig3)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  sig.seqz$Sig3 and sig.mskcc$Sig3
#> t = 15.821, df = 78, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.8085723 0.9169635
#> sample estimates:
#>       cor 
#> 0.8731653
cor.test(sig.seqz$Sig4, sig.mskcc$Sig4)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  sig.seqz$Sig4 and sig.mskcc$Sig4
#> t = 16.486, df = 78, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.8207251 0.9225298
#> sample estimates:
#>       cor 
#> 0.8814861
cor.test(sig.seqz$Sig5, sig.mskcc$Sig5)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  sig.seqz$Sig5 and sig.mskcc$Sig5
#> t = 30.086, df = 78, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.9374232 0.9739083
#> sample estimates:
#>       cor 
#> 0.9595117

Further validate with heatmap of signature component weight.

Survival analysis

To better understand the signatures, here we evaluate the association between signature exposure and prognosis.

Firstly, get the signature exposure.

Then we get survival status, patient IDs, sample IDs etc.

Merge data before doing survival analysis with ezcox package.

Clean. Still scale exposure by dividing 10. (Keep consistent with scaling method for previous dataset.)

Now we do survival analysis.

Check primary and metastatic samples independently.

Test the correlation with fisher test.

Reference

Alexandrov, Ludmil B, Serena Nik-Zainal, David C Wedge, Samuel AJR Aparicio, Sam Behjati, Andrew V Biankin, Graham R Bignell, et al. 2013. “Signatures of Mutational Processes in Human Cancer.” Nature 500 (7463). Nature Publishing Group: 415.

Armenia, Joshua, Stephanie AM Wankowicz, David Liu, Jianjiong Gao, Ritika Kundra, Ed Reznik, Walid K Chatila, et al. 2018. “The Long Tail of Oncogenic Drivers in Prostate Cancer.” Nature Genetics 50 (5). Nature Publishing Group: 645.

Carter, Scott L, Kristian Cibulskis, Elena Helman, Aaron McKenna, Hui Shen, Travis Zack, Peter W Laird, et al. 2012. “Absolute Quantification of Somatic Dna Alterations in Human Cancer.” Nature Biotechnology 30 (5). Nature Publishing Group: 413.

Favero, Francesco, Tejal Joshi, Andrea Marion Marquard, Nicolai Juul Birkbak, Marcin Krzystanek, Qiyuan Li, Z Szallasi, and Aron Charles Eklund. 2014. “Sequenza: Allele-Specific Copy Number and Mutation Profiles from Tumor Sequencing Data.” Annals of Oncology 26 (1). European Society for Medical Oncology: 64–70.

Gaujoux, Renaud, and Cathal Seoighe. 2010. “A Flexible R Package for Nonnegative Matrix Factorization.” BMC Bioinformatics 11 (1). BioMed Central: 367.

Krueger, Felix. 2015. “Trim Galore.” A Wrapper Tool Around Cutadapt and FastQC to Consistently Apply Quality and Adapter Trimming to FastQ Files.

Li, Heng. 2013. “Aligning Sequence Reads, Clone Sequences and Assembly Contigs with Bwa-Mem.” arXiv Preprint arXiv:1303.3997.

Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. 2009. “The Sequence Alignment/Map Format and Samtools.” Bioinformatics 25 (16). Oxford University Press: 2078–9.

Macintyre, Geoff, Teodora E Goranova, Dilrini De Silva, Darren Ennis, Anna M Piskorz, Matthew Eldridge, Daoud Sie, et al. 2018. “Copy Number Signatures and Mutational Processes in Ovarian Carcinoma.” Nature Genetics 50 (9). Nature Publishing Group: 1262.

Mayakonda, Anand, and H Phillip Koeffler. 2016. “Maftools: Efficient Analysis, Visualization and Summarization of Maf Files from Large-Scale Cohort Based Cancer Studies.” BioRxiv. Cold Spring Harbor Laboratory, 052662.

McLaren, William, Laurent Gil, Sarah E Hunt, Harpreet Singh Riat, Graham RS Ritchie, Anja Thormann, Paul Flicek, and Fiona Cunningham. 2016. “The Ensembl Variant Effect Predictor.” Genome Biology 17 (1). BioMed Central: 122.

“Picard Toolkit.” 2019. Broad Institute, GitHub Repository. http://broadinstitute.github.io/picard/; Broad Institute.

Shen, Ronglai, and Venkatraman E. Seshan. 2016. “FACETS: Allele-Specific Copy Number and Clonal Heterogeneity Analysis Tool for High-Throughput DNA Sequencing.” Nucleic Acids Research 44 (16): e131–e131. https://doi.org/10.1093/nar/gkw520.

Van der Auwera, Geraldine A, Mauricio O Carneiro, Christopher Hartl, Ryan Poplin, Guillermo Del Angel, Ami Levy-Moonshine, Tadeusz Jordan, et al. 2013. “From Fastq Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline.” Current Protocols in Bioinformatics 43 (1). Wiley Online Library: 11–10.

Wang, Shixiang, and Xuesong Liu. 2019. “The Ucscxenatools R Package: A Toolkit for Accessing Genomics Data from Ucsc Xena Platform, from Cancer Multi-Omics to Single-Cell Rna-Seq.” Journal of Open Source Software 4 (40): 1627.

Yi, Kijong, and Young Seok Ju. 2018. “Patterns and Mechanisms of Structural Variations in Human Cancer.” Experimental & Molecular Medicine 50 (8). Nature Publishing Group: 98.