Supplementary Information Supplementary Figures

Supplementary Figure 1. Frequency of the most highly recurrent gene fusions in 333 prostate cancer patients from the TCGA. The Y-axis shows the number of patients. Previously discovered gene fusions are marked in blue and de novo gene fusions are marked in black. Gene partners of the de novo fusions are marked in red if they have previously been shown to be associated with prostate cancer.

Supplementary Figure 2. Circos plot of the recurrent gene fusions. Inter-chromosomal gene fusions are shown as orange links and intra-chromosomal gene fusions are shown as blue links. For inter-chromosomal gene fusions, the link widths are scaled according to how many patients in the TCGA PRAD cohort harbor the gene fusion. No scaling is used for intra-chromosomal gene fusions due to space limitations. Gene names are provided for the genes that are most frequently involved in recurrent gene fusions.

Supplementary Figure 3. Application of INTEGRATE-Neo to the TCGA PRAD cohort. (a) Percentage of gene fusion neoantigens. 15% of gene fusion peptides are predicted to produce neoantigens in prostate cancer. (b) Epitope length distribution. The epitopes cover all the amino acid lengths used (8-11 amino acids). Lengths of 9 and 10 amino acids are predicted most frequently. (c) Binding affinity score distribution. The epitope binding affinity scores of the gene fusion neoantigens are skewed toward smaller values (higher binding affinities). (d) Plot of epitope length and binding affinity score. The score distribution pattern also holds within each epitope length.
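As an illustration of the epitope lengths in panel (b), the sketch below enumerates the 8-11 amino acid windows of an in-frame fusion peptide that contain residues from both fusion partners. It is a minimal sketch and not INTEGRATE-Neo's implementation: the function name, the `junction` convention, and the toy peptide are assumptions made here for illustration.

```python
def junction_spanning_peptides(peptide, junction, lengths=range(8, 12)):
    """Return (start, kmer) pairs for windows that cross the fusion junction.

    Hypothetical helper. `junction` is the 0-based index of the first
    residue contributed by the 3' partner, so a window crosses the junction
    when it contains both residue `junction - 1` and residue `junction`.
    """
    windows = []
    for k in lengths:
        # Earliest start that still reaches the first 3' residue.
        first = max(0, junction - k + 1)
        # Latest start that still includes the last 5' residue and fits.
        last = min(junction - 1, len(peptide) - k)
        for start in range(first, last + 1):
            windows.append((start, peptide[start:start + k]))
    return windows


# Toy peptide (letters only, not a real sequence): residues 0-9 stand in for
# the 5' partner and residues 10-19 for the 3' partner.
toy_peptide = "ABCDEFGHIJKLMNOPQRST"
candidates = junction_spanning_peptides(toy_peptide, junction=10)
```

Each candidate window would then be scored against a patient's HLA alleles by a binding predictor; that step is outside this sketch.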

Supplementary Figure 4. Epitopes predicted for a TMPRSS2-ERG gene fusion transcript. The fusion joins exon 2 of the TMPRSS2 transcript (blue) with Ensembl Id ENST and exon 4 of the ERG transcript (red) with Ensembl Id ENST. It is in-frame at the fusion junction. The fusion junction is supported by 3 to 58 spanning RNA-seq reads in different TCGA patients. Different epitopes are predicted with varying binding affinities in patients with different HLA alleles.

Supplementary Figure 5. Epitope affinities by HLA alleles and recurrent gene fusions. (a) Boxplot of epitope affinities by HLA allele. The most frequent HLA alleles, i.e., binding with 5 epitopes, are included. Gene fusion epitope binding scores can be stronger for certain HLA alleles than for others. (b) Boxplot of epitope affinities by the recurrence status of gene fusions. Gene fusion epitope binding scores for singleton gene fusions (mean=163.2 nM) are not significantly different (p=0.22, Welch two-sample t-test) from binding scores for recurrent gene fusions (mean=132.1 nM).
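For readers unfamiliar with the Welch two-sample t-test used in panel (b), the following minimal sketch computes its test statistic and the Welch-Satterthwaite degrees of freedom with the Python standard library. The function name and the two sample lists are invented for illustration; they are not the study's binding scores.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    # Squared standard error of each sample mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up toy values standing in for two groups of binding scores.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(group_a, group_b)
```

The p-value is obtained by comparing `t_stat` against a Student t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test in one call.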

Supplementary Figure 6. Runtime and memory usage of INTEGRATE-Neo. Time is measured in seconds and memory is measured in gigabytes (GB).

Supplementary Methods

1. Discovery of gene fusion neoantigens with TCGA PRAD data

The RNA-seq data of a cohort of 333 prostate cancer patients generated by The Cancer Genome Atlas Research Network ( were aligned to the human reference genome (GRCh38) using the Genome Modeling System (Griffith, et al., 2015) with TopHat2 v2.0.8 (Kim, et al., 2013). The BAM files and Ensembl v85 (Hubbard, et al., 2002) gene models were provided to INTEGRATE v0.2.6, which was run in its RNA-seq only mode to discover gene fusions (Zhang, et al., 2016). INTEGRATE-Neo predicted the peptide expression of the gene fusions using the RNA-seq data from the TCGA PRAD cohort. However, because this prediction relies on RNA-seq, it is possible that some gene fusions are expressed at low levels or that nonsense-mediated decay is activated.

As done previously (Angelova, et al., 2015; Brown, et al., 2014), BWA v (Li and Durbin, 2009) was used with default parameters to align the RNA-seq reads from the primary tumor samples to the reference HLA alleles provided by the HLAminer v1.3 package (Warren, et al., 2012). HLA alleles of the prostate cancer patients were predicted from the resulting SAM files with HLAminer v1.3 (Warren, et al., 2012) using default parameters. The output TSV and log files from HLAminer were processed with the HLAminerToTsv tool in the conversion module of INTEGRATE-Neo. The columns include the four-digit HLA allele name, source, score, e-value, and confidence. Using these TSV files and the BEDPE files from INTEGRATE as input to INTEGRATE-Neo v1.1, we discovered gene fusion neoantigens and their epitopes for each patient.

References

Angelova, M., et al. (2015) Characterization of the immunophenotypes and antigenomes of colorectal cancers reveals distinct tumor escape mechanisms and novel targets for immunotherapy. Genome Biol., 16, 64.
Brown, S.D., et al. (2014) Neo-antigens predicted by tumor genome meta-analysis correlate with increased patient survival. Genome Res., 24(5),
Griffith, M., et al. (2015) Genome Modeling System: A Knowledge Management Platform for Genomics. PLoS Comput. Biol., 11(7), e
Hubbard, T., et al. (2002) The Ensembl genome database project. Nucleic Acids Res., 30(1),
Kim, D., et al. (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol., 14(4), R36.
Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14),
Warren, R.L., et al. (2012) Derivation of HLA types from shotgun sequence datasets. Genome Med., 4(12), 95.
Zhang, J., et al. (2016) INTEGRATE: gene fusion discovery using whole genome and transcriptome data. Genome Res., 26(1),
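
The per-patient HLA table described in the methods above (allele name, source, score, e-value, confidence) could be consumed along the following lines. This is a minimal sketch under stated assumptions: the field names, the absence of a header row, the `GENE*xx:yy` allele format, the "keep the two best-scoring alleles per gene" rule, and all example values are hypothetical and are not taken from INTEGRATE-Neo or HLAminer.

```python
import csv
import io
from collections import defaultdict

# Hypothetical field names for the five columns listed in the methods.
FIELDS = ["allele", "source", "score", "e_value", "confidence"]


def top_alleles(tsv_text, per_gene=2):
    """Group alleles by HLA gene and keep the highest-scoring ones."""
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=FIELDS, delimiter="\t")
    by_gene = defaultdict(list)
    for row in reader:
        # Assumes names such as "HLA-A*02:01"; the gene is the part before "*".
        gene = row["allele"].split("*")[0]
        by_gene[gene].append((float(row["score"]), row["allele"]))
    return {
        gene: [allele for _, allele in sorted(hits, reverse=True)[:per_gene]]
        for gene, hits in by_gene.items()
    }


# Invented example rows, for illustration only.
example_tsv = (
    "HLA-A*02:01\tHLAminer\t120.0\t1e-10\t99.0\n"
    "HLA-A*01:01\tHLAminer\t80.0\t1e-6\t60.0\n"
    "HLA-A*03:01\tHLAminer\t10.0\t1e-2\t20.0\n"
    "HLA-B*07:02\tHLAminer\t95.5\t1e-8\t80.0\n"
)
selected = top_alleles(example_tsv)
```

Keeping at most two alleles per gene reflects that each patient carries two copies of each HLA gene; whether and how the actual pipeline filters alleles is not stated in the source.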