Tumor (n = 4,366) Normal (n = 364)

Size: px
Start display at page:

Download "Tumor (n = 4,366) Normal (n = 364)"

Transcription

1 Tumor (n = 4,366) Normal (n = 364) PRADA: Pipeline for RNA sequencing Data Analysis Identify fusion transcripts with - at least 2 discordant read pairs (DRP) - at least 1 perfect junction spanning read (JSR) 26,995 fusion pairs 2,555 fusion pairs Filtered by gene homology (E-value >.1) 2,857 fusion pairs 2,116 fusion pairs Filtered by transcript allele fraction (TAF-A >.1 & TAF-B >.1) 11,14 fusion pairs 34 fusion pairs Filtered by partner gene variety (p <.1) 9,47 fusion pairs 192 fusion pairs 8,695 tumor specific fusion pairs 158 kinds of fusions in normal tissue 352 tumor fusion pairs (42 kinds of fusions) were overlapped. Supplementary Figure 1. A flowchart of fusion filtering To obtain a bona-fide fusion, fusion filtering was performed based on the number of discorndant read pairs, perfect match junction spanning reads, gene homology, transcript allele fraction, and partner gene variety. By using RNA seq data obtained from normal samples, tumor-specific fusions were extracted.

2 BLCA BRCA HNSC KIRC * 2268 genes 4572 genes 3472 genes LUAD LUSC THCA 3787 genes * ** * Five normal samples * located in tumor cluster 462 genes 5353 genes 2933 genes Supplementary Figure 2. Detection of normal samples with a high likelihood of tumor cell contamination. Supervised clustering was performed by using different expressed gene between tumor and normal samples. Five normal samples (red asterisk) were clustered into tumor cluster.

3 A 1. Gene A.8 Fraction BLCA BRCA GBM HNSC KIRC LAML LGG LUAD LUSC OV PRAD SKCM THCA B 1. Gene B.8 Fraction BLCA BRCA GBM HNSC KIRC LAML LGG LUAD LUSC OV PRAD SKCM THCA high-level amplification low-level amplification no amplification or deletion low-level deletion high-level deletion NA Supplementary Figure 3 Copy number status of each gene for 13 tumor types Bar plots represent the fraction of five types of copy number status for each tumor type. Gene-level copy number alteration was identified by applying GISTIC 2. to TCGA level 3 copy number data for each tumor type. A few genes constituting of fusions were not available in GISTIC 2. due to incompatibility of gene symbols between fusion and copy number data and these genes were annotated as NA.

4 A gene A gene B In-frame 5UTR-CDS Out-of-frame ESR antibody (ps118) antibody (c-terminal) 1. Modulating (Transactivation AF-1) (1-184) 2. DNA binding (185-25) 3. Hinge (251-31) 4. Transactivation AF2 ( ) & Steroid binding ( ) B 1. Modulating (Transactivation AF-1) (1-184) ESR1 C6orf97 2. DNA binding (185-25) ESR1 USP25 ESR1 BNC2 3. Hinge (251-31) ESR1 ASPH ESR1 C6orf97 4. Transactivation AF2 ( ) & Steroid binding ( ) ESR1 MTHFD1L C ER alpha ER alpha ps junction point junction point Supplementary Figure 4. The details of ESR1 fusions in breast cancer. (A) Dot plots with ESR1 transcript structure demonstrate the frequency of ESR1 fusions and junction points for each fusion. (B) Exon expression plots show the Z-normalized exon expression level for six ESR1 fusions located on functional domain (C) Bar plots with boxwhisker plots represent RPPA protein expression level for ER alpha (left) and ER alpha ps118 in the overlapped breast cancer samples between fusion and RPPA data sets. Box-Whisker plot represent the mean of protein expression in ESR1 fusion negative samples.

5 A Fraction B Fraction of fusions with CN In-frame CDS-5UTR CDS-3UTR 5UTR-CDS 5UTR-5UTR 5UTR-3UTR 3UTR-CDS 3UTR-5UTR 3UTR-3UTR Out-of-frame not classified SKCM KIRC LUSC BLCA BRCA LUAD OV PRAD HNSC LGG GBM THCA LAML Intrachromosomal (< 1Mb) with / without CN Intrachromosomal (> 1Mb & <1Mb) with / without CN Intrachromosomal (> 1Mb) with / without CN Interchromosomal with / without CN. Fraction of fusions without CN.4.8 THCA KIRC LAML LGG OV HNSC SKCM GBM LUAD BRCA LUSC PRAD BLCA Supplementary Figure 5. The distribution of fusion transcripts inferred as in-frame (A) Bar plots show the fraction of in-frame fusions. Tumor types were sorted by the fraction of in-frame fusion. (B) Each bar represents the fraction of eight types of in-frame fusions. Tumor types were sorted by the fraction of samples in which at least a single fusion transcript was detected per tumor type.

6 BLCA (n = 94) * (n>5) 5 * significant mutation (right axis) 3.6e per Mp Supplementary Figure 6. Relationship between fusion and mutation events for each tumor type. The frequencies of somatic mutations (lightblue) and significant mutation (pink) were shown as bar. To compare the frequency between samples s t test was performed. A heatmap shows recurrent fusions (green) and significant mutation events in each tumor types. Prostate cancer and melanoma samples with recurrent fusions had no available mutation data.

7 Supplementary Fig. 6 per Mb BRCA (n=754) significant mutation (right axis) Welch s t-test, p =.35 Welch s t-test, p =

8 Supplementary Fig. 6 GBM (n=147) significant mutation (right axis) Welch s t-test, p =.97 Welch s t-test, p =.89 per Mb

9 Supplementary Fig. 6 6 HNSC (n = 296) Welch s t-test, p =.23 significant mutation (right axis) Welch s t-test, p =.3 * (n>6) *.6 per Mb

10 Supplementary Fig. 6 5 KIRC (n = 4) significant mutation (right axis) Welch s t-test, p =.39 Welch s t-test, p =.63.5 per Mb

11 Supplementary Fig. 6 2 LAML (n = 167) significant mutation (right axis) Welch s t-test, p =.48 Welch s t-test, p = 4.8e-15.2 per Mb: 1.1

12 Supplementary Fig. 6 LUAD (n = 171) * (n>1) * significant mutation (right axis) Welch s t-test, p =.23 Welch s t-test, p =.32 per Mb

13 Supplementary Fig.6 1 LUSC (n = 177) * (n>1) significant mutation (right axis) Welch s t-test, p =.72 Welch s t-test, p =.32 * 1. per Mb

14 Supplementary Fig. 6 6 OV (n = 266) significant mutation (right axis) Welch s t-test, p =.26 Welch s t-test, p = per Mb CCDC6_ANK3 KAT6B_ADK TP53 NF1 BRCA1 CDK12 RB1

15 Supplementary Fig. 6 THCA (n = 318) Welch s t-test, p =.14 5 significant mutation (right axis) Welch s t-test, p = 8.3e-88.5 per Mb CCDC6_RET ETV6_NTRK3 PAX8_PPARG NCOA4_RET PPARG_PAX8 ERC1_RET BRAF NRAS HRAS NUDT11 EIF1AX

16 A gene A gene B KDM5A Jumonji N domain (19-6) 2. ARID domain (84-174) 3. Jumonji C domain (437-63) B LUAD KDM5A RAE1 BRCA KDM5A NINJ2 BRCA KDM5A ANO2 BRCA WNK1 KDM5A Supplementary Figure 7. KDM5A fusions as a therapeutic target (A) Position of each domain in KDM5A gene and junction points of KDM5A fusions. (B) Exon expression plots demonstrated Z-normalized exon expression for each exon in thyroid cancers. Red and blue represent relatively high and low exon expression.

17 read junction point M Fused transcript A Fused transcript B exon exon exon exon N A N B exon exon exon exon exon exon Normal transcript A Normal transcript B Transcript allele fraction (TAF) for transcript A = M M + N A Transcript allele fraction (TAF) for transcript B = M M + N B Supplementary Figure 8. Transcript allele fraction (TAF) This image shows the definition of transcript allele fraction (TAF). The TAF score is measured as the ratio of junction spanning reads to total reads crossing over junction point mapped to the reference transcript.

18 A PML in LAML COL3A1 in BRCA Variety of partner gene = 1 (Number of fusions = 11) Variety of partner gene = 18 (Number of fusions = 28) B 5 Variety of partner genes random distribution (no specific partner) Number of fusions comprising one specific gene Supplementary Figure 9. Diversity of partner gene variety (A) Circos plots show representative examples indicating low (PML in acute myeloid leukemia) and high (COL3A1 in breast cancer) partner gene variety. (B) The graph demonstrates random distribution of partner gene variety. Red and grey dashed lines depict the maximum and mean of PGV per number of fusion.

19 RNA fusion transcript exon exon exon exon break point junction point DNA exon exon exon exon exon window exon High confidence read paired read Medium confidence read or read Supplementary Figure 1. The concept of validating fusion transcripts To consider the difference in structure between DNA and RNA, breakpoints (yellow) were searched within the window from the inferred junction point. High confidence validation means the presence of paired end reads showing breakpoints within the windown from junction point.