Supplementary Figure 1 Processing of mutations and generation of simulated controls. On the left, a diagram illustrates the manner in which covariate-matched simulated mutations were obtained, filtered to remove potential false positives from mapping errors and split into experimental and validation subsets. The panel on the upper right shows the fraction of mutations in each RegulomeDB category that were filtered out owing to a high mismap score. Also depicted is a Venn diagram showing the number of mutations filtered out as potential false positives from mapping errors as well as the overlap of these mutations with difficult-to-align regions of the genome. These mutations are enriched in category 3b as well as in regions with no regulatory annotations (6 and 7). The panel on the middle right shows the breakdown of transcript annotations for real and simulated mutations in each RegulomeDB category. The panel on the bottom right shows the distributions of replication timing and base-pair composition for simulated and real mutations for each cancer type. The panel on the bottom left shows the similarity in the distributions of the number of mutations per sample for the experimental and validation subsets in each cancer type.
Supplementary Figure 2 Mutation calling quality metrics. (a) The distribution of the variant allele fraction for each cancer type is shown via violin plots. (b) A scatter plot shows the relationship between genome sequencing file size and the number of mutations called for each sample. (c) Box plots and overlaid points depict the median coverage of each sample grouped by cancer type. BRCA, breast invasive carcinoma; GBM, glioblastoma multiforme; HNSC, head and neck squamous cell carcinoma; KIRC, kidney renal clear cell carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; OV, ovarian serous cystadenocarcinoma; UCEC, uterine corpus endometrial carcinoma.
Supplementary Figure 3 Similarity of sets of transcription factor bound mutations. For each pair of transcription factors shown in Figure 2g, the Jaccard similarity was computed on the basis of the overlap in the genomic positions mutated with RegulomeDB transcription factor annotations for the two factors. Factors were clustered on the basis of this similarity score, and the scores are plotted here as a heat map. The average enrichment score of real versus simulated mutations for all cancer types for each transcription factor is shown below the transcription factor labels on the x axis.
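The Jaccard similarity used for this heat map can be illustrated with a minimal sketch. The factor names (TF_A, TF_B, TF_C), the positions and the function names below are hypothetical and chosen only for illustration; this is not the authors' code, and the clustering step is not shown.

```python
from itertools import combinations


def jaccard(sites_a, sites_b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(sites_a), set(sites_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def pairwise_jaccard(sites_by_factor):
    """Return {(factor_1, factor_2): similarity} for every unordered pair of factors."""
    scores = {}
    for f1, f2 in combinations(sorted(sites_by_factor), 2):
        scores[(f1, f2)] = jaccard(sites_by_factor[f1], sites_by_factor[f2])
    return scores


# Hypothetical mutated positions (chromosome, coordinate) annotated for each factor.
example_sites = {
    "TF_A": {("chr1", 100), ("chr1", 250), ("chr2", 40)},
    "TF_B": {("chr1", 100), ("chr2", 40), ("chr3", 7)},
    "TF_C": {("chr5", 900)},
}

scores = pairwise_jaccard(example_sites)
for pair, score in sorted(scores.items()):
    print(pair, score)
```

In this toy input, TF_A and TF_B share two of four distinct positions, giving a similarity of 0.5, whereas TF_C shares none with either factor. A matrix of such scores is the kind of input that could then be passed to a hierarchical clustering routine.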
Supplementary Figure 4 Mutational patterns in transcription factor binding sites. (a) An analysis was performed to identify all transcription factor motifs with an increased match score in mutant sites compared to reference sites. Only mutations in sites for the CEBP factors were used for this analysis. (b) The sequences surrounding the mutations were aligned using TTG(T/C) as the seed. This seed motif, the aligned reference and the aligned mutant sequences are shown as well as a histogram of the number and type of mutations at each position. (c) The most common sequences of eight bases in length contributing to the motif in b are shown. (d) The counts of mutations from these sites by patient are shown. One patient with UCEC has a disproportionate number of these mutant sites. (e) Box plots of RNA-seq expression values for samples with and without CEBP mutations are shown for the factors matching CEBP motifs or motifs with a higher match score in a. (f) Seed, reference and variant alignments as well as mutation counts by position are shown for the factors from Figure 3f.
Supplementary Figure 5 Mutation probability fitting and model validation test. (a) Logistic regression allows for the calculation of the probability of mutation conditioned on replication timing, base-pair composition, transcript type and patient ID. Box plots of predicted probabilities across all patients are shown for the various combinations of transcript region, base-pair type and replication timing bin. (b) The fractions of sites identified in the validation set that can be found in the experimental set and vice versa are plotted, showing the robustness of the method even with a small number of patient samples. (c) A box plot depicting the difference in log10 RNA-seq expression data for PLCXD1 in samples either with or without a mutation at chr. X: 197,480. P value was determined by the bootstrap method, as the data were not normally distributed.
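The legend does not specify which bootstrap procedure was used for the P value in c. The sketch below shows one common variant, a two-sided bootstrap test of the difference in group means in which both groups are resampled with replacement from the pooled data under the null hypothesis. The function name, the test statistic, the number of resamples and the expression values are all assumptions made for illustration, not the authors' procedure or data.

```python
import random
from statistics import mean


def bootstrap_p_value(group_a, group_b, n_resamples=2000, seed=0):
    """Two-sided bootstrap P value for a difference in means (assumed statistic).

    Under the null hypothesis both groups come from the same distribution,
    so each resample draws both groups, with replacement, from the pooled values.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)

    extreme = 0
    for _ in range(n_resamples):
        resampled_a = rng.choices(pooled, k=n_a)
        resampled_b = rng.choices(pooled, k=n_b)
        if abs(mean(resampled_a) - mean(resampled_b)) >= observed:
            extreme += 1

    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical log10 expression values, not taken from the study.
with_mutation = [3.0, 3.1, 2.9, 3.2, 3.0, 3.05]
without_mutation = [1.0, 1.1, 0.9, 1.2, 1.0, 0.95]

p_separated = bootstrap_p_value(with_mutation, without_mutation)
p_identical = bootstrap_p_value([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
print(p_separated, p_identical)
```

Because the test makes no assumption of normality, it suits skewed expression data of the kind described in c. Other statistics, such as a difference in medians, could be substituted for the mean without changing the structure of the test.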
Supplementary Figure 6 Screening for functional mutated regulatory elements. Wild-type and mutant versions of four control regions and ten repeatedly mutated regulatory regions, including one of the TERT promoter mutations, were assayed for their ability to enhance the transcriptional activity of a minimal promoter using a luciferase assay. Constructs were assayed in NCI-H1437 lung adenocarcinoma cells.