Supplementary Data for DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding.

Similar documents
Transcription factor binding site prediction in vivo using DNA sequence and shape features

Supplementary Data Bioconductor Vignette for DNAshapeR Package. DNAshapeR: an R/Bioconductor package for DNA shape prediction and feature encoding

Genomic Regions Flanking E-Box Binding Sites Influence DNA Binding Specificity of bhlh Transcription Factors through DNA Shape

Nature Biotechnology: doi: /nbt Supplementary Figure 1

Article Predicting Variation of DNA Shape Preferences in Protein-DNA Interaction in Cancer Cells with a New Biophysical Model

Gene Regulation Solutions. Microarrays and Next-Generation Sequencing

Many transcription factors! recognize DNA shape

Analysis of Microarray Data

Learning a Weighted Sequence Model of the Nucleosome Core and Linker Yields More Accurate Predictions in Saccharomyces cerevisiae and Homo sapiens

Lecture #1. Introduction to microarray technology

Mentor: Dr. Bino John

Humboldt Universität zu Berlin. Grundlagen der Bioinformatik SS Microarrays. Lecture

Supplementary Figure 1

RNA-Sequencing analysis

MCB 110:Biochemistry of the Central Dogma of MB. MCB 110:Biochemistry of the Central Dogma of MB

Introduction to gene expression microarray data analysis

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1

Supplementary Figure 1: sgrna library generation and the length of sgrnas for the functional screen. (a) A diagram of the retroviral vector for sgrna

Sequence Databases and database scanning

RNA-Seq analysis using R: Differential expression and transcriptome assembly

BayesPI-BAR: a new biophysical model for characterization of regulatory sequence variations

Gene Expression Profiling and Validation Using Agilent SurePrint G3 Gene Expression Arrays

Discovery of Transcription Factor Binding Sites with Deep Convolutional Neural Networks

Convolutional Kitchen Sinks for Transcription Factor Binding Site Prediction

Methods of Biomaterials Testing Lesson 3-5. Biochemical Methods - Molecular Biology -

What Are the Chemical Structures and Functions of Nucleic Acids?

Lecture 7: April 7, 2005

DNA Microarray Data Oligonucleotide Arrays

Analysis of Microarray Data

Chapter 1 Analysis of ChIP-Seq Data with Partek Genomics Suite 6.6

The Next Generation of Transcription Factor Binding Site Prediction

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

TERTIARY MOTIF INTERACTIONS ON RNA STRUCTURE

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Outline. Analysis of Microarray Data. Most important design question. General experimental issues

How to view Results with Scaffold. Proteomics Shared Resource

Sequence Analysis Lab Protocol

Introduction to Bioinformatics and Gene Expression Technologies

Gene Expression Data Analysis (I)

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

SUPPLEMENTAL MATERIALS

Introduction to Bioinformatics. Fabian Hoti 6.10.

Gene Expression Technology

Ana Teresa Freitas 2016/2017

6. GENE EXPRESSION ANALYSIS MICROARRAYS

CSE : Computational Issues in Molecular Biology. Lecture 19. Spring 2004

CRISPR/Cas9 Genome Editing: Transfection Methods

Supporting Information

Bio11 Announcements. Ch 21: DNA Biology and Technology. DNA Functions. DNA and RNA Structure. How do DNA and RNA differ? What are genes?

TraceTools: a new DNA basecaller

Next-Generation Sequencing Gene Expression Analysis Using Agilent GeneSpring GX

Lecture 2: Central Dogma of Molecular Biology & Intro to Programming

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

Genetically targeted all-optical electrophysiology with a transgenic Credependent

Multiplex Assay Design

Functional Genomics Overview RORY STARK PRINCIPAL BIOINFORMATICS ANALYST CRUK CAMBRIDGE INSTITUTE 18 SEPTEMBER 2017

Description of Logit-t: Detecting Differentially Expressed Genes Using Probe-Level Data

Genome Sequence Assembly

SAMPLE LITERATURE Please refer to included weblink for correct version.

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Long and short/small RNA-seq data analysis

Predicting Microarray Signals by Physical Modeling. Josh Deutsch. University of California. Santa Cruz

Molecular Biology Primer. CptS 580, Computational Genomics, Spring 09

Exam MOL3007 Functional Genomics

RNA Sequencing Analyses & Mapping Uncertainty

EECS730: Introduction to Bioinformatics

3.1.4 DNA Microarray Technology

Year III Pharm.D Dr. V. Chitra

4.1. Genetics as a Tool in Anthropology

CURRICULUM VITAE October 10, 2017 Dr. Remo Rohs

GABPa Binding to Overlapping ETS and CRE DNA Motifs Is Enhanced by CREB1: Custom DNA Microarrays

Technical note: Molecular Index counting adjustment methods

Accelerate High Throughput Analysis for Genome Sequencing with GPU

Introduction to BIOINFORMATICS

Read the question carefully before answering. Think before you write. If I can not read your handwriting, I will count the question wrong.

Genes - DNA - Chromosome. Chutima Talabnin Ph.D. School of Biochemistry,Institute of Science, Suranaree University of Technology

Experimental maps of DNA structure at nucleotide resolution distinguish intrinsic from protein-induced DNA deformations

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Number and length distributions of the inferred fosmids.

Barcode Sequence Alignment and Statistical Analysis (Barcas) tool

Network System Inference

De Novo Assembly of High-throughput Short Read Sequences

Procedia - Social and Behavioral Sciences 97 ( 2013 )

Understanding DNA Structure

Genome Architecture Structural Subdivisons

Péter Antal Ádám Arany Bence Bolgár András Gézsi Gergely Hajós Gábor Hullám Péter Marx András Millinghoffer László Poppe Péter Sárközy BIOINFORMATICS

Structural Bioinformatics (C3210) DNA and RNA Structure

Computational Biology I LSM5191

An Investigation of Palindromic Sequences in the Pseudomonas fluorescens SBW25 Genome Bachelor of Science Honors Thesis

Improving CRISPR-Cas9 Gene Knockout with a Validated Guide RNA Algorithm

Protein Structure Prediction. christian studer , EPFL

Mate-pair library data improves genome assembly

Axiom mydesign Custom Array design guide for human genotyping applications

Pyrosequencing. Alix Groom

Why learn sequence database searching? Searching Molecular Databases with BLAST

CFSSP: Chou and Fasman Secondary Structure Prediction server

Introduction to BioMEMS & Medical Microdevices DNA Microarrays and Lab-on-a-Chip Methods

The human noncoding genome defined by genetic diversity

A GENOTYPE CALLING ALGORITHM FOR AFFYMETRIX SNP ARRAYS

Protein Folding Problem I400: Introduction to Bioinformatics

Transcription:

Supplementary Data for DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding. Wenxiu Ma 1, Lin Yang 2, Remo Rohs 2, and William Stafford Noble 3 1 Department of Statistics, University of California Riverside, Riverside, CA 92521, USA 2 Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA 3 Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA. 1

Supplementary Notes 1 Computation of the DNA shape features Recently, the Rohs lab has developed DNAshape, a computational approach to predicting the three-dimensional double-helix DNA structure features from one-dimensional nucleotide sequences (Zhou et al., 2013; Chiu et al., 2016). The DNAshape method uses a five-base sliding window method where the double helix DNA structural features unique to each of the 512 distinct pentamers are defined by minor groove width (MGW), Roll, propeller twist (ProT), and helix twist (HelT). MGW and ProT represent base-pair parameters whereas Roll and HelT represent base pair-step parameters. The values for four DNA shape feature as function of its pentamer sequence were derived from Monte Carlo simulations. The DNAshape method is computationally efficient and its predictions are compatible with DNA structures solved by X-ray crystallography and NMR spectroscopy, hydroxyl radical cleavage data, statistical analysis and cross-validation, and molecular dynamics simulations. The DNAshape prediction method is publicly available at the DNAshape web server (http://rohslab.cmb.usc.edu/dnashape) and also in the DNAshapeR Bioconductor package (https://bioconductor.org/packages/release/bioc/html/dnashaper.html). 2 Comparison between compositional kernel and positional kernel 2.1 Pre-processing details We summarized the difference of the pre-processing details of the DREAM5 universal PBM (upbm) data between the Zhou et al. (2015) study and our study, in Supplementary Table 2. Starting from the raw PBM data, there are 40, 330 40, 526 probes (each 35 bp long) on each array. After the initial normalization step, in our study we used the entire set of full-length probes as our training and testing data. In the Zhou et al. (2015) study, after all the pro-processing, filtering, trimming steps, there are 27 2, 700 probes left for each array, where the probes are of length 17 20 bp. 2.2 Dimensionality of kernel space We summarized the kernel space dimensionality of our compositional k-spectrum kernel and k-spectrum+shape kernel with the positional k-mer and k-mer+shape kernels proposed by Zhou et al. (2015), in Supplementary Table 3. In Zhou et al. (2015), the 4L 14 shape features were further expanded to include second-order shape feature by adding products of the same shape parameter at two adjacent positions. To facilitate a direct comparison with the k-spectrum+shape kernel, we didn t include the second-order shape features. In the following discussion, the k-mer+shape kernel only contains the first-order shape features. As we can see, the positional kernel space depends on the input nucleotide length L. For a typical PBM experiment, the raw probe length is about 25 35 bp and the preprocessed and trimmed probe length is about 17 20 bp. Thus, the positional k-mer and k-mer+shape kernels have much higher dimension. 2.3 Performance evaluation on the pre-processed DREAM5 data We compared our compositional k-spectrum+shape kernel with the positional k-mer+shape kernel (Zhou et al., 2015) on the pre-processed DREAM5 data. Since the pre-processed probes were well aligned at known motif sites, the positional kernel has better performance than our compositional kernel on these pre-processed data (Supplementary Figure 4). 2

2.4 Performance evaluation on the raw DREAM5 data We also compared our compositional k-spectrum+shape kernel with the positional k-mer+shape kernel (Zhou et al., 2015) on the raw DREAM5 data. As shown in Supplementary Figure 5, our compositional kernel has outperformed the positional kernel on the raw, unaligned probe data. Supplementary References Berger, M. F., Philippakis, A. A., Qureshi, A. M., He, F. S., Estep, P. W., and Bulyk, M. L. (2006). Compact, universal dna microarrays to comprehensively determine transcription-factor binding site specificities. Nature Biotechnology, 24(11), 1429 1435. Chiu, T.-P., Comoglio, F., Zhou, T., Yang, L., Paro, R., and Rohs, R. (2016). prediction and feature encoding. Bioinformatics, 32(8), 1211 1213. DNAshapeR: an R/Bioconductor package for DNA shape Gordân, R., Shen, N., Dror, I., Zhou, T., Horton, J., Rohs, R., and Bulyk, M. L. (2013). Genomic regions flanking E-box binding sites influence DNA binding specificity of transcription factors through DNA shape. Cell Reports, 3(4), 1093 1104. Weirauch, M. T., Cote, A., Norel, R., Annala, M., Zhao, Y., Riley, T. R., Saez-Rodriguez, J., Cokelaer, T., Vedenko, A., Talukder, S., DREAM5 Consortium (including W. S. Noble), Bussemaker, H. J., Morris, Q. D., Bulyk, M. L., Stolovitzky, G., and Hughes, T. R. (2013). Evaluation of methods for modeling transcription factor sequence specificity. Nature Biotechnology, 31(2), 126 134. Zhou, T., Yang, L., Lu, Y., Dror, I., Machado, A. C. D., Ghane, T., Di Felice, R., and Rohs, R. (2013). DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale. Nucleic Acids Research, 41(W1), W56 W62. Zhou, T., Shen, N., Yang, L., Abe, N., Horton, J., Mann, R. S., Bussemaker, H. J., Gordân, R., and Rohs, R. (2015). Quantitative modeling of transcription factor binding specificities using dna shape. Proceedings of the National Academy of Sciences, 112(15), 4654 4659. 3

Supplementary Tables k-spectrum k-spectrum+shape di-mismatch di-mismatch+shape n (3 + 4k)n n 5n k = 1 2 14 k = 2 10 110 k = 3 32 480 32 160 k = 4 136 2,584 136 680 k = 5 512 11,776 512 2,560 k = 6 2,080 56,160 2,080 10,400 k = 7 8,192 253,952 8,192 40,960 k = 8 32,896 1,151,360 32,896 164,480 Supplementary Table 1: Number of features in different kernel space. Here n is the number of unique k-mers. Note that for the di-mismatch kernel and di-mismatch+shape kernel, the dimensionality of the kernel space do not depend on the di-mismatch parameter m. 4

Preprocessing steps 0. For each of the 66 mouse TFs, we took the median signal intensity values of the replicates of each probe on the PBM and then take log2 as the normalized upbm signal. 1. For each of the 66 mouse TFs, we obtained the normalized upbm signal intensities for all probes on the array, the 8-mer E-scores derived from the upbm data, and the best PWM for that TF, as reported by Weirauch et al. after analyzing PWMs obtained with 26 different algorithms (Weirauch et al., 2013) 2. For each of the 66 TFs, we scanned each upbm probe to identify the best PWM match on either the forward or reverse strand, accounting for all of the putative sites in the 35-bp variable probe region. The best PWM matches were used to align the upbm probes to each other. 3. For each TF, the selected probes were those for which the best PWM match fell at least L positions from the left end of the probe (corresponding to the free DNA end) and at least R positions from the right end of the scanned probe region. Restricting the location of the TF binding site by the L and R parameters minimized the effect of the positional bias on the upbm signal intensities used in the analyses. As shown in previous work (Gordân et al., 2013), flanking DNA sequences outside the PWM match can significantly affect TF binding and contribute to the PBM signal. For this reason, flanking regions outside the PWM match were included, as long as the PWM match fell within the limits defined by the L and R parameters. Several L/R pairs (L2R12, L2R15, L2R18, L5R5, L5R10, L5R12, L5R15, L5R18, L8R12, and L8R15) were tested. On average, L5R10 resulted in the most accurate models and, therefore, was chosen. However, using different L/R pairs did not markedly change the results of the comparisons between DNA shape-augmented models and traditional sequence-based models of DNA binding specificities. 4. The preceding step identified the best PWM match within each probe, within the limits defined by the L and R parameters, regardless of the PWM score. In this step, any probes for which the best PWM match did not correspond to a putative TF binding site (defined as a site containing at least two consecutive 8-mers with upbm E-score > ) were filtered out. This criterion is not a stringent cutoff for defining TF binding sites (Berger et al., 2006; Gordân et al., 2013), but it ensures that most of the selected probes do contain a specific TF binding site. 5. Although not common, some DNA probes on upbms can contain two or more TF binding sites. In such cases, it is not clear how much each binding site contributes to the PBM signal. To remove such probes from consideration, the probe region outside the central 12 bp (corresponding to the best PWM match) was scanned. Probes that contained a second potential binding site were filtered out. 6. Finally, for each TF, the selected probe sequences were trimmed to a length of T = Length(PWM)+2L bp, to ensure that the same amount of flanking sequence was used for each putative TF binding site. Zhou et al. (2015) Our study Supplementary Table 2: Preprocessing details for DREAM5 dataset. The preprocessing steps were copied from Zhou et al. (2015) with minor formatting changes. 5

compositional k-spectrum compositional k-spectrum+shape positional k-mer positional k-mer+shape n (3 + 4k)n 4 k (L k + 1) 4 k (L k + 1) + (4L 14) k = 1 2 14 4L 8L 14 k = 2 10 110 16L 16 20L 30 k = 3 32 480 64L 128 68L 142 Supplementary Table 3: Number of features in different kernel space. Here n is the number of unique k- mers. L is the length of the input nucleotide sequences. The positional k-mer+shape kernel only include first-order shape features. 6

k-spectrum k-spectrum+shape di-mismatch di-mismatch+shape di-mismatch di-mismatch+shape di-mismatch di-mismatch+shape m=1 m=1 m=2 m=2 m=3 m=3 k=1 256 0.8899 (+/-058) (+/-029) k=2 602 0.9789 (+/-090) (+/-026) k=3 0.9562 0.9843 915 0.9579 (+/-055) (+/-028) (+/-068) (+/-045) k=4 0.9767 0.9847 0.9772 0.9813 0.9572 0.9796 (+/-027) (+/-023) (+/-022) (+/-013) (+/-030) (+/-014) k=5 0.9805 0.9855 0.9815 0.9740 0.9822 0.9808 0.9798 0.9838 (+/-027) (+/-015) (+/-025) (+/-035) (+/-022) (+/-021) (+/-025) (+/-022) k=6 0.9717 0.9778 0.9752 0.9770 0.9854 0.9869 0.9874 0.9885 (+/-024) (+/-068) (+/-043) (+/-018) (+/-030) (+/-019) (+/-022) (+/-018) k=7 0.9766 0.9796 0.9797 0.9797 0.9876 0.9832 0.9860 0.9867 (+/-023) (+/-024) (+/-025) (+/-016) (+/-019) (+/-018) (+/-014) (+/-025) k=8 0.9751 0.9831 0.9761 0.9739 0.9865 0.9812 0.9884 0.9777 (+/-025) (+/-019) (+/-033) (+/-039) (+/-020) (+/-027) (+/-018) (+/-028) Supplementary Table 4: AUROC performance of distinguishing Exd-Antp and Exd-Scr binding sites in SELEX-seq data. Numbers in the parentheses represent the variance of AUROC scores among the five outer cross-validation folds. For each pairwise comparison between k-spectrum and k-spectrum+shape models and between di-mismatch and di-mismatch+shape models, higher AUROC scores are marked in bold font. The highest AUROC score among all models is colored in blue. 7

Supplementary Figures (a) 0.8 1.0 0 1 2 3 4 5 Antp normalized intensity Density Antp n = 39585 mean = 8 median = 7 (b) 0.7 0.9 0 1 2 3 4 Scr normalized intensity Density Scr n = 30872 mean = 8 median = 7 (c) 0.8 1.0 0.7 0.9 Antp normalized intensity Scr normalized intensity Antp & Scr common n = 15538 7239 positives 1460 negatives Supplementary Figure 1: Exd-Hox SELEX-seq data. (a) Histogram of normalized binding affinity values for the Exd-Antp SELEX-seq data. (b) Histogram of normalized binding affinity values for the Exd-Scr SELEXseq data. (c) Scatter plot of the normalized Antp and Scr binding affinities for the common sequences. Each dot represents one DNA sequence. Red dots are sequences with positive labels, which have relatively higher binding affinity for Antp. Green dots are sequences the negative labels, which have relatively higher binding affinity for Scr. 8

(a) (b) (11) (1) R 2 (1mer+shape) (5) (1) (4) (6) 0 0 R 2 (1mer) 0 0 (d) (c) (1) R 2 (2mer+shape) (11) (5) (1) (4) (6) 0 5 0 5 0 R 2 (2mer) (e) (f ) (11) (1) R 2 (3mer+shape) (5) (1) (4) (6) 0 5 0 R 2 (3mer) (g) (h) (11) R 2 (5mer+shape) 5 (1) (5) (1) (4) (6) 0 R 2 (5mer) 4 8 Supplementary Figure 2: R2 performance for k-spectrum model verse k-spectrum+shape model on upbm data set, k = 1, 2, 3, 5. (a)(c)(e)(g) Scatter plots of the R2 performance values between the two models. Each dot represents one TF, colored corresponding to its protein family. (b)(d)(f )(h) Barplots of R2 improvements for various protein families. Numbers in the parentheses are the number of DREAM5 TFs in each TF family. The x-axis shows the differences of R2 values between the two models. The length of the bars represent the mean of R2 differences and the error bars mark the standard error of the mean. 9

(a) (b) Percent of positive 0 20 40 60 80 100 Percent of positive 0 20 40 60 80 100 3 4 5 6 7 8 4 5 6 7 8 (c) (d) (e) (f) Average R 2 kmer kmer.1m Average R 2 kmer kmer.2m 3 4 5 6 7 8 4 5 6 7 8 k k Supplementary Figure 3: Comparison between k-spectrum and di-mismatch models. (a) Percent of DREAM5 TF datasets that have higher R 2 values using the di-mismatch model than using the k-spectrum model, for k = 3,, 8 and m = 1. (c) Differences of R 2 values between the two models, for k = 3,, 8 and m = 1. (e) R 2 performance scores of various k-specturm models (blue) and di-mismatch models (red), for k = 3,, 8 and m = 1. (b)(d)(f) Corresponding plots for di-mismatch paramters k = 4,, 8 and m = 2. 10

1nt 1nt+shape 2nt 2nt+shape 3nt 3nt+shape 1mer 1mer+shape 2mer 2mer+shape 3mer 3mer+shape R 2 Supplementary Figure 4: Comparison between positional kenels and compositional kernels on the pre-processed DREAM5 upbm data. The k-nt and k-nt+shape models represent the positional kernel approaches as described in Zhou et al. (2015); whereas k-mer and k-mer+shape models represent our compositional k- spectrum and k-spectrum+shape kernels. The probe preprocessing steps were described in Supplementary Note 2.1 and the pre-processed probe data was obtained from Zhou et al. (2015). The bar heights indicate the average R 2 values among 65 mouse TF upbm dataset and the error bars mark the standard error. 1nt 1nt+shape 2nt 2nt+shape 3nt 3nt+shape 1mer 1mer+shape 2mer 2mer+shape 3mer 3mer+shape R 2 Supplementary Figure 5: Comparison between positional kernels and compositional kernels on the raw DREAM5 data. The k-nt and k-nt+shape models represent the positional kernel approaches as described in Zhou et al. (2015); whereas k-mer and k-mer+shape models represent our compositional k-spectrum and k- spectrum+shape kernels. The bar heights indicate the average R 2 values among all 66 mouse TF upbm dataset and the error bars mark the standard error. 11

(1) (5) (1) (4) (6) 0 5 0 (1) 5 (5) (1) (4) (6) 0 4 8 (f ) (11) (1) (5) (1) (4) (6) 00 10 20 30 R 2 (5mer.1m) (g) (h) R 2 (5mer.2m+shape) R 2 (4mer.1m) (e) R 2 (5mer.1m+shape) (11) (1) (5) (1) (4) (6) 0 2 R 2 (5mer.2m) 4 (j) (i) (11) (1) R 2 (5mer.3m+shape) R 2 (3mer.1m) (11) R 2 (4mer.1m+shape) (11) R 2 (3mer.1m+shape) (d) (c) (b) (a) (5) (1) (4) (6) 0 R 2 (5mer.3m) 4 8 2 Supplementary Figure 6: R2 performance for di-mismatch model verse di-mismatch+shape model on upbm data set, k = 3, 4, 5, m = 1,, k 2. (a)(c)(e)(g)(i) Scatter plots of the R2 performance values between the two models. Each dot represents one TF, colored corresponding to its protein family. (b)(d)(f )(h)(j) Barplots of R2 improvements for various protein families. Numbers in the parentheses are the number of DREAM5 TFs in each TF family. The x-axis shows the differences of R2 values between the two models. The length of the bars represent the mean of R2 differences and the error bars mark the standard error of the mean. 12

(a) (b) 20 20 15 15 Count 10 Count 10 5 5 0 0 1mer 2mer 3mer 4mer 5mer 6mer 7mer 2mer.shape 3mer.shape 4mer.shape 5mer.shape 6mer.shape (c) 15 (d) 25 20 Count 10 Count 15 5 10 5 0 3mer.1m 4mer.1m 4mer.2m 5mer.2m 5mer.3m 6mer.1m 6mer.2m 6mer.3m 7mer.2m 7mer.3m 8mer.2m 8mer.3m 3mer.1m.shape 4mer.1m.shape 4mer.2m.shape 5mer.1m.shape 5mer.2m.shape 5mer.3m.shape 6mer.1m.shape 7mer.1m.shape 7mer.2m.shape 7mer.3m.shape 8mer.2m.shape 8mer.3m.shape 0 Supplementary Figure 7: Distribution of best k (and m) parameters in each kernel for the mouse DREAM5 dataset. (a) Histogram of best k-spectrum kernels. (b) Histogram of best k-spectrum+shape kernels. (c) Histogram of best di-mismatch kernels. (d) Histogram of best di-mismatch+shape kernels. 13