Article Predicting Variation of DNA Shape Preferences in Protein-DNA Interaction in Cancer Cells with a New Biophysical Model


Kirill Batmanov and Junbai Wang *

Department of Pathology, Oslo University Hospital Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway; Kirill.Batmanov@rr-research.no
* Correspondence: junbai.wang@rr-research.no

Received: 31 July 2017; Accepted: 13 September 2017; Published: 18 September 2017

Abstract: DNA shape readout is an important mechanism of transcription factor target site recognition, in addition to the sequence readout. Several machine learning-based models of transcription factor-DNA interactions, considering DNA shape features, have been developed in recent years. Here, we present a new biophysical model of protein-DNA interactions by integrating the DNA shape properties. It is based on the neighbor dinucleotide dependency model BayesPI2, where new parameters are restricted to a subspace spanned by the dinucleotide form of DNA shape features. This allows a biophysical interpretation of the new parameters as a position-dependent preference towards specific DNA shape features. Using the new model, we explore the variation of DNA shape preferences in several transcription factors across various cancer cell lines and cellular conditions. The results reveal that there are DNA shape variations at FOXA1 (Forkhead Box Protein A1) binding sites in steroid-treated MCF7 cells. The new biophysical model is useful for elucidating the finer details of transcription factor-DNA interaction, as well as for predicting cancer mutation effects in the future.

Keywords: transcription factors; DNA shape; protein-DNA interaction

1. Introduction

Understanding how transcription factors (TFs) recognize their target DNA binding sites is an important task in the study of gene regulation.
Although a complete model of this process is currently out of reach [1], the growing body of experimental data enables the development of many approximate models to compute TF-DNA binding affinity. There are models that aim at identifying proteins that may bind to DNA based on the protein amino acid sequences (e.g., nDNA-prot [2]) and models that focus on the prediction of protein target sites (e.g., BayesPI2 [3]). The latter are useful in identifying functional TF binding sites, predicting the effects of mutations on gene regulation [4], and elucidating the differences between related TFs [5].

Current approaches to estimate the TF-DNA binding affinity can be broadly divided into two categories. One is the black box approach, which uses powerful machine learning techniques (e.g., support vector machines, random forest, neural network or ensemble methods such as LibD3C [6]) with as many input features as possible to achieve the most accurate affinity predictions [7-10]. The other is the biophysical modeling approach, which derives physical models from first principles by using well-understood approximations in statistical physics [11-13]. Generally, machine learning methods achieve high accuracy by considering many diverse input features, but they do not use domain knowledge. Thus, it is difficult to interpret the results that are learned from training data. Additionally, experimental biases [14,15] may be unintentionally learned by the model parameters, which makes test predictions reliable only for experiments with the same configuration as
Genes 2017, 8, 233; doi: /genes

that of training ones. In this work, we pursue the second approach, biophysical modeling, which produces interpretable model parameters relevant to the TF-DNA interaction mechanisms. The trained model parameters have clear definitions in theory. Due to such advantages, the new biophysical model can be adjusted and reused in different settings, and its parameter values can be compared between models.

Many of the previous biophysical models of TF-DNA interactions do not consider the contribution of dinucleotide dependency in TF binding sites [13,16]. The inferred TF binding target motif is parameterized by a position-specific weight matrix (PWM), which is simple and interpretable. The independent TF binding model is often used to investigate biological phenomena in gene regulation. For example, a PWM-based biophysical model, BayesPI2, has been applied to study subtle transcription factor binding patterns and gene regulation effects in cancer cells [17]. Another similar program, BayesPI-BAR, has been used to predict the significance of mutation effects on TF-DNA binding, which led to the discovery of new regulatory mutations that cause gene dysregulation in follicular lymphoma [18].

In addition to the nucleotide sequence, local DNA structure properties, such as the geometry of the DNA molecule and base stacking energy, are known to affect the protein-DNA interaction [19-21]. Using such properties as input features together with the DNA sequence has improved TF-DNA binding prediction accuracy in several machine learning-based models [9,22-24]. However, until now, there has been no biophysical model incorporating DNA structure information. Several works have considered general dinucleotide dependency features [3,11,12], which implicitly contain DNA shape information [22]. The results have been modest, with some evidence showing that the dinucleotide dependency models do not generalize well to in vivo data [25].
Here, we have developed a new biophysical model that considers DNA structure features as a special form of position-specific dinucleotide dependency. This model has its parameters restricted to a space defined by the dinucleotide combination of DNA structure properties derived from the DiProDB database [26]. We compare the performance of the independent PWM model, the full dinucleotide model and the DNA shape-restricted dinucleotide model on several tasks such as single nucleotide variant (SNV) effect analysis, which requires accurate TF binding affinity prediction. Because the new biophysical model, unlike the machine learning approaches, has interpretable model parameters and training structure, the inferred shape feature preferences can be further studied in various conditions. This gives us an opportunity to investigate the dynamical change of DNA shape preferences in different cancer cell lines.

2. Materials and Methods

2.1. ChIP-Seq Data

The ChIP-seq data used to fit the TF-DNA binding affinity models were downloaded from the Encyclopedia of DNA Elements (ENCODE) project [27]. We use peaks from the uniform peak calling pipeline. The ERα (Estrogen Receptor Alpha) time series ChIP-seq is obtained from Gene Expression Omnibus (GEO), Accession GSE. FOXA1 ChIP-seq for different conditions is obtained from [28] (GEO Accession GSE72249). We take up to 1000 peaks with the highest signal value from each experiment, and for each peak, a 100 bp sequence centered on the called peak is extracted from the hg19 reference genome. These sequences form the positive set of samples, which are assumed to frequently contain the binding motif for the corresponding TF. To fit an affinity model, our methods also need a set of negative samples. This set consists of 100 bp sequences taken randomly from genome regions near transcription start sites (TSS) of known genes (TSS ± 10 kbp).
The peak regions are excluded from the negative set, and the number of sequences in each negative set is the same as the size of the positive set.

2.2. Allele-Specific Binding Data

The dataset of allele-specific TF binding events was downloaded from a recent publication [29]. It was derived from raw reads of ENCODE ChIP-seq experiments for 36 TFs in several cell lines,

where SNVs were first identified and then assessed for possible allele-specific binding based on the number of reads containing each variant.

2.3. BayesPI Transcription Factor-DNA Binding Affinity Model

The basic biophysical model of TF-DNA binding affinity, called BayesPI, was first presented in [13] and later extended in [3] to include the contribution of interactions between neighboring nucleotides in the DNA. The expression for the probability of TF binding to a small DNA segment is:

$$P(S, w, \mu) = \sum_{i=0}^{N-M} \frac{1}{1 + e^{E(S_{i:i+M}, w) - \mu}}$$

where $S_{i,a} = 1$ if the DNA sequence has nucleotide $a$ (one of A, C, G, T) at position $i$ and $S_{i,a} = 0$ otherwise, $N$ is the sequence length, $M$ is the length of the binding motif (the number of consecutive base pairs that affect the affinity), and $\mu$ is the chemical potential of the TF, which is defined by its concentration in the nucleus. The binding energy of the TF to a short DNA fragment with length $M$ bp is represented by $E$, which is the sum of the independent contributions of each nucleotide:

$$E_{indep}(S, w) = \sum_{j=0}^{M-1} \sum_{a=1}^{4} w_{j,a} S_{j,a}$$

The matrix $w \in \mathbb{R}^{M \times 4}$, called the position-specific affinity matrix, specifies these contributions: $w_{j,a}$ is the binding energy of nucleotide $a$ at position $j$ inside the DNA fragment. The binding energy may additionally include the contribution of dinucleotide dependence:

$$E_{dinuc}(S, d) = \sum_{j=0}^{M-2} \sum_{a=1}^{4} \sum_{b=1}^{4} d_{j,(a-1) \cdot 4 + b} S_{j,a} S_{j+1,b}$$

where $d \in \mathbb{R}^{(M-1) \times 16}$ is the matrix of pairwise dependency energy corrections, with $d_{j,(a-1) \cdot 4 + b}$ specifying the correction of the independent energy terms for the dinucleotide at positions $j$ and $j+1$ with nucleotides $a$ and $b$.

2.4. DNA Shape Affinity Model

The TF-DNA binding affinity model that takes into account DNA shape information includes dinucleotide dependencies in a more structured form, following prior information about DNA molecule characteristics, which may be important for the protein-DNA interaction.
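Before the shape-aware form is introduced, the independent and dinucleotide energy terms and the binding probability of the BayesPI model can be illustrated numerically. The following is a minimal sketch, not the BayesPI2 implementation: the function names (`one_hot`, `e_indep`, `e_dinuc`, `binding_probability`) are hypothetical, nucleotides are ordered A, C, G, T, and the dinucleotide column index is the 0-based form `4*a + b` of the index $(a-1) \cdot 4 + b$.

```python
import numpy as np

NUC = "ACGT"  # assumed nucleotide order for the columns of w

def one_hot(seq):
    """Return S with S[i, a] = 1 if seq[i] is nucleotide a, else 0."""
    S = np.zeros((len(seq), 4))
    for i, ch in enumerate(seq):
        S[i, NUC.index(ch)] = 1.0
    return S

def e_indep(S, w):
    """Independent energy: sum over positions j and nucleotides a of w[j, a] * S[j, a]."""
    return float(np.sum(w * S))

def e_dinuc(S, d):
    """Dinucleotide correction for an M x 4 window; d has shape (M-1, 16)."""
    # pair[j, 4*a + b] = S[j, a] * S[j+1, b]
    pair = np.einsum("ja,jb->jab", S[:-1], S[1:]).reshape(len(S) - 1, 16)
    return float(np.sum(d * pair))

def binding_probability(seq, w, mu, d=None):
    """Sum of 1 / (1 + exp(E - mu)) over all N - M + 1 windows of length M."""
    S = one_hot(seq)
    M = w.shape[0]
    total = 0.0
    for i in range(len(seq) - M + 1):
        window = S[i:i + M]
        energy = e_indep(window, w)
        if d is not None:
            energy += e_dinuc(window, d)
        total += 1.0 / (1.0 + np.exp(energy - mu))
    return total

# Toy usage with made-up parameters (not fitted values).
w_toy = np.zeros((4, 4))
print(binding_probability("ACGTAC", w_toy, mu=0.0))
```

With all energies set to zero and $\mu = 0$, every window contributes 0.5, which is a convenient sanity check before plugging in fitted matrices.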
The affinity is modeled as:

$$P(S, F, w, d^f, \mu) = \sum_{i=0}^{N-M} \frac{1}{1 + e^{E_{indep}(S_{i:i+M}, w) + E_{shape}(F_{i:i+M}, d^f) - \mu}}$$

where $F \in \mathbb{R}^{M \times K}$ is the matrix of DNA shape feature values at each position inside the sequence $S$ and $K$ is the number of different DNA shape features considered. $F_{i,j}$ is the value of DNA shape feature $j$ at position $i$, which is a number characterizing a certain property of the DNA molecule at that location. Analogous to the independent and dinucleotide models, DNA shape contributes linearly to the binding energy:

$$E_{shape}(F, d^f) = \sum_{j=0}^{M-1} \sum_{k=1}^{K} d^f_{j,k} F_{j,k}$$

where $d^f \in \mathbb{R}^{M \times K}$ is the matrix of position-specific DNA shape preference of the TF, expressed as a correction to the independent nucleotide preferences. The 2-mer shape features used in this work

come from the DiProDB database [26], which lists thermodynamic, structural and some other properties of DNA and RNA dinucleotides. We use four features: twist, minor groove width, propeller twist and roll. These features are chosen following an earlier DNA shape publication [30], where the same DNA shape characteristics are used in TF-DNA affinity modeling. However, the DNA shape features in [30] are based on 5- and 6-mer sequences, while DiProDB uses dinucleotides only. In the dinucleotide case, $F$ can be expressed as:

$$F_{j,k} = \sum_{a=1}^{4} \sum_{b=1}^{4} D_{k,(a-1) \cdot 4 + b} S_{j,a} S_{j+1,b}$$

where $D \in \mathbb{R}^{K \times 16}$ is the matrix of $K$ DNA shape feature values for each dinucleotide. In our model, $K = 4$. Plugging the expression for $F$ into the definition of $E_{shape}$, we observe that, as it is defined here, the DNA shape energy contribution $E_{shape}$ and the dinucleotide dependency energy contribution $E_{dinuc}$ are linearly related:

$$E_{shape}(F, d^f) = E_{dinuc}(S, d^f D)$$

In other words, the DNA shape model is a dinucleotide dependency model whose coefficients are restricted to a linear subspace spanned by the shape features $D$. This is a useful property that allows efficient computation of affinity without the need to compute DNA shape features explicitly for each sequence. Furthermore, any software that works with dinucleotide models, such as BayesPI2, can now be used to work with DNA shape models by expanding the $d^f$ coefficients into full dinucleotide coefficients $d = d^f D$.

2.5. Bayesian Inference of Model Parameters for the DNA Shape-Restricted Dinucleotide Dependence Model

The parameters of TF-DNA affinity models ($w$, $\mu$ and $d^f$) are fitted to the experimental data, such as ChIP-seq peaks or protein binding microarray probes, using gradient descent, starting from a randomized initial seed. Because the experimental data typically contain measurement noise, a proper regularization of the model is important [31].
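The linear relation between the shape energy and the dinucleotide energy stated in Section 2.4 can be checked numerically. The sketch below is an illustration under stated assumptions: the shape table `D` is filled with random placeholder numbers rather than real DiProDB values, the helper `dinuc_indicator` is hypothetical, and the preference matrix is given one row per dinucleotide step (M-1 rows) so that it matches the shape of the dinucleotide matrix.

```python
import numpy as np

NUC = "ACGT"  # assumed nucleotide order

def dinuc_indicator(seq):
    """Row j is the length-16 one-hot code (index 4*a + b) of the dinucleotide at steps j, j+1."""
    idx = [4 * NUC.index(x) + NUC.index(y) for x, y in zip(seq[:-1], seq[1:])]
    P = np.zeros((len(idx), 16))
    P[np.arange(len(idx)), idx] = 1.0
    return P

rng = np.random.default_rng(0)
K = 4                                      # number of shape features
D = rng.normal(size=(K, 16))               # placeholder table, NOT real DiProDB values
seq = "ACGTTGCA"                           # one motif-length window (M = 8)
d_f = rng.normal(size=(len(seq) - 1, K))   # assumed: one row per dinucleotide step

P = dinuc_indicator(seq)
F = P @ D.T                                # F[j, k]: shape feature k at step j
e_shape = float(np.sum(d_f * F))           # shape energy computed from explicit features

d = d_f @ D                                # expanded dinucleotide coefficients, (M-1) x 16
e_dinuc = float(np.sum(d * P))             # same energy from the dinucleotide form

print(np.isclose(e_shape, e_dinuc))
```

Because the expansion `d_f @ D` is done once per model rather than once per sequence, scoring many sequences only needs the dinucleotide indicators, which is the efficiency argument made in the text.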
The regularization is done by applying Bayesian inference to the L2 regularization hyperparameters, hence the name BayesPI. The Bayesian inference proceeds in iterations of sequential estimation of model parameters, such as $w$, $\mu$ and $d^f$, and hyperparameters, which are the L2 regularization coefficients $\alpha_i$ and the error scaling coefficient $\beta$. When fitting the model parameters, the gradient descent is performed on the following regularized cost function:

$$L = \beta L_D(w, d^f, \mu, a, b) + \sum_i \alpha_i W_i^2$$

$$L_D(w, d^f, \mu, a, b) = \frac{1}{2} \sum_i \left( (a P(S_i, w, \mu, d^f) + b) - t_i \right)^2$$

where $t_i$ is the target value (e.g., normalized ChIP-seq tag count) for sequence $S_i$, $a$ and $b$ are coefficients of the linear transformation applied to the binding probability in order to match its distribution to the distribution of $t_i$, and $W$ is the sequence of all model parameters: $w$, $d^f$, $\mu$, $a$ and $b$. On the first iteration, the regularization hyperparameters $\beta$ and $\alpha_i$ are set to constant values corresponding to a weak regularization. After the gradient descent has converged, the hyperparameter values are updated using the following formulas:

$$\alpha_i = \frac{\gamma_i}{W_i^2}$$

$$\beta = \frac{N - \sum_i \gamma_i}{2 L_D}$$

$$\gamma_i = 1 - \alpha_i H^{-1}_{ii}$$

where $N$ is the number of sequences and $H = J(\nabla L)^T$ is the Hessian matrix of the cost function, evaluated at the converged values of model parameters, and $J$ is the Jacobian matrix (see the Supplementary Methods for the formulas to compute the Hessian). Since $\gamma_i$ and $H$ depend on $\beta$ and $\alpha_i$, the hyperparameter update is also performed iteratively, repeatedly recomputing values of $\gamma_i$ and $H$, followed by recomputation of $\beta$ and $\alpha_i$, until convergence. The values of model parameters $W$ are kept fixed during the hyperparameter update computation. After the hyperparameters have been updated, we again perform the gradient descent on the model parameters, starting from their values on the previous iteration. The whole process is repeated several times, usually converging in 3-10 iterations. The derivation of this algorithm, based on Bayesian treatment of the uncertainty of the hyperparameters, is given in [32].

In BayesPI, the $\alpha_i$ hyperparameters are grouped together and their values shared between classes. All weights of the independent part of the model, $w$, share a single $\alpha_w$ (please refer to the Supplementary Methods for computational details of this procedure). The dinucleotide DNA shape feature weights $d^f_{p,k}$ at the same position $p$ also share the same regularization weight $\alpha_{d,p}$. The reason is that the information content of the motif is distributed unequally across positions, with some positions being important while others affect the affinity weakly. To avoid the overfitting of larger shape model parameters, they should be more strongly regularized at unimportant positions. This is achieved automatically by the Bayesian regression procedure. Since all dinucleotide weights at the same position share a single hyperparameter value, they will all be regularized with the same strength.
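The iterative hyperparameter update with grouped regularization weights can be sketched as follows. This is a hedged illustration of a standard evidence-framework update, not the BayesPI code: it assumes a regularizer of the form ½ α W² per parameter (the exact scaling used in BayesPI may differ), approximates the Hessian as `beta * H_D + diag(alpha)` where `H_D` is the Hessian of the data term, and uses a hypothetical function name and argument layout.

```python
import numpy as np

def update_hyperparameters(W, group, H_D, L_D, N, alpha, beta, n_iter=100, rtol=1e-10):
    """Iterate the gamma/alpha/beta updates with the model parameters W held fixed.

    W      : converged model parameters, shape (P,)
    group  : group[i] is the index of the alpha shared by parameter i
    H_D    : Hessian of the data error L_D at W, shape (P, P)  (assumed given)
    L_D    : value of the data error at W
    N      : number of training sequences
    alpha  : initial regularization weights, one per group
    beta   : initial error scaling coefficient
    """
    W = np.asarray(W, dtype=float)
    group = np.asarray(group, dtype=int)
    alpha = np.array(alpha, dtype=float)
    gamma = np.zeros_like(W)
    for _ in range(n_iter):
        # Hessian of the full cost under the assumed 1/2 * alpha * W^2 regularizer.
        H = beta * H_D + np.diag(alpha[group])
        gamma = 1.0 - alpha[group] * np.diag(np.linalg.inv(H))
        new_alpha = alpha.copy()
        for c in range(len(alpha)):
            members = group == c
            # Parameters in one group pool their gammas and squared weights.
            new_alpha[c] = gamma[members].sum() / np.sum(W[members] ** 2)
        new_beta = (N - gamma.sum()) / (2.0 * L_D)
        converged = (np.allclose(new_alpha, alpha, rtol=rtol)
                     and np.isclose(new_beta, beta, rtol=rtol))
        alpha, beta = new_alpha, new_beta
        if converged:
            break
    return alpha, beta, gamma

# Toy usage with synthetic numbers (not real model output).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
alpha, beta, gamma = update_hyperparameters(
    W=[0.5, -1.0, 2.0], group=[0, 0, 1], H_D=X.T @ X,
    L_D=10.0, N=50, alpha=[0.01, 0.01], beta=1.0)
print(alpha, beta, gamma)
```

Each `gamma` entry lies between 0 and 1 and can be read as how well the data determine that parameter; groups whose parameters are poorly determined end up with a larger `alpha`, which is the position-dependent shrinkage described above.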
Position-based regularization allows restricting the dependency parameters at positions where the dependency does not seem to contribute to the affinity, while letting the parameters grow at more important positions. The model is typically fit in two stages: first, the independent parameters $w$ are fit, setting $d^f = 0$. Then, $w$ is fixed, and the dependent parameters $d^f$ are fit. Thus, the dinucleotide shape-restricted dependency parameters $d^f$ should be viewed as adjustments of the independent model, in places where dependencies are significant. Setting $d^f = 0$ recovers the original independent model.

2.6. Mutation Effect Prediction by Using Various Biophysical Models

We use the BayesPI-BAR [4] method to evaluate how a DNA variant affects TF binding. BayesPI-BAR computes a score called shifted differential binding affinity (δdba) for each variant and TF. δdba measures the difference in above-background binding strength between the reference and alternate sequences, which is equivalent to measuring the effect of the sequence variant on TF binding. If δdba > 0, then there is an increase of TF binding affinity in the alternate sequence compared to that in the reference sequence: creation of a new TF binding site or strengthening of an existing weak binding site. If δdba < 0, then an existing TF binding site is disrupted by the variant. In the previous works, we only considered the independent binding affinity model for TFs in BayesPI-BAR. Here, we use the independent, full dinucleotide dependence, and newly developed DNA shape-restricted dinucleotide dependence models to evaluate the prediction of the mutation effect in cancer cells.

2.7. Code Availability

The new BayesPI2Shape software, which contains the independent, dinucleotide dependence, and DNA shape-restricted dinucleotide dependence models (with preselected DNA shape features and the visualization function), is available from

3. Results

3.1. Validation of Inferred DNA Shape Feature Preferences for Protein-DNA Interaction

The main advantage of the new shape-restricted TF-DNA affinity model is the interpretability of the model parameters. The matrix $d^f$ models the adjustments of the binding energy due to the preferred DNA shape feature at a specific TF binding position: a positive value of $d^f_{j,k}$ indicates that the higher the value of shape feature $k$ at position $j$, the higher the TF binding affinity, whereas a negative value indicates that a low value of the shape feature is preferred at that position. Thus, we call $d^f$ the (DNA) shape preference matrix.

To validate the fitted shape preferences, we tested the new method on three TFs (Serum Response Factor (SRF), Myocyte Enhancer Factor 2C (MEF2C), and TATA-Box Binding Protein (TBP)) for which the DNA shape preferences were previously reported. In this test, we use PWMs from JASPAR and fit only $d^f$ using the ChIP-seq datasets from ENCODE. In Figure 1A, a heatmap of the shape feature preferences of SRF is shown. The most pronounced shape feature preference is for higher propeller twist at position 6, which matches the findings in [9]. The exact position may be shifted a few base pairs compared to Figure 7 in [9], because 5- and 6-mer DNA shape features are used in the previous work, while we use only dinucleotide shape features.

Figure 1. Shape feature preferences for Serum Response Factor (SRF), Myocyte Enhancer Factor 2C (MEF2C), and TATA-Box Binding Protein (TBP). The heatmaps show the preference for each shape feature at each binding position (the inferred $d^f$ matrix), with the preference for low and high feature values in blue and orange, respectively. The shape preferences are split into positive and negative parts accordingly. Grey color represents weaker preference. The position-specific weight matrix (PWM) logos of transcription factors (TFs) are shown above the shape preference heatmaps.
In Figure 1B, the heatmap for MEF2C is shown, where propeller twist and roll have strong preferences in the current prediction. These two shape features are also the most important features for MEF2C reported in [9], although their locations and strengths are slightly different. It is noteworthy that in [9], the relative feature importance score given by the random forest model was used to predict DNA shape feature preferences. This score cannot be compared across different models. On the other hand, our biophysically modeled DNA shape preferences $d^f$ directly reflect how TF binding affinities change with the DNA shape features. In Figure 1C, our predicted TBP shape feature preferences are shown, where minor groove width is the main dependency term. The result is consistent with the fact that TBP recognizes its target sequence by binding to the minor groove [33].

3.2. ChIP-Seq Peak Prediction

Here, we evaluate the performance of three affinity models (the independent model, the full dinucleotide model and the DNA shape-restricted dinucleotide model) to predict ChIP-seq peaks,

using the ENCODE data. First, three independent models with motif sizes of 10, 15 and 20 were fitted for each of 36 TFs. We retain only the models that fit the training data well: the $r^2$ of prediction for the raw signal is greater than 0.4, and the area under the receiver operating characteristic curve (AUC) is greater than 0.75 for distinguishing true peak sequences from the background ones. Then, we fit the dinucleotide interaction models and DNA shape models as corrections to the independent models on the same training data. The prediction results on an independent test set are shown in Figure 2. The AUC of each independent model is plotted against the AUC of the same model with a dependency correction. Both the dinucleotide dependent and the DNA shape-restricted models improve the accuracy of peak prediction significantly compared with the independent model. In the current study, the full dinucleotide model (Figure 2A) usually has much higher accuracy than the DNA shape model (Figure 2B).

The improvement of TF binding motif prediction by using the nucleotide dependent models (e.g., dinucleotides, k-mers and DNA shape features), on a homogeneous dataset such as ChIP-seq or protein binding microarray data, has been demonstrated before with black box machine learning techniques [9,12,25]. Here, we show that it also holds true for the new biophysical model. Figure 2 indicates that the full dinucleotide model is superior to the DNA shape-restricted dinucleotide model. However, such outcomes need to be considered with caution, because an experiment such as ChIP-seq is known to contain biases in the measurements. These biases do not reflect the underlying true biochemical processes, but may stem from a systematic error in the experimental procedure. Such errors may be learned by a powerful machine learning method.
For example, the full dinucleotide model may learn a spurious dependency originating from an experiment-specific bias, which leads to the model only performing well in data generated from the same type of experiments. To fully assess the usefulness of a new model, it needs to be validated by different types of experimental datasets (e.g., a model trained on in vivo data, but tested on in vitro data).

Figure 2. ChIP-seq peak prediction is improved by the dependency models. For each of the 172 ChIP-seq datasets for 36 TFs, the area under the receiver operating characteristic curve (AUC) of the independent model is compared to (A) the full dinucleotide model, which has a separate tunable parameter for each of 16 possible dinucleotides at each position, and (B) the DNA shape-restricted dinucleotide model, which has a separate tunable parameter for each shape feature (e.g., four DNA shape features in the current study) at each position.

3.3. Investigating the Variation of Predicted DNA Shape Preferences across Cell Types

The parameters of the new DNA shape model have a straightforward interpretation, similar to that of the dinucleotide dependency model: each $d^f_{j,k}$ is the weight of the preference of the TF for

the shape feature $k$ at position $j$, in addition to that defined by the independent model weights $w$. This enables an easy comparison of fitted shape parameters between different conditions. Here, we explore the variation of fitted shape model parameters across different cancer cell lines. Our model is based on the hypothesis that the DNA shape readout by a TF depends only on the TF itself and the local DNA sequence. This assumption may be violated in vivo due to epigenetic DNA modifications [23,34], nucleosome positioning in the DNA sequence [35] or because of the presence of cofactors [20]. In order to assess the influence of these external factors on the shape model fitting procedure, we compared shape features predicted in different cell lines and different conditions for the same TF.

We use canonical PWMs from the JASPAR web database [36] for 23 TFs, where ENCODE ChIP-seq datasets are available for more than one cell line. For each TF, we fit the shape model parameters for each ChIP-seq dataset separately and compute the correlation of the fitted shape preference parameters between different cell lines. The distribution of the median correlation coefficients across TFs is shown in Figure 3A. Three of the TFs (Basic Helix-Loop-Helix Family Member E40 (BHLHE40), MYC Associated Factor X (MAX), and JunD Proto-Oncogene (JUND)) have very low shape feature correlations between the conditions (e.g., <0.2). The PWM that we used for JUND performs poorly at distinguishing the true peaks from the background ones (low AUC). This suggests that if the independent model cannot predict TF binding well, then the result of the shape model is also unreliable. On average, the correlation of predicted shape preferences among different cell lines is high, with a median of around 0.7. The shape preference seems stable in in vivo conditions.
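The per-TF comparison of fitted shape preferences between cell lines can be sketched as below. This is an assumption-laden illustration: the source does not state which correlation coefficient is used, so Pearson correlation on the flattened preference matrices is assumed, and the function name and toy matrices are hypothetical.

```python
import numpy as np
from itertools import combinations

def median_pairwise_correlation(pref_matrices):
    """Median Pearson correlation over all pairs of flattened shape preference matrices.

    pref_matrices: list of arrays with identical shape (positions x shape features),
    one per cell line or condition for the same TF.
    """
    flat = [np.asarray(m, dtype=float).ravel() for m in pref_matrices]
    cors = [np.corrcoef(x, y)[0, 1] for x, y in combinations(flat, 2)]
    return float(np.median(cors))

# Toy usage: three made-up preference matrices for one TF (not fitted values).
rng = np.random.default_rng(2)
base = rng.normal(size=(9, 4))
cell_lines = [base + 0.3 * rng.normal(size=base.shape) for _ in range(3)]
print(median_pairwise_correlation(cell_lines))
```

Using the median rather than the mean keeps a single outlying dataset from dominating the summary for a TF that has many ChIP-seq experiments.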
In Figure 3B,C, the fitted shape feature variations in three conditions are given for E74 Like ETS Transcription Factor 1 (ELF1) (median correlation 0.7) and Upstream Transcription Factor 2 (USF2) (median correlation 0.67), respectively, where each shape model matrix $d^f$ is displayed by a heatmap (positive preferences). The negative preferences are shown in Supplementary Figure S2A,B. In Figure 4, MAX and BHLHE40 have nearly identical PWMs with very pronounced 6 bp core motifs, and the independent parts of the binding affinity models are very similar, though the inferred shape preferences for MAX and BHLHE40 differ significantly across cell lines (see Supplementary Figure S1 for the positive shape preferences). Thus, the predicted shape preferences may be used to distinguish the true binding sites between MAX and BHLHE40.

Figure 3. Shape model parameters in different conditions. (A) Distribution of median correlation coefficients for 23 TFs with multiple ChIP-seq datasets. (B) Shape feature preferences for ELF1 in three cell lines. The heatmaps show the preference for each shape feature at each position (the inferred $d^f$ matrix). Only the positive preferences are shown, that is, the preferences towards higher values of shape features. Colors faded to grey mean weaker preference. (C) Shape feature

preferences for USF1 in three cell lines. The PWM logos are shown above the predicted shape preference heatmaps.

Figure 4. Predicted shape preferences of MYC Associated Factor X (MAX) and Basic Helix-Loop-Helix Family Member E40 (BHLHE40) in three cell lines. The heatmaps show the preference for each shape feature at each position (the inferred $d^f$ matrix). Only the negative preferences are shown here, that is, the preferences towards lower values of shape features. Colors faded to grey mean weaker preference. The PWM logos are shown above the shape preference heatmaps.

3.4. Variation of DNA Shape Preferences across Cellular Conditions

Changing cellular conditions may result in changing gene expression, which is caused by fine-tuning of TF-DNA interaction patterns. This adjustment is tightly controlled and thus may be reflected in the DNA shape feature preferences of affected TFs. Our new model allows identifying such changes by inferring shape preferences from ChIP-seq data obtained under different conditions. Here, we test the new model on the estradiol (E2)-treated MCF7 breast cancer cell line. It has been observed that the response to treatment includes chromatin reorganization [37], which prompts the investigation of possible DNA shape preference changes in the TFs involved. We used two public ChIP-seq datasets (GEO GSE94023 and GSE72249 [28]) to study this system.

In Figure 5A (and Supplementary Figure S3A for negative preferences), the evolution of shape preferences of estrogen receptor (ERα/ESR1) is shown at different time points after the treatment of MCF7 cells with E2. After E2 stimulation, ERα enters the nucleus and binds to specific sequences in the DNA called estrogen response elements. The shape preferences of ERα remain relatively stable across time. After 320 minutes, the shape preference becomes weak, likely due to reduced signal, as there are far fewer strong peaks.
FOXA1 is a pioneer factor, that is, it can bind to condensed chromatin and make it accessible to other TFs. As such, it must be precisely targeted to particular DNA sequences. It is known that its binding patterns are affected by the presence of other TFs, in particular ER and the glucocorticoid receptor (GR) [28]. These interactions are critically important for tumorigenesis of breast and prostate cancers [38]. Thus, it is interesting to see whether the change of FOXA1 binding affinity can be observed in the DNA shape preferences. In Figure 5B (see Supplementary Figure S3B for positive preferences), the inferred shape preferences of FOXA1 are shown in three conditions of MCF7 cells: untreated, treated with dexamethasone (Dex, a glucocorticoid) and treated with E2. Shape preferences in both treated conditions are similar to each other and slightly different from that

in the untreated condition, with additional preferences appearing at dinucleotide 8. This indicates that the modulation of FOXA1 binding by ER and GR involves changes of DNA shape.

Figure 5. Condition-specific DNA shape preferences in protein-DNA interaction. (A) Variation of inferred shape preferences of ERα at different time points after the treatment with estradiol (E2) in the MCF7 cell line. (B) Variation of inferred shape preferences of FOXA1 after the treatment with either dexamethasone (Dex) or E2 in the MCF7 cell line. The heatmaps show the preference for each shape feature at each position (the inferred $d^f$ matrix). Only the positive preferences are shown in (A), and only the negative preferences are shown in (B). Colors faded to grey mean weaker preference. The PWM logos are shown above the shape preference heatmaps.

3.5. Allele-Specific Binding Prediction

We used the BayesPI-BAR algorithm to study whether changes in TF binding affinity caused by SNVs can be predicted. The SNVs used in this study were inferred from the ChIP-seq data, by analyzing the raw reads for the presence of allele-specific binding (ASB) events [29]. There are 36 datasets, one per TF, with SNVs that are marked as either ASB (the TF binding is affected by the SNV) or non-ASB (no significant effect is observed). The task is to predict whether an SNV causes ASB or not, given the reference and alternate DNA sequence. BayesPI-BAR can solve the problem by predicting the TF binding affinity change between the two sequences with the known TF affinity model. The higher the absolute value of the predicted change, the greater the chance of an ASB event.

We first tested the accuracy of baseline PWM models. BayesPI-BAR uses several alternative PWMs for the same TF simultaneously, in which case the predicted δdba scores for each PWM are averaged. We used several sets of PWMs in this test: (1) 26 PWMs from the JASPAR database, one

for every TF where available; (2) 129 PWMs for 28 TFs from the database of 1772 PWMs that was used in the original BayesPI-BAR publication; (3) 112 PWMs for 33 TFs inferred by the BayesPI2 motif discovery program based on ChIP-seq datasets with motif sizes of 10, 15 and 20. The results of these tests are shown in Figure 6. We use the area under the precision-recall curve (AUPRC) as the measure of prediction performance due to class imbalance in the ASB dataset. The median AUPRC of BayesPI-BAR is 0.32, which is comparable to the accuracy of other machine learning approaches in [29] (e.g., 0.35). However, it is significantly higher than the value previously reported for BayesPI-BAR in [29]. Such a performance discrepancy of BayesPI-BAR on the same datasets may be caused by the misuse of principal component analysis (PCA) scores in the earlier work [29], instead of the mean δdba scores used in the current work.

Figure 6. Allele-specific binding (ASB) prediction accuracy by various models. Distribution of the area under the precision-recall curve (AUPRC) for predictions of ASB events, for BayesPI-BAR, deltaSVM and a random forest-based sequence model from [29]. The PWM sets used in BayesPI-BAR are: 26 PWMs from the JASPAR database; the relevant PWMs from the set of 1772 human TF PWMs in the original database supplied with BayesPI-BAR; a set of PWMs that could be successfully inferred by the BayesPI motif discovery program using the Encyclopedia of DNA Elements (ENCODE) ChIP-seq data. Data for models marked with a star are reproduced, approximately, from [29].

After confirming that the performance of BayesPI-BAR is satisfactory for the ASB data by considering the independent model only, we compared the performance between the independent and dependent models. For each independent model of a TF, we infer the dependency correction parameters (either the full dinucleotide matrix or the shape feature preferences) using ChIP-seq datasets.
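The scoring scheme used for ASB prediction (average the δdba scores over the PWMs of a TF, rank SNVs by the absolute value, and summarize with AUPRC) can be sketched as follows. This is a minimal sketch under assumptions: the AUPRC is estimated by average precision without special handling of tied scores, the function names are hypothetical, and the numbers in the usage example are made up rather than taken from the ASB dataset.

```python
import numpy as np

def asb_scores(delta_dba):
    """Average delta-dBA over PWMs (rows) and take the absolute value per SNV (columns)."""
    return np.abs(np.mean(np.asarray(delta_dba, dtype=float), axis=0))

def auprc(scores, labels):
    """Average-precision estimate of the area under the precision-recall curve.

    scores: higher means more likely ASB; labels: 1 for ASB, 0 for non-ASB.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(labels, dtype=int)[order]
    true_pos = np.cumsum(y)
    precision = true_pos / np.arange(1, len(y) + 1)
    # Mean of the precision values at the rank of each true ASB event.
    return float(np.sum(precision * y) / y.sum())

# Toy usage: two PWMs for one TF, four SNVs (made-up scores and labels).
delta = [[0.9, -0.1, 0.8, -0.3],
         [0.7, -0.1, 0.8, -0.5]]
labels = [1, 0, 0, 1]
print(auprc(asb_scores(delta), labels))
```

For a random ranking the expected value of this measure is close to the fraction of ASB events, which is why it is more informative than AUC when ASB events are rare.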
The results for the available 36 TFs are summarized in Figure 6 and Supplementary Figure S4. In testing the dependency models, we use the same set of independent models and add the dependency energy terms when they are available. However, neither the full dinucleotide model nor the DNA shape-restricted model consistently improves ASB prediction accuracy over that of the independent model: prediction accuracy improves markedly for some TFs but decreases for others. Thus, there is no advantage in using dependent models for ASB event prediction in the current study.

4. Discussion

We have developed a new biophysical TF-DNA interaction model that takes into account DNA shape-restricted dinucleotide dependencies. The new model restricts the parameter space of the dinucleotide dependencies to a reduced subspace, which considers only the dinucleotide DNA shape properties. This makes the model biophysically interpretable, so that the TF-DNA interaction mechanism can be investigated in various circumstances by examining the inferred model parameters. It has been previously reported that models including nucleotide dependency parameters can fit datasets (e.g., protein binding microarray probes, systematic evolution of ligands by exponential enrichment (SELEX) sequences and ChIP-seq peaks) better than independent models. However, the question of whether the learned dependency features represent the actual TF binding preference or capture subtle experimental biases remains open. In the DREAM5 TF-DNA Motif Recognition Challenge [25], the biophysical models including dinucleotide interactions generally performed better than the independent ones when trained and tested on the same type of data, such as the same in vitro experiment. In contrast, the relative ranking of the independent and the dependent models was reversed when the models were trained on in vitro datasets but tested on in vivo ones. This problem was later reported to be solved by a new dinucleotide model (FeatureREDUCE) that uses a refined regularization procedure [11]. However, that conclusion is based on only one TF.
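The subspace restriction described at the start of this section can be illustrated with a short Python sketch. It assumes that the dependency energy of a dinucleotide at a given position is the sum, over shape features, of the dinucleotide's shape value multiplied by a position-specific preference. The names `expand_preferences`, `dependency_energy` and `parameter_counts` are hypothetical, and any shape table passed in would be supplied by the user; this is an illustration of the idea, not the BayesPI2 code.

```python
from itertools import product
from typing import Dict, List, Tuple

# The 16 dinucleotides: AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def expand_preferences(
    shape_table: Dict[str, List[float]], prefs: List[List[float]]
) -> Dict[str, List[float]]:
    """Map shape preferences to a full dinucleotide-by-position energy matrix.

    shape_table[dinuc][f] is the value of shape feature f for a dinucleotide;
    prefs[f][pos] is the preference for feature f at dinucleotide position pos.
    """
    n_features = len(prefs)
    n_positions = len(prefs[0])
    return {
        dinuc: [
            sum(shape_table[dinuc][f] * prefs[f][pos] for f in range(n_features))
            for pos in range(n_positions)
        ]
        for dinuc in DINUCLEOTIDES
    }


def dependency_energy(
    seq: str, shape_table: Dict[str, List[float]], prefs: List[List[float]]
) -> float:
    """Sum the position-specific energies of all adjacent dinucleotides in seq.

    seq must contain exactly one more base than there are dinucleotide positions.
    """
    energies = expand_preferences(shape_table, prefs)
    return sum(energies[seq[pos:pos + 2]][pos] for pos in range(len(seq) - 1))


def parameter_counts(n_positions: int, n_features: int) -> Tuple[int, int]:
    """Free parameters of the full dinucleotide matrix vs. the shape-restricted one."""
    return 16 * n_positions, n_features * n_positions
```

Under these assumptions, a motif with 9 dinucleotide positions and 4 shape features would need 36 shape preference parameters instead of 144 full dinucleotide parameters, and each parameter can be read as a preference for one shape feature at one position.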
Here, we have used ChIP-seq datasets to fit dinucleotide dependency models by a Bayesian regression procedure, which aims to estimate the model parameters robustly when the input data contain noise. Training on in vivo data allows the model parameters to be inferred under the true conditions of TF-DNA binding. For the new shape model, we compared the inferred shape preferences of a few TFs with previously reported results and found a reasonable match between the two (Figure 1). The new biophysical model with DNA shape preference features improves the accuracy of ChIP-seq peak prediction compared to the baseline independent model (Figure 2). We also examined whether DNA shape preferences are largely independent of the cell type, by applying the new DNA shape model to ENCODE ChIP-seq data from various cell lines (Figure 3). In some cases, for example for the TFs MAX and BHLHE40, the inferred shape preferences changed between conditions (Figure 4). These two TFs can bind DNA either individually or as heterodimers with other TFs [39,40], which may explain the variability of their shape preferences under different conditions. Although the core PWM models of MAX and BHLHE40 are very similar, their predicted shape preferences are quite different, which may be used to distinguish their true binding target sites. Next, we explored the possibility of dynamic changes of DNA shape preferences in TF-DNA interactions under various conditions, such as the response of MCF7 breast cancer cells to steroid treatment. In the current work, the inferred shape preferences of ERα change little between time points after the E2 treatment, except for the last few time points, where the number of ChIP-seq peaks is reduced significantly (Figure 5A).
Nevertheless, FOXA1 gained new DNA shape dependencies after E2 treatment, and the same additional shape dependencies were observed after treatment with Dex (Figure 5B). Thus, the target binding sites of FOXA1 have slightly different DNA shape preferences under these conditions, which suggests that TF cofactors are involved in FOXA1 binding and may influence the DNA geometry differently during the interaction between ER, GR and FOXA1. The new shape model gives us an opportunity to study the intricate TF-DNA interaction patterns in detail and to find a possible role of DNA shape in gene regulation.

Finally, we used the BayesPI-BAR framework to assess the power of the DNA shape model in predicting genomic mutation effects. It was tested on an ASB dataset derived from ENCODE ChIP-seq data, on which BayesPI-BAR with various nucleotide dependency models performed reasonably well (Figure 6). It is worth noting that a clear understanding of an in silico prediction program is needed before applying it to any computational biology problem. For example, based on the same ASB data, a previous study reported a much lower prediction accuracy for BayesPI-BAR than for the machine learning methods [29], whereas our reanalysis shows that BayesPI-BAR achieves an accuracy similar to that of the other methods (Figure 6). A possible reason is that the authors of [29] used a post-processed TF ranking score (principal component analysis (PCA) scores) as the indicator of effect size (TF binding affinity changes). Here, we used the mean δdba (differential binding affinity) as a direct measure of TF binding changes. The δdba can be reused in the testing set or compared between different conditions, which is not the case for PCA scores. Nevertheless, the proposed new shape model did not, on average, improve the prediction accuracy of the mutation effect on TF-DNA binding over that of the independent model, as evidenced by the ASB data. This may be due to a limitation of the present model (e.g., BayesPI-BAR does not consider the geometry changes of the DNA molecule when it is wrapped around a nucleosome). Generally, nucleosome core particles interact with the DNA, bending it nonuniformly and in a sequence-dependent manner, which affects the DNA shape features [35]. The DNA shape features in turn may influence the preferential location of nucleosomes [41]. These interdependencies between the DNA shape and nucleosomes, the dynamic positioning of nucleosomes, the TF-specific interaction with nucleosomes and the lack of reliable information on nucleosome positions make it difficult to build a biophysical model that accounts for nucleosome effects in the TF-DNA interaction.
Thus, the nucleosome-related adjustments to the DNA shape are treated as noise in the current model. Overcoming this limitation will require a future study with more genomic data and a refined model. In particular, a more systematic investigation is needed to assess the contribution of DNA shape in regulatory mutation studies.

In conclusion, the new DNA shape-enhanced biophysical model enables the investigation of additional aspects of TF-DNA binding. Given the encouraging results in ChIP-seq peak prediction, further development of the nucleotide dependency models, including the DNA shape preferences model, is a promising direction for future research. Such models, if they can substantially improve the prediction accuracy of mutation effects, would lead to a better understanding of mutation-induced genome dysregulation in diseases such as cancer.

Supplementary Materials: The following are available online. Figure S1: Fitted shape preferences of MAX and BHLHE40 for different cell lines (positive part). Figure S2: Shape model parameters in different conditions (negative part). Figure S3: Changes in shape preferences. Figure S4: ASB prediction of the full dinucleotide and shape-restricted models compared to the independent model.

Acknowledgments: This work was supported by the Norwegian Cancer Society (DNK , DNK , and DNK ); the South-Eastern Norway Regional Health Authority (HSØ ); and the Norwegian Research Council NOTUR project (nn4605k).

Author Contributions: K.B. carried out data analysis, contributed tools for the study and drafted the manuscript. J.B.W. conceived of and supervised the study, participated in the data analysis, contributed tools and helped write the manuscript. All authors read and approved the contents of the final version of the manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Slattery, M.; Zhou, T.; Yang, L.; Dantas Machado, A.C.; Gordan, R.; Rohs, R.
Absence of a simple code: How transcription factors read the genome. Trends Biochem. Sci. 2014, 39.
2. Song, L.; Li, D.; Zeng, X.; Wu, Y.; Guo, L.; Zou, Q. nDNA-prot: Identification of DNA-binding proteins based on unbalanced classification. BMC Bioinform. 2014, 15.
3. Wang, J. Quality versus accuracy: Result of a reanalysis of protein-binding microarrays from the DREAM5 challenge by using BayesPI2 including dinucleotide interdependence. BMC Bioinform. 2014, 15.
4. Wang, J.; Batmanov, K. BayesPI-BAR: A new biophysical model for characterization of regulatory sequence variations. Nucleic Acids Res. 2015, 43, e147.
5. Abe, N.; Dror, I.; Yang, L.; Slattery, M.; Zhou, T.; Bussemaker, H.J.; Rohs, R.; Mann, R.S. Deconvolving the recognition of DNA shape from sequence. Cell 2015, 161.
6. Lin, C.; Chen, W.; Qiu, C.; Wu, Y.; Krishnan, S.; Zou, Q. LibD3C: Ensemble classifiers with a clustering and dynamic selection strategy. Neurocomputing 2014, 123.
7. Ghandi, M.; Lee, D.; Mohammad-Noori, M.; Beer, M.A. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 2014, 10.
8. Mathelier, A.; Wasserman, W.W. The next generation of transcription factor binding site prediction. PLoS Comput. Biol. 2013, 9.
9. Mathelier, A.; Xin, B.; Chiu, T.P.; Yang, L.; Rohs, R.; Wasserman, W.W. DNA shape features improve transcription factor binding site predictions in vivo. Cell Syst. 2016, 3.
10. Alipanahi, B.; Delong, A.; Weirauch, M.T.; Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015, 33.
11. Riley, T.R.; Lazarovici, A.; Mann, R.S.; Bussemaker, H.J. Building accurate sequence-to-affinity models from high-throughput in vitro protein-DNA binding data using FeatureREDUCE. eLife 2015, 4.
12. Zhao, Y.; Ruan, S.; Pandey, M.; Stormo, G.D. Improved models for transcription factor binding site identification using nonindependent interactions. Genetics 2012, 191.
13. Wang, J.; Morigen. BayesPI - A new model to study protein-DNA interactions: A case study of condition-specific protein binding parameters for Yeast transcription factors. BMC Bioinform. 2009, 10.
14. Ramachandran, P.; Palidwor, G.A.; Perkins, T.J. BIDCHIPS: Bias decomposition and removal from ChIP-seq data clarifies true binding signal and its functional correlates. Epigenetics Chromatin 2015, 8.
15. Orenstein, Y.; Shamir, R. A comparative analysis of transcription factor binding models learned from PBM, HT-SELEX and ChIP data. Nucleic Acids Res. 2014, 42.
16. Zhao, Y.; Granas, D.; Stormo, G.D. Inferring binding energies from selected binding sites. PLoS Comput. Biol. 2009, 5.
17. Wang, J.; Malecka, A.; Trøen, G.; Delabie, J. Comprehensive genome-wide transcription factor analysis reveals that a combination of high affinity and low affinity DNA binding is needed for human gene regulation. BMC Genom. 2015, 16 (Suppl. 7).
18. Batmanov, K.; Wang, W.; Bjørås, M.; Delabie, J.; Wang, J. Integrative whole-genome sequence analysis reveals roles of regulatory mutations in BCL6 and BCL2 in follicular lymphoma. Sci. Rep. 2017, 7.
19. Miele, V.; Vaillant, C.; d'Aubenton-Carafa, Y.; Thermes, C.; Grange, T. DNA physical properties determine nucleosome occupancy from yeast to fly. Nucleic Acids Res. 2008, 36.
20. Slattery, M.; Riley, T.; Liu, P.; Abe, N.; Gomez-Alcala, P.; Dror, I.; Zhou, T.; Rohs, R.; Honig, B.; Bussemaker, H.J.; et al. Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell 2011, 147.
21. Rohs, R.; West, S.M.; Sosinsky, A.; Liu, P.; Mann, R.S.; Honig, B. The role of DNA shape in protein-DNA recognition. Nature 2009, 461.
22. Zhou, T.; Shen, N.; Yang, L.; Abe, N.; Horton, J.; Mann, R.S.; Bussemaker, H.J.; Gordan, R.; Rohs, R. Quantitative modeling of transcription factor binding specificities using DNA shape. Proc. Natl. Acad. Sci. USA 2015, 112.
23. Tsai, Z.T.; Shiu, S.H.; Tsai, H.K. Contribution of sequence motif, chromatin state, and DNA structure features to predictive models of transcription factor binding in Yeast. PLoS Comput. Biol. 2015, 11.
24. Yang, J.; Ramsey, S.A. A DNA shape-based regulatory score improves position-weight matrix-based recognition of transcription factor binding sites. Bioinformatics 2015, 31.
25. Weirauch, M.T.; Cote, A.; Norel, R.; Annala, M.; Zhao, Y.; Riley, T.R.; Saez-Rodriguez, J.; Cokelaer, T.; Vedenko, A.; Talukder, S.; et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 2013, 31.
26. Friedel, M.; Nikolajewa, S.; Suhnel, J.; Wilhelm, T. DiProDB: A database for dinucleotide properties. Nucleic Acids Res. 2009, 37, D37.
27. Dunham, I.; Kundaje, A.; Aldred, S.F.; Collins, P.J.; Davis, C.A.; Doyle, F.; Epstein, C.B.; Frietze, S.; Harrow, J.; Kaul, R.; et al. An integrated encyclopedia of DNA elements in the human genome. Nature 2012, 489.
28. Swinstead, E.E.; Miranda, T.B.; Paakinaho, V.; Baek, S.; Goldstein, I.; Hawkins, M.; Karpova, T.S.; Ball, D.; Mazza, D.; Lavis, L.D.; et al. Steroid receptors reprogram FoxA1 occupancy through dynamic chromatin transitions. Cell 2016, 165.
29. Shi, W.; Fornes, O.; Mathelier, A.; Wasserman, W.W. Evaluating the impact of single nucleotide variants on transcription factor binding. Nucleic Acids Res. 2016, 44.
30. Zhou, T.; Yang, L.; Lu, Y.; Dror, I.; Dantas Machado, A.C.; Ghane, T.; Di Felice, R.; Rohs, R. DNAshape: A method for the high-throughput prediction of DNA structural features on a genomic scale. Nucleic Acids Res. 2013, 41, W56.
31. Wang, J. The effect of prior assumptions over the weights in BayesPI with application to study protein-DNA interactions from ChIP-based high-throughput data. BMC Bioinform. 2010, 11.
32. Mackay, D. Bayesian methods for adaptive models. Ph.D. Thesis, California Institute of Technology, Pasadena, CA, USA.
33. Bewley, C.A.; Gronenborn, A.M.; Clore, G.M. Minor groove-binding architectural proteins: Structure, function, and DNA recognition. Annu. Rev. Biophys. Biomol. Struct. 1998, 27.
34. Lazarovici, A.; Zhou, T.; Shafer, A.; Dantas Machado, A.C.; Riley, T.R.; Sandstrom, R.; Sabo, P.J.; Lu, Y.; Rohs, R.; Stamatoyannopoulos, J.A.; et al. Probing DNA shape and methylation state on a genomic scale with DNase I. Proc. Natl. Acad. Sci. USA 2013, 110.
35. Segal, E.; Fondufe-Mittendorf, Y.; Chen, L.; Thastrom, A.; Field, Y.; Moore, I.K.; Wang, J.P.; Widom, J. A genomic code for nucleosome positioning. Nature 2006, 442.
36. Mathelier, A.; Fornes, O.; Arenillas, D.J.; Chen, C.Y.; Denay, G.; Lee, J.; Shi, W.; Shyr, C.; Tan, G.; Worsley-Hunt, R.; et al. JASPAR 2016: A major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2016, 44, D110.
37. Wang, J.; Lan, X.; Hsu, P.Y.; Hsu, H.K.; Huang, K.; Parvin, J.; Huang, T.H.; Jin, V.X. Genome-wide analysis uncovers high frequency, strong differential chromosomal interactions and their associated epigenetic patterns in E2-mediated gene regulation. BMC Genom. 2013, 14.
38. Jozwik, K.M.; Carroll, J.S. Pioneer factors in hormone-dependent cancers. Nat. Rev. Cancer 2012, 12.
39. Naud, J.F.; McDuff, F.O.; Sauve, S.; Montagne, M.; Webb, B.A.; Smith, S.P.; Chabot, B.; Lavigne, P. Structural and thermodynamical characterization of the complete p21 gene product of Max. Biochemistry 2005, 44.
40. Sato, F.; Kawamoto, T.; Fujimoto, K.; Noshiro, M.; Honda, K.K.; Honma, S.; Honma, K.; Kato, Y. Functional analysis of the basic helix-loop-helix transcription factor DEC1 in circadian regulation. Interaction with BMAL1. Eur. J. Biochem. 2004, 271.
41. Bolshoy, A. CC dinucleotides contribute to the bending of DNA in chromatin. Nat. Struct. Biol. 1995, 2.

by the authors; Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license.


Suppl. Table S1. Characteristics of DHS regions analyzed by bisulfite sequencing. No. CpGs analyzed in the amplicon. Genomic location specificity Suppl. Table S1. Characteristics of DHS regions analyzed by bisulfite sequencing. DHS/GRE Genomic location Tissue specificity DHS type CpG density (per 100 bp) No. CpGs analyzed in the amplicon CpG within

More information

Shin Lin CS229 Final Project Identifying Transcription Factor Binding by the DNase Hypersensitivity Assay

Shin Lin CS229 Final Project Identifying Transcription Factor Binding by the DNase Hypersensitivity Assay BACKGROUND In the DNA of cell nuclei, transcription factors (TF) bind regulatory regions throughout the genome in tissue specific patterns. The binding occurs for three reasons: 1) the TF is present, 2)

More information

Gene expression analysis. Biosciences 741: Genomics Fall, 2013 Week 5. Gene expression analysis

Gene expression analysis. Biosciences 741: Genomics Fall, 2013 Week 5. Gene expression analysis Gene expression analysis Biosciences 741: Genomics Fall, 2013 Week 5 Gene expression analysis From EST clusters to spotted cdna microarrays Long vs. short oligonucleotide microarrays vs. RT-PCR Methods

More information

Editorial. Current Computational Models for Prediction of the Varied Interactions Related to Non-Coding RNAs

Editorial. Current Computational Models for Prediction of the Varied Interactions Related to Non-Coding RNAs Editorial Current Computational Models for Prediction of the Varied Interactions Related to Non-Coding RNAs Xing Chen 1,*, Huiming Peng 2, Zheng Yin 3 1 School of Information and Electrical Engineering,

More information

Figure S1. nuclear extracts. HeLa cell nuclear extract. Input IgG IP:ORC2 ORC2 ORC2. MCM4 origin. ORC2 occupancy

Figure S1. nuclear extracts. HeLa cell nuclear extract. Input IgG IP:ORC2 ORC2 ORC2. MCM4 origin. ORC2 occupancy A nuclear extracts B HeLa cell nuclear extract Figure S1 ORC2 (in kda) 21 132 7 ORC2 Input IgG IP:ORC2 32 ORC C D PRKDC ORC2 occupancy Directed against ORC2 C-terminus (sc-272) MCM origin 2 2 1-1 -1kb

More information

Nature Genetics: doi: /ng Supplementary Figure 1. H3K27ac HiChIP enriches enhancer promoter-associated chromatin contacts.

Nature Genetics: doi: /ng Supplementary Figure 1. H3K27ac HiChIP enriches enhancer promoter-associated chromatin contacts. Supplementary Figure 1 H3K27ac HiChIP enriches enhancer promoter-associated chromatin contacts. (a) Schematic of chromatin contacts captured in H3K27ac HiChIP. (b) Loop call overlap for cohesin HiChIP

More information

Computational Analysis of Ultra-high-throughput sequencing data: ChIP-Seq

Computational Analysis of Ultra-high-throughput sequencing data: ChIP-Seq Computational Analysis of Ultra-high-throughput sequencing data: ChIP-Seq Philipp Bucher Wednesday January 21, 2009 SIB graduate school course EPFL, Lausanne Data flow in ChIP-Seq data analysis Level 1:

More information

Finding Compensatory Pathways in Yeast Genome

Finding Compensatory Pathways in Yeast Genome Finding Compensatory Pathways in Yeast Genome Olga Ohrimenko Abstract Pathways of genes found in protein interaction networks are used to establish a functional linkage between genes. A challenging problem

More information

Positional Preference of Rho-Independent Transcriptional Terminators in E. Coli

Positional Preference of Rho-Independent Transcriptional Terminators in E. Coli Positional Preference of Rho-Independent Transcriptional Terminators in E. Coli Annie Vo Introduction Gene expression can be regulated at the transcriptional level through the activities of terminators.

More information

The ChIP-Seq project. Giovanna Ambrosini, Philipp Bucher. April 19, 2010 Lausanne. EPFL-SV Bucher Group

The ChIP-Seq project. Giovanna Ambrosini, Philipp Bucher. April 19, 2010 Lausanne. EPFL-SV Bucher Group The ChIP-Seq project Giovanna Ambrosini, Philipp Bucher EPFL-SV Bucher Group April 19, 2010 Lausanne Overview Focus on technical aspects Description of applications (C programs) Where to find binaries,

More information

ChIP-seq and RNA-seq

ChIP-seq and RNA-seq ChIP-seq and RNA-seq Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions (ChIPchromatin immunoprecipitation)

More information

Human housekeeping genes are compact

Human housekeeping genes are compact Human housekeeping genes are compact Eli Eisenberg and Erez Y. Levanon Compugen Ltd., 72 Pinchas Rosen Street, Tel Aviv 69512, Israel Abstract arxiv:q-bio/0309020v1 [q-bio.gn] 30 Sep 2003 We identify a

More information

Functional microrna targets in protein coding sequences. Merve Çakır

Functional microrna targets in protein coding sequences. Merve Çakır Functional microrna targets in protein coding sequences Martin Reczko, Manolis Maragkakis, Panagiotis Alexiou, Ivo Grosse, Artemis G. Hatzigeorgiou Merve Çakır 27.04.2012 microrna * micrornas are small

More information

Assessment of Algorithms for Inferring Positional Weight Matrix Motifs of Transcription Factor Binding Sites Using Protein Binding Microarray Data

Assessment of Algorithms for Inferring Positional Weight Matrix Motifs of Transcription Factor Binding Sites Using Protein Binding Microarray Data Assessment of Algorithms for Inferring Positional Weight Matrix Motifs of Transcription Factor Binding Sites Using Protein Binding Microarray Data Yaron Orenstein, Chaim Linhart, Ron Shamir* Blavatnik

More information

Molecular Biology (BIOL 4320) Exam #1 March 12, 2002

Molecular Biology (BIOL 4320) Exam #1 March 12, 2002 Molecular Biology (BIOL 4320) Exam #1 March 12, 2002 Name KEY SS# This exam is worth a total of 100 points. The number of points each question is worth is shown in parentheses after the question number.

More information

Generating & Designing DNA with Deep Generative Models. Nathan Killoran, Leo J. Lee, Andrew Delong, David Duvenaud, Brendan J. Fey

Generating & Designing DNA with Deep Generative Models. Nathan Killoran, Leo J. Lee, Andrew Delong, David Duvenaud, Brendan J. Fey Generating & Designing DNA with Deep Generative Models Nathan Killoran, Leo J. Lee, Andrew Delong, David Duvenaud, Brendan J. Fey Presented by Yomna Omar on February 16th, 2018 What is DNA sequencing?

More information

Gene splice sites correlate with nucleosome positions

Gene splice sites correlate with nucleosome positions Gene splice sites correlate with nucleosome positions Simon Kogan and Edward N. Trifonov* Genome Diversity Center, Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel Abstract

More information

Drift versus Draft - Classifying the Dynamics of Neutral Evolution

Drift versus Draft - Classifying the Dynamics of Neutral Evolution Drift versus Draft - Classifying the Dynamics of Neutral Evolution Alison Feder December 3, 203 Introduction Early stages of this project were discussed with Dr. Philipp Messer Evolutionary biologists

More information

Genome-Wide Survey of MicroRNA - Transcription Factor Feed-Forward Regulatory Circuits in Human. Supporting Information

Genome-Wide Survey of MicroRNA - Transcription Factor Feed-Forward Regulatory Circuits in Human. Supporting Information Genome-Wide Survey of MicroRNA - Transcription Factor Feed-Forward Regulatory Circuits in Human Angela Re #, Davide Corá #, Daniela Taverna and Michele Caselle # equal contribution * corresponding author,

More information

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger Protein-Protein-Interaction Networks Ulf Leser, Samira Jaeger This Lecture Protein-protein interactions Characteristics Experimental detection methods Databases Biological networks Ulf Leser: Introduction

More information

Year III Pharm.D Dr. V. Chitra

Year III Pharm.D Dr. V. Chitra Year III Pharm.D Dr. V. Chitra 1 Genome entire genetic material of an individual Transcriptome set of transcribed sequences Proteome set of proteins encoded by the genome 2 Only one strand of DNA serves

More information

Non-coding Function & Variation, MPRAs II. Mike White Bio /5/18

Non-coding Function & Variation, MPRAs II. Mike White Bio /5/18 Non-coding Function & Variation, MPRAs II Mike White Bio 5488 3/5/18 MPRA Review Problem 1: Where does your CRE DNA come from? DNA synthesis Genomic fragments Targeted regulome capture Problem 2: How do

More information

Nature Methods: doi: /nmeth Supplementary Figure 1. Construction of a sensitive TetR mediated auxotrophic off-switch.

Nature Methods: doi: /nmeth Supplementary Figure 1. Construction of a sensitive TetR mediated auxotrophic off-switch. Supplementary Figure 1 Construction of a sensitive TetR mediated auxotrophic off-switch. A Production of the Tet repressor in yeast when conjugated to either the LexA4 or LexA8 promoter DNA binding sequences.

More information

Huijuan Feng, Shining Ma,Chao Ye & Zhixing Feng

Huijuan Feng, Shining Ma,Chao Ye & Zhixing Feng Huijuan Feng, Shining Ma,Chao Ye & Zhixing Feng Background-Author introduction Research interest: Methods for gene mapping of complex traits Inference of population structure from genetic data Genome variation

More information

Supplementary Figure 1 Strategy for parallel detection of DHSs and adjacent nucleosomes

Supplementary Figure 1 Strategy for parallel detection of DHSs and adjacent nucleosomes Supplementary Figure 1 Strategy for parallel detection of DHSs and adjacent nucleosomes DNase I cleavage DNase I DNase I digestion Sucrose gradient enrichment Small Large F1 F2...... F9 F1 F1 F2 F3 F4

More information

Department of Psychology, Ben Gurion University of the Negev, Beer Sheva, Israel;

Department of Psychology, Ben Gurion University of the Negev, Beer Sheva, Israel; Polygenic Selection, Polygenic Scores, Spatial Autocorrelation and Correlated Allele Frequencies. Can We Model Polygenic Selection on Intellectual Abilities? Davide Piffer Department of Psychology, Ben

More information

ChIP-seq analysis 2/28/2018

ChIP-seq analysis 2/28/2018 ChIP-seq analysis 2/28/2018 Acknowledgements Much of the content of this lecture is from: Furey (2012) ChIP-seq and beyond Park (2009) ChIP-seq advantages + challenges Landt et al. (2012) ChIP-seq guidelines

More information

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015 ChIP-Seq Data Analysis J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015 What s the Question? Where do Transcription Factors (TFs) bind genomic DNA 1? (Where do other things bind DNA

More information

ChIP-Seq Tools. J Fass UCD Genome Center Bioinformatics Core Wednesday September 16, 2015

ChIP-Seq Tools. J Fass UCD Genome Center Bioinformatics Core Wednesday September 16, 2015 ChIP-Seq Tools J Fass UCD Genome Center Bioinformatics Core Wednesday September 16, 2015 What s the Question? Where do Transcription Factors (TFs) bind genomic DNA 1? (Where do other things bind DNA or

More information

Distinct Modes of Regulation by Chromatin Encoded through Nucleosome Positioning Signals

Distinct Modes of Regulation by Chromatin Encoded through Nucleosome Positioning Signals Encoded through Nucleosome Positioning Signals Yair Field 1., Noam Kaplan 1., Yvonne Fondufe-Mittendorf 2., Irene K. Moore 2, Eilon Sharon 1, Yaniv Lubling 1, Jonathan Widom 2 *, Eran Segal 1,3 * 1 Department

More information

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014 ChIP-Seq Data Analysis J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014 What s the Question? Where do Transcription Factors (TFs) bind genomic DNA 1? (Where do other things bind

More information

Discovering gene regulatory control using ChIP-chip and ChIP-seq. An introduction to gene regulatory control, concepts and methodologies

Discovering gene regulatory control using ChIP-chip and ChIP-seq. An introduction to gene regulatory control, concepts and methodologies Discovering gene regulatory control using ChIP-chip and ChIP-seq An introduction to gene regulatory control, concepts and methodologies Ian Simpson ian.simpson@.ed.ac.uk bit.ly/bio2_2012 The Central Dogma

More information

Gene Expression Data Analysis

Gene Expression Data Analysis Gene Expression Data Analysis Bing Zhang Department of Biomedical Informatics Vanderbilt University bing.zhang@vanderbilt.edu BMIF 310, Fall 2009 Gene expression technologies (summary) Hybridization-based

More information

Study on the Application of Data Mining in Bioinformatics. Mingyang Yuan

Study on the Application of Data Mining in Bioinformatics. Mingyang Yuan International Conference on Mechatronics Engineering and Information Technology (ICMEIT 2016) Study on the Application of Mining in Bioinformatics Mingyang Yuan School of Science and Liberal Arts, New

More information

What Can the Epigenome Teach Us About Cellular States and Diseases?

What Can the Epigenome Teach Us About Cellular States and Diseases? What Can the Epigenome Teach Us About Cellular States and Diseases? (a computer scientist s view) Luca Pinello Outline Epigenetic: the code over the code What can we learn from epigenomic data? Resources

More information

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks Bayesian Variable Selection and Data Integration for Biological Regulatory Networks Shane T. Jensen Department of Statistics The Wharton School, University of Pennsylvania stjensen@wharton.upenn.edu Gary

More information

Chapter 19 Genetic Regulation of the Eukaryotic Genome. A. Bergeron AP Biology PCHS

Chapter 19 Genetic Regulation of the Eukaryotic Genome. A. Bergeron AP Biology PCHS Chapter 19 Genetic Regulation of the Eukaryotic Genome A. Bergeron AP Biology PCHS 2 Do Now - Eukaryotic Transcription Regulation The diagram below shows five genes (with their enhancers) from the genome

More information

Convolutional Kitchen Sinks for Transcription Factor Binding Site Prediction

Convolutional Kitchen Sinks for Transcription Factor Binding Site Prediction Convolutional Kitchen Sinks for Transcription Factor Binding Site Prediction Alyssa Morrow*, Vaishaal Shankar* Anthony Joseph, Benjamin Recht, Nir Yosef Transcription Factor A protein that binds to DNA

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Functional Genomics: Microarray Data Analysis Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Outline Introduction Working with microarray data Normalization Analysis

More information

Motivation From Protein to Gene

Motivation From Protein to Gene MOLECULAR BIOLOGY 2003-4 Topic B Recombinant DNA -principles and tools Construct a library - what for, how Major techniques +principles Bioinformatics - in brief Chapter 7 (MCB) 1 Motivation From Protein

More information

Gene function prediction. Computational analysis of biological networks. Olga Troyanskaya, PhD

Gene function prediction. Computational analysis of biological networks. Olga Troyanskaya, PhD Gene function prediction Computational analysis of biological networks. Olga Troyanskaya, PhD Available Data Coexpression - Microarrays Cells of Interest Known DNA sequences Isolate mrna Glass slide Resulting

More information

ChIP-seq data analysis with Chipster. Eija Korpelainen CSC IT Center for Science, Finland

ChIP-seq data analysis with Chipster. Eija Korpelainen CSC IT Center for Science, Finland ChIP-seq data analysis with Chipster Eija Korpelainen CSC IT Center for Science, Finland chipster@csc.fi What will I learn? Short introduction to ChIP-seq Analyzing ChIP-seq data Central concepts Analysis

More information

BIOINFORMATICS THE MACHINE LEARNING APPROACH

BIOINFORMATICS THE MACHINE LEARNING APPROACH 88 Proceedings of the 4 th International Conference on Informatics and Information Technology BIOINFORMATICS THE MACHINE LEARNING APPROACH A. Madevska-Bogdanova Inst, Informatics, Fac. Natural Sc. and

More information

Permutation Clustering of the DNA Sequence Facilitates Understanding of the Nonlinearly Organized Genome

Permutation Clustering of the DNA Sequence Facilitates Understanding of the Nonlinearly Organized Genome RESEARCH PROPOSAL Permutation Clustering of the DNA Sequence Facilitates Understanding of the Nonlinearly Organized Genome Qiao JIN School of Medicine, Tsinghua University Advisor: Prof. Xuegong ZHANG

More information

Section C: The Control of Gene Expression

Section C: The Control of Gene Expression Section C: The Control of Gene Expression 1. Each cell of a multicellular eukaryote expresses only a small fraction of its genes 2. The control of gene expression can occur at any step in the pathway from

More information

Simultaneous profiling of transcriptome and DNA methylome from a single cell

Simultaneous profiling of transcriptome and DNA methylome from a single cell Additional file 1: Supplementary materials Simultaneous profiling of transcriptome and DNA methylome from a single cell Youjin Hu 1, 2, Kevin Huang 1, 3, Qin An 1, Guizhen Du 1, Ganlu Hu 2, Jinfeng Xue

More information

Parameters tuning boosts hypersmurf predictions of rare deleterious non-coding genetic variants

Parameters tuning boosts hypersmurf predictions of rare deleterious non-coding genetic variants Parameters tuning boosts hypersmurf predictions of rare deleterious non-coding genetic variants The regulatory code that determines whether and how a given genetic variant affects the function of a regulatory

More information

Bioinformatics : Gene Expression Data Analysis

Bioinformatics : Gene Expression Data Analysis 05.12.03 Bioinformatics : Gene Expression Data Analysis Aidong Zhang Professor Computer Science and Engineering What is Bioinformatics Broad Definition The study of how information technologies are used

More information

A Brief History. Bootstrapping. Bagging. Boosting (Schapire 1989) Adaboost (Schapire 1995)

A Brief History. Bootstrapping. Bagging. Boosting (Schapire 1989) Adaboost (Schapire 1995) A Brief History Bootstrapping Bagging Boosting (Schapire 1989) Adaboost (Schapire 1995) What s So Good About Adaboost Improves classification accuracy Can be used with many different classifiers Commonly

More information

Computational Systems Biology Deep Learning in the Life Sciences

Computational Systems Biology Deep Learning in the Life Sciences Computational Systems Biology Deep Learning in the Life Sciences 6.802 6.874 20.390 20.490 HST.506 Christina Ji April 6, 2017 DanQ: a hybrid convolutional and recurrent deep neural network for quantifying

More information

The first thing you will see is the opening page. SeqMonk scans your copy and make sure everything is in order, indicated by the green check marks.

The first thing you will see is the opening page. SeqMonk scans your copy and make sure everything is in order, indicated by the green check marks. Open Seqmonk Launch SeqMonk The first thing you will see is the opening page. SeqMonk scans your copy and make sure everything is in order, indicated by the green check marks. SeqMonk Analysis Page 1 Create

More information

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University Machine learning applications in genomics: practical issues & challenges Yuzhen Ye School of Informatics and Computing, Indiana University Reference Machine learning applications in genetics and genomics

More information

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist Whole Transcriptome Analysis of Illumina RNA- Seq Data Ryan Peters Field Application Specialist Partek GS in your NGS Pipeline Your Start-to-Finish Solution for Analysis of Next Generation Sequencing Data

More information