Characterization of transcription factor binding sites by high-throughput SELEX. Overview of the HTPSELEX Database



Transcription Factor Binding Sites: Features and Facts
- Degenerate sequence motifs
- Typical length: 6-20 bp
- Low information content: 8-12 bits (1 site per … bp)
- Quantitative recognition mechanism: the measurable affinity of different sites may vary over three orders of magnitude
- Regulatory function often depends on cooperative interactions with neighboring sites

Representation of the binding specificity by a scoring matrix (also referred to as a weight matrix)

[Figure: scoring matrix with one row per base. Example scores: strong binding site = 43, random sequence = -83.]
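A scoring matrix assigns a site the sum of its position-specific weights. The following is a minimal sketch of that summation; the four-position matrix `WEIGHTS` and its integer values are hypothetical and are not the values shown on the slide.

```python
# Hypothetical 4-position weight matrix (illustrative values only):
# one dict per matrix position, mapping base -> weight.
WEIGHTS = [
    {"A": 10, "C": -5, "G": -5, "T": -8},
    {"A": -6, "C": 12, "G": -7, "T": -4},
    {"A": -9, "C": -3, "G": 11, "T": -6},
    {"A": -2, "C": -8, "G": -5, "T": 9},
]


def score_site(site, weights=WEIGHTS):
    """Return the additive score of a site as long as the matrix."""
    if len(site) != len(weights):
        raise ValueError("site length must equal matrix length")
    # Each position contributes independently; the score is the plain sum.
    return sum(column[base] for column, base in zip(weights, site))


strong = score_site("ACGT")  # matches the preferred base at every position
weak = score_site("TGAA")    # mismatches at every position
```

With this toy matrix the preferred site scores well above the mismatched one, mirroring the contrast between a strong binding site and a random sequence.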


Physical interpretation of a weight matrix
- Weight matrix elements represent relative binding energies between DNA base pairs and protein surface areas (base-pair acceptor sites).
- A weight matrix column describes the base preferences of one base-pair acceptor site.

Berg-von Hippel model of protein-DNA interactions

The weight matrix score expresses the binding free energy of the protein-DNA complex in arbitrary units:

$$\Delta G(x) = -S(x) + \mathrm{const.}, \qquad S(x) = \sum_{i=1}^{N} w_i(x_i)$$

It is convenient to express the binding free energy in dimension-free RT units:

$$E(x) = \sum_{i=1}^{N} \varepsilon_i(x_i), \qquad \varepsilon_i(b) \propto -\frac{w_i(b)}{RT}$$

On a relative scale, the binding constant for sequence x is given by:

$$K_{rel}(x) = e^{-E(x)}$$

For sequences longer than the weight matrix:

$$K_{rel}(x) = \sum_i e^{-E(x_i \ldots x_{i+N-1})} \quad \text{or} \quad K_{rel}(x) = \max_i e^{-E(x_i \ldots x_{i+N-1})}$$

(the index i runs over all subsequence starting positions on both strands)
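The relative binding constant of a sequence longer than the matrix can be sketched as follows. This is an illustration under stated assumptions: it takes K_rel = exp(-E) with E in RT units, scans every window on both strands, and offers both the sum and the max variant. The two-position energy matrix `EPS` is hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def energy(site, eps):
    """E(x): sum of per-position energies (assumed to be in RT units)."""
    return sum(column[base] for column, base in zip(eps, site))


def k_rel(seq, eps, mode="sum"):
    """Relative binding constant over all windows on both strands."""
    n = len(eps)
    if len(seq) < n:
        raise ValueError("sequence is shorter than the matrix")
    terms = []
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - n + 1):
            terms.append(math.exp(-energy(strand[i:i + n], eps)))
    return sum(terms) if mode == "sum" else max(terms)


# Hypothetical energy matrix: the best site "AC" has energy 0.
EPS = [
    {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0},
    {"A": 3.0, "C": 0.0, "G": 1.0, "T": 2.0},
]
```

The sum variant treats every window as an alternative binding position, whereas the max variant keeps only the strongest one, so the sum is never smaller than the max.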

Berg-von Hippel Theory and Information Content

The energy terms of a weight matrix can be computed from the base frequencies p_i(b) found in in vitro or in vivo selected binding sites:

$$\varepsilon_i(b) = -\frac{1}{\lambda} \ln \frac{p_i(b)}{q(b)}$$

where q(b) is the background frequency of base b and λ is an unknown parameter related to the stringency of the binding conditions.

The information content of a binding site has been defined as the relative entropy of the base frequency matrix with respect to the background base frequencies:

$$IC = \sum_{i=1}^{N} \sum_{b=A}^{T} p_i(b) \log_2 \frac{p_i(b)}{q(b)}$$

Paradox: λ depends on the selection conditions (e.g. the protein concentration); therefore the base frequencies observed in selected binding sites do not reflect a protein-intrinsic property.

Weight matrices/profiles from a biochemical viewpoint
- A weight matrix expresses the sequence specificity of a DNA-binding protein.
- A column describes the base preferences of a surface area of the DNA-binding protein.
- Weights of a weight matrix can be interpreted as additive binding energy contributions. No interactions between binding site positions!
- According to the Berg-von Hippel theory, negated binding energies are proportional to the logarithms of the base frequencies observed in an in vivo or in vitro selected set of binding sites.
- Weight matrices can thus be used to compute relative binding energies or dissociation constants for oligonucleotides of any sequence, which in turn can be determined experimentally by gel shift experiments.
- An accurate weight matrix for the binding specificity of a transcription factor is one that accurately predicts binding constants.
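Both quantities can be computed directly from a base frequency matrix. The sketch below assumes the sign convention in which preferred bases receive lower energies, treats λ as a user-supplied value (`lam`, since the true value is unknown), and requires nonzero frequencies for the energy calculation. The example frequencies are hypothetical.

```python
import math


def energies(freqs, background, lam=1.0):
    """eps_i(b) = -(1/lam) * ln(p_i(b) / q(b)); all p_i(b) must be > 0."""
    return [
        {b: -math.log(p[b] / background[b]) / lam for b in p}
        for p in freqs
    ]


def information_content(freqs, background):
    """Relative entropy in bits, summed over positions and bases."""
    total = 0.0
    for p in freqs:
        for base, pb in p.items():
            if pb > 0:  # a zero frequency contributes nothing to the sum
                total += pb * math.log2(pb / background[base])
    return total


BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical two-column frequency matrix: one biased, one uniform column.
FREQS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
```

Doubling `lam` halves every energy while leaving the frequencies, and hence the information content, unchanged. This is the dependence on selection stringency that the paradox refers to.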

Experimental techniques for estimating the parameters of a TF specificity matrix
- Competitive bandshifts (EMSA) -> relative binding constants of oligonucleotides
- Alignment of in vivo sites -> base frequency matrix (from … sequences)
- In vitro selection (SELEX) -> base frequency matrix (up to 200 sequences)
- SAGE/SELEX -> base frequency matrix (up to … binding sequences)
- Exhaustive mutagenesis + K_rel assay -> intrinsic specificity matrix
- Protein binding arrays + magic algorithm -> intrinsic specificity matrix

Some problems and limitations:
- A base probability matrix is generated by an alignment or probabilistic modeling algorithm, not by direct observation
- K_rel is usually not very precise (within a factor of 2)
- Point mutations may create a binding site in another frame

Modeling of a Transcription Factor Binding Site from High Throughput SELEX Data Using a Hidden Markov Modeling Approach

Emmanuelle Roulet, Nicolas Mermod (Center for Biotechnology UNIL-EPFL, Lausanne, Switzerland)
Anamaria A Camargo, Andrew JG Simpson (Ludwig Institute of Cancer Research, Sao Paulo, Brazil)
Philipp Bucher (Swiss Institute for Experimental Cancer Research and Swiss Institute of Bioinformatics, Epalinges s/Lausanne, Switzerland)
Nat. Biotechnol. 20 (2002)


Motivation and Goals of the Project

Motivation: Accurate and reliable computational tools to predict transcription factor binding sites are still not available.

Potential reasons:
1. Lack of adequate experimental data
2. Lack of adequate computational models
3. Lack of an adequate method to estimate the parameters of a computational model from the experimental data

Goal: To develop a combined computational-experimental protocol to derive an accurate predictive model of the sequence specificity of a DNA-binding protein

Potential benefits:
1. Being able to predict transcription factor binding in genome sequences
2. Insights into molecular mechanisms of sequence-specific protein-DNA interactions
3. Ability to rationally design gene control regions with desired properties for biotechnological applications

Our Approach to the Problem of Characterizing the Sequence Specificity of a DNA-Binding Transcription Factor
1. Choice of a quantitative predictive model for representing the binding specificity. Our choice: a profile HMM
2. Choice of an experimental method to generate data for estimating the model parameters. Our choice: a SELEX experiment
3. Choice of a machine learning algorithm to estimate the model parameters from the data. Our choice: the Baum-Welch HMM training algorithm
4. Validation of the approach and optimization of the experimental parameters by a computer simulation of steps 2 and 3
5. Adjustment of the experimental protocol to produce the necessary data as suggested by the computer simulation
6. Generation of the experimental data
7. Building a binding site model from the data
8. A posteriori validation of the model by cross-validation and comparison with independent experimental results

Study Object: Transcription Factor CTF/NFI
- Dimeric DNA-binding protein recognizing a palindromic sequence motif with consensus sequence TTGGC(N5)GCCAA
- First isolated as a replication factor of Adenovirus type 2
- Later independently isolated as a CCAAT-box binding transcription factor
- Can activate transcription of a reporter gene in transfected cells
- Recently shown to be implicated in regulatory pathways related to tumor progression and immune response
- Biochemical mechanism of gene regulation still elusive

Old CTF/NFI Binding Site Profile

[Figure: old CTF/NFI profile matrix with a scored example site; example score = 88.]

[Figure: SELEX-SAGE protocol. Steps shown: random sequence library; second strand synthesis by PCR (Primer 1); selection of binding sequences (gel shift); amplification (Primer 2); repeated selection cycles; digestion with Bgl II; concatemerization and cloning (site 1, site 2, site 3, ...); high-throughput sequencing.]

Principle of the Baum-Welch hidden Markov model training algorithm

[Figure: initial model, training sequences, and trained model.]

How does it work?
1. The initial model serves as the current model.
2. Training sequences are aligned to the current model.
3. New base and transition frequencies are estimated from the multiple alignment generated by step 2. The new model becomes the current model.
4. Steps 2 and 3 are repeated until convergence is reached.
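The align-then-re-estimate loop can be illustrated with a deliberately simplified sketch. It is not full Baum-Welch: it uses hard assignment (each sequence contributes only its single best window, in the style of Viterbi training), models a gapless fixed-width motif on one strand, and has no transition probabilities. The function names, the pseudocount, and the toy sequences are all hypothetical.

```python
import math

BASES = "ACGT"


def best_window(seq, model):
    """Step 2: align a sequence by picking its most probable window."""
    width = len(model)

    def log_prob(start):
        return sum(math.log(model[j][seq[start + j]]) for j in range(width))

    return max(range(len(seq) - width + 1), key=log_prob)


def reestimate(seqs, starts, width, pseudocount=0.5):
    """Step 3: new base frequencies from the aligned windows."""
    model = []
    for j in range(width):
        counts = {b: pseudocount for b in BASES}
        for seq, start in zip(seqs, starts):
            counts[seq[start + j]] += 1
        total = sum(counts.values())
        model.append({b: c / total for b, c in counts.items()})
    return model


def train(seqs, initial_model, max_iter=50):
    """Steps 1-4: iterate until the alignment no longer changes."""
    model, starts = initial_model, None
    for _ in range(max_iter):
        new_starts = [best_window(s, model) for s in seqs]
        if new_starts == starts:  # convergence: same alignment as before
            break
        starts = new_starts
        model = reestimate(seqs, starts, len(model))
    return model, starts


def weak_column(favored):
    """Initial column that only mildly prefers one base."""
    return {b: (0.4 if b == favored else 0.2) for b in BASES}


# Toy data: three sequences that each contain the window GATC.
SEQS = ["AAGATCAA", "CGATCCCC", "TTTTGATC"]
INITIAL = [weak_column(b) for b in "GATC"]
trained_model, trained_starts = train(SEQS, INITIAL)
```

On this toy input the loop settles after one re-estimation, with every sequence aligned at its GATC window. Baum-Welch proper would instead weight all possible alignments by their posterior probability and would also update the transition frequencies.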

Doing the Experiment

Results CTF/NFI

[Table: sequencing statistics per SELEX cycle (0 to 4) and in total. Clone statistics: sequence reads, clones, clones with detectable inserts. Site statistics: sites, different sites, and different sites at error-rate thresholds err < 0.01/bp and err < …/bp.]

New CTF/NFI model

[Figure: Hidden Markov Model (frequencies given in %) and scoring profile (relative energy units).]

Predicted and observed evolution of SELEX populations

[Figure: theoretically predicted affinity profiles of successive SELEX cycles (Djordjevic & Sengupta 2006), and weight matrix scores for successive CTF/NFI HTP SELEX populations (Roulet et al. 2002); axes run from low to high affinity.]

Major Differences between New and Old CTF/NFI Binding Site Models
- The new model contains a sixth half-site position, reducing the major spacer length class to 3. This extends the consensus half-site motif to TTGGCA.
- Alternative spacer length classes N4 and N5 (N6 and N7 according to the old numbering system) receive much more severe penalties in the new profile. Based on the estimated frequencies, it is not certain whether these binding modes occurred at all during SELEX amplification.
- The G mismatch at the first position of the half-site weight matrix has a much lower weight in the new model.

Quality Assessment of the New Model: Comparison of Predicted Binding Scores with in vitro Measured Binding Constants

Data from Meisterernst et al. (1988). Nucl. Acids Res. 16.


Beyond simple weight matrices: correlated dinucleotide analysis

HTP SELEX sequencing totals for members of the TCF family

[Table: total number of sites and total number of unique sites per SELEX library (LEF1/TCF-1α libraries LEF1_2 to LEF1_7 with sum; LEF1/TCF-1α with β-catenin libraries LBC_… and LBC_6 with sum; TCF4_3; overall total), at error rates < 0.01% per bp and < …% per bp.]

Base frequency tables for DNA binding sites of TCF family members derived by HTP SELEX

[Tables: PSSM of LEF1/TCF-1α, SELEX cycle 3, and PSSM of LEF1/TCF-1α, SELEX cycle 6. Each table gives base frequencies (rows A, C, G, T) for motif positions 1 to 10, with the consensus base marked per position.]


Sequence Logos for binding sites of TCF family proteins

[Figure: sequence logos for Lef-1, Lef-1/beta-catenin, and Tcf-4.]

Comparison of our TCF4 binding site with the motif obtained by affinity measurements

[Figure: sequence logo pasted from Hallikas et al. (2006). Cell 124:21. Motif obtained by competition assays with complete single base-substitution series.]

Note: at least one significant position is missing because of the a priori restriction of motif extension.


Overview of the HTPSELEX Database

Contents, from raw data to HMMs:
- Single-read sequencing chromatograms
- Clone sequences (assembled by Phred/Phrap)
- Site sequences with estimated sequencing errors
- HMMs for binding sites in two formats (decodeanhmm, MAMOT)

Additional features:
- Quality-controlled sequence download
- Access to selected low-throughput SELEX data
- Experimental and computational protocols

Example of an HTPSELEX clone entry

ID   LBC standard; DNA; UNC; 1023 BP.
XX
AC   LBC
XX
DT   -Jun-200
XX
DE   ' Sequence of SELEX/SAGE Clone : LBC of cycle
XX
KW   HTP SELEX/SAGE, invitro transcription factor binding sites
XX
OS   unidentified
OC   unidentified
XX
RN   [1]
RA   Emmanuelle Roulet, Stephane Busso, Anamaria A.Camargo, Andrew J.G Simpson,
RA   Nicolas Mermod, and Philipp Bucher.
RT   High-throughput SELEX-SAGE method for quantitative modelling of
RT   transcription-factor binding sites.
RL   Nature Biotechnology 20:831-835(2000)
XX
DR   TRACES;LBC 003F.scf
XX
FH   Key             Location/Qualifiers
FH
FT   source
FT                   /mol_type="unassigned DNA"
FT                   /organism="unidentified"
FT                   /tissue_type="selex"
FT   misc_binding
FT                   /bound_moiety ="LEF1/TCF with beta catenin "
FT                   /label="lbc 00003_1"
FT                   /note="base quality score is e-03"
FT   misc_binding
FT                   /bound_moiety ="LEF1/TCF with beta catenin "
FT                   /label="lbc 00003_2"
FT                   /note="base quality score is e-03"
XX
SQ   Sequence 1023 BP; 230 A; 291 C; 260 G; 242 T; 0 other;
     AAAACCAA AAAGGGGCA GAAGGGCC CCCGAGC GCCGAGCG GCCGCCAGG
     GAGGAA CGCAGAA CCAGCACAC GGCGGCCG ACAGGGA CAGGCGG
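Entries of this kind follow the EMBL flat-file convention: a two-letter line code starts each line, `XX` lines act as separators, and `//` ends an entry. A minimal parsing sketch under that assumption is given below; the `SAMPLE` entry, its identifiers, and the function names are made up for illustration, and the exact layout of real HTPSELEX entries is assumed rather than confirmed.

```python
from collections import defaultdict


def parse_entry(text):
    """Group the lines of one flat-file entry by two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, _, rest = line.partition(" ")
        if code == "//":  # end-of-entry marker
            break
        if code == "XX" or not code:
            # XX lines are separators; indented sequence lines have an
            # empty code and are skipped in this sketch.
            continue
        fields[code].append(rest.strip())
    return dict(fields)


def site_labels(entry):
    """Return the /label qualifier values found in the feature table."""
    prefix = "/label="
    return [
        value[len(prefix):].strip('"')
        for value in entry.get("FT", [])
        if value.startswith(prefix)
    ]


# Hypothetical entry used only to exercise the parser.
SAMPLE = """\
ID   EXAMPLE1 standard; DNA; UNC; 12 BP.
XX
AC   EXAMPLE1
XX
FT   misc_binding
FT                   /label="example_1"
FT   misc_binding
FT                   /label="example_2"
XX
SQ   Sequence 12 BP;
     ACGTACGTACGT
//
"""

entry = parse_entry(SAMPLE)
labels = site_labels(entry)
```

Each `misc_binding` feature corresponds to one selected site within the concatemer clone, so counting the labels gives the number of sites recovered from the clone.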


Computational Analysis of Ultra-high-throughput sequencing data: ChIP-Seq

Philipp Bucher
Wednesday January 21, 2009
SIB graduate school course, EPFL, Lausanne

Data flow in ChIP-Seq data analysis

Level 1: