Advanced topics in bioinformatics


1 Feinberg Graduate School of the Weizmann Institute of Science Advanced topics in bioinformatics Shmuel Pietrokovski & Eitan Rubin Spring 2003 Course WWW site:

2 Lecture 6, 9/4/2003: Advanced block usage - LAMA and CYRCA

3 Progression of sequence analysis methods
Protein sequences
Sequence alignment
Protein families
Multiple sequence alignment
Protein motifs (blocks)
Motif-to-sequence and sequence-to-motif alignments

4 Progression of sequence analysis methods
Protein sequences
Sequence alignment
Protein families
Multiple sequence alignment
Protein motifs (blocks)
Block-to-block and block-set alignment: pairwise and multiple-block alignments

5 Pairwise alignment of blocks
[Figure: block columns compared as distributions over the 20 amino acids A R N D C Q E G H I L K M F P S T W Y V]

6 Column comparison measures
A_i and B_i are the values in columns A and B, \bar{A} and \bar{B} are the means of the values in columns A and B, and S is the sum of the 20 values in each column.

i) Normalized Euclidean distance (d):
\[ d(a,b) = \sqrt{\sum_{i=1}^{20} (A_i - B_i)^2} \]
The distance scores range from 0, for identical columns, to \sqrt{2}\,S. This largest possible distance occurs when only one amino acid appears in each column and that amino acid differs between the two columns.

ii) Pearson's correlation coefficient (r):
\[ r(a,b) = \frac{\sum_{i=1}^{20} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{20} (A_i - \bar{A})^2 \cdot \sum_{i=1}^{20} (B_i - \bar{B})^2}} \]
The correlation scores range from 1, for columns with identical value distributions, to -1, for columns with opposite value distributions (in each column only 10 amino acids occur and those 10 amino acids are different in the two compared columns).

iii) Spearman rank correlation coefficient (rho): this is identical to ii) except that the ranks of the amino acid values are used instead of the values themselves.

iv) Sum of products of corresponding amino acids (p):
\[ p(a,b) = \sum_{i=1}^{20} A_i B_i \]
This measure is an extrapolation of scoring a residue against a PSSM column. The scores of this measure range from S^2, when a single identical amino acid occurs in both columns, to 0, when the columns share no common amino acid.
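The four measures can be written out as a minimal Python sketch. It assumes each column is a plain list of 20 non-negative values in a fixed amino acid order. The function names, the tie handling in the ranking (average ranks) and the value returned for a flat column (0.0) are illustrative assumptions, not part of LAMA.

```python
import math

def euclidean_distance(a, b):
    """Measure i): square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Measure ii): Pearson's correlation coefficient r."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Assumption: r is undefined for a flat column; return 0.0 here.
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def ranks(values):
    """Rank values from 1 (smallest) upward; ties get their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def spearman(a, b):
    """Measure iii): Pearson's r computed on ranks instead of values."""
    return pearson(ranks(a), ranks(b))

def sum_of_products(a, b):
    """Measure iv): sum of products of corresponding values."""
    return sum(x * y for x, y in zip(a, b))
```

With S = 1, two columns that each contain a single, different amino acid give a distance of sqrt(2) and a sum of products of 0, matching the ranges stated above.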

7 Column comparison measures
[Table: the measures d, r, rho and p (rows) compared by TPs below TNs (a), rank of lowest TP, equivalence number (c) and ROC area (d)]
Alignments were global, i.e., the shorter block was slid across the longer block. The total number of scores was 5.4 million, of which 293 were true positives (TPs).
a True negatives (TNs) possibly include uncatalogued TPs.
b The rank of the % is 270.
c Equivalence number is the number of TPs below the rank where it is equal to the number of TNs above that rank. The number is 0 when all the TPs are above the TNs.
d The receiver operating characteristic (ROC) curve shows the number of TNs on the x-axis and the number of TPs above that TN on the y-axis. The area under the curve is 1 when all the TPs are above the TNs.
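The equivalence number and ROC area defined in the footnotes can be computed from a ranked list of scores. The sketch below is a minimal illustration, assuming each comparison is a `(score, is_tp)` pair in which a higher score is better (a distance such as d would have to be negated first). The function names are hypothetical, ties are not treated specially, and the ROC area is normalized by the number of TP-TN pairs so that a perfect ranking gives 1.

```python
def rank_by_score(scored):
    """Sort (score, is_tp) pairs from best (highest) to worst score."""
    return sorted(scored, key=lambda item: item[0], reverse=True)

def equivalence_number(scored):
    """Number of TPs below the rank at which it equals the number of TNs above.

    Moving the cut down by one rank either removes one TP from below it or
    adds one TN above it, so the two counts meet exactly at rank n_tp.
    """
    ranked = rank_by_score(scored)
    n_tp = sum(1 for _, is_tp in ranked if is_tp)
    return sum(1 for _, is_tp in ranked[:n_tp] if not is_tp)

def roc_area(scored):
    """Normalized area under the curve of TPs found versus TNs found."""
    ranked = rank_by_score(scored)
    n_tp = sum(1 for _, is_tp in ranked if is_tp)
    n_tn = len(ranked) - n_tp
    if n_tp == 0 or n_tn == 0:
        raise ValueError("need at least one TP and one TN")
    tp_seen = 0
    area = 0
    for _, is_tp in ranked:
        if is_tp:
            tp_seen += 1
        else:
            # Each TN contributes the number of TPs ranked above it.
            area += tp_seen
    return area / (n_tp * n_tn)
```

For a ranking in which every TP scores above every TN, `equivalence_number` returns 0 and `roc_area` returns 1.0, as in footnotes c and d.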

8 Distribution of r block column scores
All blocks in BLOCKS 8.0 (4.48*10^9 pairs of columns)
[Figure: histogram of column scores; x-axis: score; y-axis: number of column pairs, up to 5*10^8]

9 LAMA block to block alignment
Overview of the method
Each multiple alignment is treated as a sequence of amino acid distributions. Multiple alignments can then be aligned with each other by using an appropriate measure for scoring the similarity between amino acid distributions (analogous to the use of an amino acid substitution matrix in sequence-to-sequence alignment). The Smith-Waterman algorithm is used for local alignment of the multiple alignments, with no gaps allowed.
[Figure: a segment of amino acid (aa) distributions from block A aligned with a segment from block B]
Alignment score between segments from blocks A and B:
\[ S_{A_{2-5},B_{6-9}} = S_{A_2 B_6} + S_{A_3 B_7} + S_{A_4 B_8} + S_{A_5 B_9} \]
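An ungapped local alignment of two blocks can be sketched as a Smith-Waterman recurrence restricted to diagonal moves. This is a minimal illustration and not the LAMA source code: it assumes a block is a list of columns (each a list of 20 values), uses Pearson's r as the column score by default, and reports 0-based start positions. The names `align_blocks` and `column_score` are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson's correlation coefficient between two columns."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: treat a flat column as uncorrelated
    return cov / math.sqrt(var_a * var_b)

def align_blocks(block_a, block_b, column_score=pearson):
    """Best ungapped local alignment of two blocks.

    Returns (score, start_a, start_b, length). The score is the sum of the
    column scores of the aligned column pairs, as in the segment score above.
    """
    best = (0.0, 0, 0, 0)
    # prev[j] holds (score, length) of the best segment ending at
    # column i-1 of block A and column j-1 of block B.
    prev = [(0.0, 0)] * (len(block_b) + 1)
    for i, col_a in enumerate(block_a):
        curr = [(0.0, 0)]
        for j, col_b in enumerate(block_b):
            diag_score, diag_len = prev[j]
            score = diag_score + column_score(col_a, col_b)
            # Without gaps, only the diagonal move or a restart is allowed.
            cell = (score, diag_len + 1) if score > 0 else (0.0, 0)
            curr.append(cell)
            if cell[0] > best[0]:
                length = cell[1]
                best = (cell[0], i - length + 1, j - length + 1, length)
        prev = curr
    return best
```

Because gaps are not allowed, each cell depends only on its diagonal neighbour, so the best segment on every diagonal is found in a single pass over all column pairs.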

10 LAMA block to block alignment
Estimating the significance of alignment scores
To calibrate the LAMA scores, the Blocks Database was purged of biased blocks, the PSSMs of the remaining blocks were each shuffled, and the shuffled PSSMs were then compared against the blocks in the unshuffled database. The best score from each of the resulting 7 million comparisons was saved. These scores are due to chance and were used to estimate the significance of alignment scores between blocks.
The mean and variance of chance alignment scores depend on the length of the compared blocks: longer blocks give longer alignments and higher scores by chance alone. Grouping the chance scores by the length of the shorter block in each comparison gave very similar score distributions. The mean and standard deviation of each group were used to transform each score into a Z score, and the percentiles of all these Z scores were then calculated. These percentiles are used to estimate how many times a given score is expected to appear by chance, i.e., not due to a genuine relationship.
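The calibration steps (group chance scores by the length of the shorter block, convert to Z scores, read off percentiles) can be sketched as follows. This is an illustration under stated assumptions, not the published procedure: chance scores are given as a list of `(shorter_block_length, score)` pairs, the population standard deviation is used, every length group is assumed to contain at least two distinct scores, and all function names are hypothetical.

```python
import statistics
from bisect import bisect_left
from collections import defaultdict

def fit_length_groups(chance_scores):
    """Mean and standard deviation of chance scores per shorter-block length."""
    groups = defaultdict(list)
    for length, score in chance_scores:
        groups[length].append(score)
    return {
        length: (statistics.mean(scores), statistics.pstdev(scores))
        for length, scores in groups.items()
    }

def z_score(score, length, params):
    """Transform a raw score into a Z score using its length group."""
    mean, std = params[length]
    return (score - mean) / std

def chance_z_distribution(chance_scores, params):
    """Sorted Z scores of all chance comparisons, pooled over length groups."""
    return sorted(z_score(s, length, params) for length, s in chance_scores)

def expected_chance_hits(z, sorted_chance_z, n_comparisons):
    """Expected number of comparisons reaching Z score z by chance alone."""
    at_or_above = len(sorted_chance_z) - bisect_left(sorted_chance_z, z)
    return n_comparisons * at_or_above / len(sorted_chance_z)
```

A real score is converted with `z_score` using the length of the shorter of the two compared blocks, and `expected_chance_hits` then scales the chance fraction at or above that Z score by the number of comparisons made.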

11 LAMA block to block alignment
[Figure: shuffled-blocks scores, 52 partitions (grouping by length of shorter block); x-axis: Z score (mean + Z*std); y-axis: percentile, up to 1.0]

12 LAMA block to block alignment
Single hits, Z-score 8.3
Multiple hits, Z-score 5.6:
Independent: family A, family B
Repeats
Inner repeats
Blocks database blocks
~5 million pairwise comparisons (3174*3173/2)
1136 pairs with Z-scores
pairs (1/3) in hits:
- 69 pairs in 69 single hits
- 70 pairs in 25 independent multiple hits
- 60 pairs in 26 repeat multiple hits
- 183 pairs in 21 inner-repeat multiple hits

13 LAMA block to block alignment
Distribution of top scoring family pairs
[Table: relation type (columns: Genuine, Biased composition, Unknown, Total) by hit type (rows: Multiple block hits - independent, repeats, inner repeats; Single block hits; Total)]
Fraction: Genuine 80%, Biased composition 8%, Unknown 12%
Genuine relations were identified from the family descriptions, by detailed analysis of the literature, or by sharing common sequences (22 of the single and independent-multiple hits).

14 LAMA
Local Alignment of Multiple Alignments
Compares blocks with blocks.
An extremely sensitive method: it can find relationships undetected by other programs.

15 CYRCA
CYclical Relations Consistency Analysis
Input - Pairs of possibly related blocks, identified with a low-stringency alignment cutoff.
Output - Sets of consistently aligned blocks.
Approach - Consistent alignments include three or more blocks that have a transitive relation and are unambiguously aligned across the same region.
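The consistency idea can be illustrated for three blocks with a simplified model; the published CYRCA procedure may differ in its details. The sketch assumes each pairwise alignment is ungapped and stored as `(offset, start, length)`, meaning that position p of the first block aligns with position p + offset of the second, starting at `start` in the first block. A triple is called consistent when following A to B to C and back to A returns to the starting position and the three aligned regions overlap. All names are hypothetical.

```python
from itertools import combinations

def get_alignment(alignments, x, y):
    """Return (offset, start_in_x, length) for x -> y, or None if unaligned."""
    if (x, y) in alignments:
        return alignments[(x, y)]
    if (y, x) in alignments:
        # Invert a stored y -> x alignment.
        offset, start, length = alignments[(y, x)]
        return (-offset, start + offset, length)
    return None

def consistent_triple(alignments, a, b, c):
    """True if the alignments a-b, b-c and c-a close a cycle over one region."""
    ab = get_alignment(alignments, a, b)
    bc = get_alignment(alignments, b, c)
    ca = get_alignment(alignments, c, a)
    if ab is None or bc is None or ca is None:
        return False
    # Transitivity: going around the cycle must return to the same position.
    if ab[0] + bc[0] + ca[0] != 0:
        return False
    # Same region: express the three aligned segments in coordinates of a.
    ab_start = ab[1]
    bc_start = bc[1] - ab[0]
    ca_start = ca[1] + ca[0]
    common_start = max(ab_start, bc_start, ca_start)
    common_end = min(ab_start + ab[2], bc_start + bc[2], ca_start + ca[2])
    return common_end > common_start

def find_consistent_triples(alignments):
    """All block triples whose pairwise alignments are mutually consistent."""
    blocks = sorted({name for pair in alignments for name in pair})
    return [t for t in combinations(blocks, 3) if consistent_triple(alignments, *t)]
```

Larger consistent sets could be grown by merging triples that share a consistently aligned pair, but that step is left out of this sketch.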

16 CYRCA
Consistent alignments
[Figure: blocks A, B and C aligned pairwise in a cycle (A with B, B with C, C with A) through positions x, y and z]

17 CYRCA
Consistent alignments
[Figure: pairwise alignments among blocks A, B, C, D, E, L, M and N grouped into two consistent sets, Set 1 and Set 2]

18 Data condensation in progressive sequence analysis methods
SwissProt database: Protein sequences: 80,000
Blocks+ database: Protein families: 2,334
Blocks+ database: Blocks: 10,532
LAMA output: Possibly related block pairs: 5,
CYRCA output: Block sets (block pairs in these sets): (663)

19 HTH and HhH DNA binding domains Helix-hairpin-helix DNA binding domain Helix-turn-helix DNA binding domain

20 HTH and HhH DNA binding domains
E. coli AlkA 3-methyladenine DNA glycosylase II: MKTLQTFPGIGRWTA
Phage lambda CI repressor: QESVADKMGMGQSGV

21 HTH and HhH DNA binding domains
[Figure: structural superposition of the segments QESVADKMGMGQSGV and MKTLQTFPGIGRWTA, two positions marked with asterisks; RMSD in Å]
E. coli AlkA 3-methyladenine DNA glycosylase II: MKTLQTFPGIGRWTA
Phage lambda CI repressor: QESVADKMGMGQSGV

22 More details, sources and things to do for next class
Sources:
Pietrokovski, S. "Searching Databases of Conserved Sequence Regions by Aligning Protein Multiple-Alignments" Nucleic Acids Research 24: (1996).
Kunin, V., Chan, B., Sitbon, E., Lithwick, G., & Pietrokovski, S. "Consistency analysis of similarity between multiple alignments: prediction of protein function and fold structure from analysis of local sequence motifs" Journal of Molecular Biology 307: (2001).

23 More details, sources and things to do for next class
Assignment:
Read the source articles.
What is common and what is different between the LAMA MSA-to-MSA comparison method and the procedure used in the clustalw method?
CYRCA creates multiple alignments of blocks. In which ways is it similar to, and in which ways is it different from, the methods for multiple alignment of sequences we discussed?