Protein Structural Motifs Search in Protein Data Base
Virginio Cantoni 1, Alessio Ferone 2, Ozlem Ozbudak 3, Alfredo Petrosino 2
1 Dept. of Electrical Engineering and Computer Science, Pavia Univ., Italy
2 Dept. of Applied Science, Univ. Naples Parthenope, Italy
3 Dept. of Electronics and Communication Engineering, Istanbul Tech. Univ., Turkey
Protein Data Bank (PDB): http://www.rcsb.org/pdb/
Levels of protein structure representation: primary structure, secondary structure, tertiary structure, quaternary structure.
Primary structure: the sequence of amino acids.
Secondary structures. Three basic components: α-helices, β-sheets, and loops (linear connections between the components).
The α-helices. One of the most closely packed arrangements of residues. They account for ~40% of residues in globular proteins.
The β-sheet. A loosely packed arrangement of residues. Variants: parallel, antiparallel, twisted.
Secondary Structures Representation. Secondary structures are represented as linear vectors (segments): the axis for a helix and the best-fit segment for a sheet strand. An alignment algorithm matches a helix against helix segments with known axes to determine its axis. Sheet strands are fitted directly with a segment.
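The direct segment fit for a strand could be sketched as follows. The fitting method is not specified on the slide, so this assumes a least-squares line through the residue coordinates (for example the Cα atoms), found by power iteration on the scatter matrix. The function name `fit_segment` and its parameters are hypothetical.

```python
import math

def fit_segment(points, iterations=50):
    """Fit a 3D segment to an ordered list of (x, y, z) points.

    Assumption: the best-fit segment is the principal axis of the points,
    clipped to the extent of their projections on that axis.
    """
    n = len(points)
    center = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - center[i] for i in range(3)] for p in points]
    # 3x3 scatter matrix of the centered coordinates
    scatter = [[sum(q[i] * q[j] for q in centered) for j in range(3)]
               for i in range(3)]
    # Power iteration for the dominant eigenvector (fixed start vector;
    # it fails only if the start is orthogonal to the principal axis)
    d = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        d = [sum(scatter[i][j] * d[j] for j in range(3)) for i in range(3)]
        length = math.sqrt(sum(x * x for x in d))
        if length == 0:
            raise ValueError("cannot determine a direction from these points")
        d = [x / length for x in d]
    # Orient the axis from the first point towards the last one
    if sum(d[i] * (points[-1][i] - points[0][i]) for i in range(3)) < 0:
        d = [-x for x in d]
    proj = [sum(q[i] * d[i] for i in range(3)) for q in centered]
    start = tuple(center[i] + min(proj) * d[i] for i in range(3))
    end = tuple(center[i] + max(proj) * d[i] for i in range(3))
    return start, end
```

Calling `fit_segment` on the coordinates of one strand returns the two endpoints of its segment, which is the form the later sketches in this deck take as input.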
Secondary Structure Determination. Programs: DSSP and STRIDE. On average, 4.8% of the target residues are assigned differently, rising to 12% for certain targets.
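A percentage of this kind can be computed as below, assuming it refers to the per-residue disagreement between two secondary structure assignments of the same protein. The one-letter assignment strings are made-up illustrations, not DSSP or STRIDE output, and `disagreement_percent` is a hypothetical helper.

```python
def disagreement_percent(assign_a, assign_b):
    """Percentage of residues labelled differently in two assignments.

    Each assignment is a string with one secondary structure code per residue.
    """
    if len(assign_a) != len(assign_b):
        raise ValueError("assignments must cover the same residues")
    if not assign_a:
        raise ValueError("assignments are empty")
    differing = sum(1 for a, b in zip(assign_a, assign_b) if a != b)
    return 100.0 * differing / len(assign_a)

# Illustrative strings only: one residue out of ten differs
print(disagreement_percent("HHHHEEEECC", "HHHCEEEECC"))
```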
Protein Structure Comparison. Given a new protein, which folds in the PDB are the most similar?
Secondary structure representation. Each secondary structure is displayed as a cylinder. The protein is represented by an ordered sequence of cylinders, each carrying one of two labels: helix or sheet.
GHT applied to proteins. For every protein, two quantities are first calculated for each secondary structure: its distance from a reference point (RP, e.g. the geometric center of the protein), and the angle (theta) between its direction in 3D space and the segment linking its center to the RP. These values form the GH reference table (RT).
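A minimal sketch of building such a reference table is given below. It assumes each secondary structure is given as a typed segment with two 3D endpoints, and that the RP defaults to the mean of the segment midpoints (one possible reading of "geometric center"). The names `reference_table`, `rho` and `theta` are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(v):
    return math.sqrt(_dot(v, v))

def reference_table(segments, rp=None):
    """Return ([(type, rho, theta), ...], rp) for a list of segments.

    segments: list of (ss_type, start_xyz, end_xyz)
    rho:   distance from the segment midpoint to the reference point
    theta: angle between the segment direction and the midpoint-to-RP link
    """
    mids = [tuple((s + e) / 2 for s, e in zip(start, end))
            for _, start, end in segments]
    if rp is None:
        # Assumption: geometric center taken as the mean of the midpoints
        rp = tuple(sum(c) / len(mids) for c in zip(*mids))
    table = []
    for (ss_type, start, end), mid in zip(segments, mids):
        axis = _sub(end, start)
        link = _sub(rp, mid)
        rho = _norm(link)
        if rho == 0 or _norm(axis) == 0:
            theta = 0.0  # angle undefined; use 0 by convention
        else:
            cos_t = _dot(axis, link) / (_norm(axis) * rho)
            theta = math.acos(max(-1.0, min(1.0, cos_t)))
        table.append((ss_type, rho, theta))
    return table, rp
```

The table keeps the segment order, so the entries line up with the ordered sequence of secondary structures of the protein.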
In the way of GHT (simplified 2D representation)
[Figure: helices and sheets of the query protein (scaled 0.5), the mapping rule, and the votes space.]
In the way of GHT
[Figure: helices and sheets of the query protein, the mapping rule, and the votes space.]
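The voting step can be illustrated with a simplified 2D sketch. It assumes each secondary structure is reduced to a type, a midpoint and a directed axis angle, and that the reference table stores a signed angle so that each match casts a single vote. Scaling of the query is not handled. All names here are hypothetical.

```python
import math
from collections import Counter

def build_table(motif, rp):
    """For each motif segment store (type, rho, theta) relative to the RP.

    motif: list of (ss_type, (x, y), axis_angle_in_radians)
    theta is the signed angle from the segment axis to the midpoint-to-RP link.
    """
    table = []
    for ss_type, (mx, my), axis_angle in motif:
        dx, dy = rp[0] - mx, rp[1] - my
        table.append((ss_type, math.hypot(dx, dy),
                      math.atan2(dy, dx) - axis_angle))
    return table

def vote(target, table, cell=1.0):
    """Every target segment votes for the RP position implied by each
    table entry of the same type; votes are binned on a square grid."""
    votes = Counter()
    for ss_type, (mx, my), axis_angle in target:
        for t_type, rho, theta in table:
            if t_type != ss_type:
                continue
            x = mx + rho * math.cos(axis_angle + theta)
            y = my + rho * math.sin(axis_angle + theta)
            votes[(round(x / cell), round(y / cell))] += 1
    return votes

# A 3-segment motif, then the same motif rotated by 90 degrees and translated
motif = [("H", (3.0, 0.0), 0.0), ("H", (0.0, 4.0), 1.0), ("E", (-2.0, -1.0), 2.0)]
table = build_table(motif, rp=(0.0, 0.0))
target = [("H", (10.2, 8.3), math.pi / 2),
          ("H", (6.2, 5.3), 1.0 + math.pi / 2),
          ("E", (11.2, 3.3), 2.0 + math.pi / 2)]
peak_cell, peak_votes = vote(target, table).most_common(1)[0]
```

A cell that collects as many votes as the motif has secondary structures marks a candidate occurrence of the motif, here at the transformed reference point.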
Generalized Hough Transform (SSS)
[Figure: reference point and secondary structure A. Type A: α-helix, l1, TD..]
PROTEIN 1FNB. The protein contains 22 secondary structures. Searched motif: Greek key (4 β-sheets). The red circles are the helices and the blue circles are the sheets. The cyan-blue triangles indicate the orientation of the secondary structures. The black point is the reference point.
PROTEIN 7FAB. The protein contains 46 secondary structures. Searched motif: 3 helices and 2 sheets. The red circles are the helices and the blue circles are the sheets. The cyan-blue triangles indicate the orientation of the secondary structures. The black point is the reference point.
SSC: Secondary Structures Co-occurrences
[Figure: secondary structures A and B with the RP. Labelled quantities: axis angle, axis distance, midpoint distance, coplanar lines. Type A: α-helix, l1, TD.. Type B: α-helix, l2, TP..]
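The pairwise quantities named on this slide can be computed as sketched below. The exact definitions are not given here, so the sketch assumes that the axis angle is the angle between the two segment directions, the axis distance is the shortest distance between the two infinite axis lines, and the midpoint distance is the distance between segment midpoints. `pair_features` is a hypothetical name.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(v):
    return math.sqrt(_dot(v, v))

def _mid(start, end):
    return tuple((s + e) / 2 for s, e in zip(start, end))

def pair_features(seg_a, seg_b):
    """Return (axis_angle, axis_distance, midpoint_distance).

    Each segment is a (start_xyz, end_xyz) pair of non-zero length.
    """
    (a0, a1), (b0, b1) = seg_a, seg_b
    da, db = _sub(a1, a0), _sub(b1, b0)
    link = _sub(_mid(b0, b1), _mid(a0, a1))
    cos_angle = _dot(da, db) / (_norm(da) * _norm(db))
    axis_angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    n = _cross(da, db)
    if _norm(n) > 1e-9:
        # Skew or intersecting lines: project the link on the common normal
        axis_distance = abs(_dot(link, n)) / _norm(n)
    else:
        # Parallel axes: distance from B's midpoint to the line through A
        axis_distance = _norm(_cross(link, da)) / _norm(da)
    return axis_angle, axis_distance, _norm(link)
```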
SST: Secondary Structures Triplets
[Figure: triangle ABC formed by three secondary structures, with the reference point, the normal to ABC, and the side lengths l_AB, l_BC, l_CA. Type A: α-helix, l1. Type B: α-helix, l2. Type C: α-helix, l3.]
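The triangle quantities shown for a triplet can be sketched as follows, assuming the vertices A, B and C are the midpoints of the three secondary structures. How the reference point is derived from the triangle is not stated on the slide and is left out. `triplet_features` is a hypothetical name.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def triplet_features(a, b, c):
    """Return ((l_ab, l_bc, l_ca), unit_normal) for triangle ABC.

    a, b, c: 3D points, assumed here to be segment midpoints.
    """
    sides = (_norm(_sub(b, a)), _norm(_sub(c, b)), _norm(_sub(a, c)))
    n = _cross(_sub(b, a), _sub(c, a))
    length = _norm(n)
    if length == 0:
        raise ValueError("A, B and C are collinear; the normal is undefined")
    return sides, tuple(x / length for x in n)
```

The three side lengths do not change under rotation or translation of the protein, which is what makes them usable as a lookup key for matching triplets.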
4 SSs motif: Terns co-occurrence
Four terns: ABC, ACD, BCD, DAB.
[Figure: secondary structures A, B, C, D with the reference point. Type A: α-helix, l1. Type B: β-strand, l2. Type C: α-helix, l3. Type D: β-strand, l4.]
5 SSs motif: Terns co-occurrence
Ten terns: ABC, ACD, BCD, BDE, CDE, CEA, DEA, DAB, EAB, EBC.
[Figure: secondary structures A, B, C, D, E with the reference point. Type A: α-helix, l1. Type B: α-helix, l2. Type C: α-helix, l3. Type D: α-helix, l4. Type E: α-helix, l5.]
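The terns listed for the 4 SSs and 5 SSs motifs are all the 3-element subsets of the motif's secondary structures. A short sketch that enumerates them, with the hypothetical helper `terns`:

```python
from itertools import combinations

def terns(labels):
    """All unordered triplets (terns) of the given secondary structure labels."""
    return ["".join(t) for t in combinations(labels, 3)]

four = terns("ABCD")   # 4 SSs motif
five = terns("ABCDE")  # 5 SSs motif
print(len(four), len(five))
```

The vertex order inside each tern may differ from the order used on the slides (for example ABD instead of DAB); the underlying sets are the same.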
PROTEIN 1FNB
[Figure: 3D plot (axes x, y, z) showing the motif RP and the RPs for SSC, SST and MDM.]
The protein contains 22 secondary structures. Searched motif: Greek key (4 β-sheets). The red circles are the helices and the blue circles are the sheets; the motif SSs are in bold.
PROTEIN 7FAB
[Figure: 3D plot (axes x, y, z) showing the motif RP and the RPs for SSC, SST and MDM.]
The protein contains 46 secondary structures. Searched motif: 3 helices and 2 sheets. The red circles are the helices and the blue circles are the sheets; the motif SSs are in bold.
Searching performances
Searching a Greek key motif (4 SSs, all β-sheets) in 1FNB.
Searching a motif with 5 SSs (3 helices and 2 sheets) in 7FAB.
PV Benchmark (20 proteins)
PV Benchmark: basic features
[Plot: number of candidate motifs (logarithmic axis) versus number of secondary structures, for motifs of 3, 4 and 5 SSs.]
PV Benchmark: performances
[Plot: searching time in msec (logarithmic axis) versus number of secondary structures, for SST and SSC with motifs of 3, 4 and 5 SSs.]
Average performances

SSC
Number of proteins   Number of SSs per motif   Total number of motifs   Total searching time (sec)   Average searching time per motif (msec)
20                   3                         105971                   119.882                      1.1
20                   4                         918470                   1275.585                     1.4
20                   5                         6455009                  11261.911                    1.7

SST
Number of proteins   Number of SSs per motif   Total number of motifs   Total searching time (sec)   Average searching time per motif (msec)
20                   3                         105971                   768.508                      7.3
20                   4                         918470                   10303.806                    11.2
20                   5                         6455009                  111809.428                   17.3
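The average searching time per motif is the total searching time divided by the total number of motifs, converted from seconds to milliseconds. The sketch below recomputes it from the totals reported above; the dictionary layout and the helper `average_ms` are illustrative choices.

```python
# (SSs per motif, total number of motifs, total searching time in seconds)
results = {
    "SSC": [(3, 105971, 119.882), (4, 918470, 1275.585), (5, 6455009, 11261.911)],
    "SST": [(3, 105971, 768.508), (4, 918470, 10303.806), (5, 6455009, 111809.428)],
}

def average_ms(total_motifs, total_seconds):
    """Average searching time per motif in milliseconds."""
    return total_seconds * 1000.0 / total_motifs

averages = {
    method: {n_ss: round(average_ms(motifs, seconds), 1)
             for n_ss, motifs, seconds in rows}
    for method, rows in results.items()
}
print(averages)
```

The recomputed values agree with the reported averages to one decimal place.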