Exercises (Multiple sequence alignment, profile search)

8. Using the Clustal Omega program, available among the tools at the EBI website, calculate a multiple alignment of five homologous Cas9 proteins from the following organisms: Streptococcus pyogenes (Accession NP_269215), Streptococcus thermophilus (WP_ ), Staphylococcus aureus (CCK74173), Staphylococcus lugdunensis (WP_ ) and Halalkalibacillus halophilus (WP_ ). What is the length of the longest stretch of amino acid residues completely conserved in all 5 sequences? Give this motif in single-letter code. (NB. The easiest input for Clustal Omega is a FASTA file of sequences, prepared in advance. For this assignment, the default parameters of Clustal Omega are sufficient.)

9. Using the database of protein profiles PROSITE, determine whether the amino acid sequence vkpklplipghegvgvieevgpgvt contains a consensus pattern. Give the motif description of this consensus pattern.

10. Using the database of protein profiles PROSITE, determine the positions of conserved motifs (profiles) in one of the human proteins involved in RNA splicing, 9G8 (Accession NP_ ). Using the resources of the ENTREZ system, determine how many exons are in the 9G8 gene, and which of these exons contain the sequences encoding the motifs found.

11. Using the Gene database of ENTREZ, retrieve the sequences of the shortest and longest isoforms of the human enzyme ADAR adenosine deaminase (GeneID 103). What are the lengths of the amino acid sequences of these isoforms? Using the database of protein profiles PROSITE, determine the positions and the function of a conserved motif contained in the longest isoform but absent in the shortest one.

12. Human gene ADAM15 (Gene ID:8751) contains 24 exons. Several protein isoforms are produced by alternative splicing that may lead to a shift in the translation frame. For instance, a variant encoded by transcript NM_ is annotated as lacking three exons in the coding region compared to the longest isoform (transcript NM_ ). Identify the nucleotide positions of the exons in the longest isoform that are skipped in the shorter one. Which of these three exons causes a translation frameshift? What are the lengths of the C-terminal amino acid sequences that differ between the encoded proteins? Using the database PROSITE, determine what patterns can be located in these distinct C-termini. (NB. In the PROSITE search, in contrast to the default option, do not exclude patterns with a high probability of occurrence.)

8. FASTA file:

>Cas9_S_lugdunensis
MNQKFILGLDIGITSVGYGLIDYETKNIIDAGVRLFPEANVENNEGRRSKRGSRRLKRRRIHRLERVKKL
LEDYNLLDQSQIPQSTNPYAIRVKGLSEALSKDELVIALLHIAKRRGIHKIDVIDSNDDVGNELSTKEQL
NKNSKLLKDKFVCQIQLERMNEGQVRGEKNRFKTADIIKEIIQLLNVQKNFHQLDENFINKYIELVEMRR
EYFEGPGKGSPYGWEGDPKAWYETLMGHCTYFPDELRSVKYAYSADLFNALNDLNNLVIQRDGLSKLEYH
EKYHIIENVFKQKKKPTLKQIANEINVNPEDIKGYRITKSGKPQFTEFKLYHDLKSVLFDQSILENEDVL
DQIAEILTIYQDKDSIKSKLTELDILLNEEDKENIAQLTGYTGTHRLSLKCIRLVLEEQWYSSRNQMEIF
THLNIKPKKINLTAANKIPKAMIDEFILSPVVKRTFGQAINLINKIIEKYGVPEDIIIELARENNSKDKQ
KFINEMQKKNENTRKRINEIIGKYGNQNAKRLVEKIRLHDEQEGKCLYSLESIPLEDLLNNPNHYEVDHI
IPRSVSFDNSYHNKVLVKQSENSKKSNLTPYQYFNSGKSKLSYNQFKQHILNLSKSQDRISKKKKEYLLE
ERDINKFEVQKEFINRNLVDTRYATRELTNYLKAYFSANNMNVKVKTINGSFTDYLRKVWKFKKERNHGY
KHHAEDALIIANADFLFKENKKLKAVNSVLEKPEIESKQLDIQVDSEDNYSEMFIIPKQVQDIKDFRNFK
YSHRVDKKPNRQLINDTLYSTRKKDNSTYIVQTIKDIYAKDNTTLKKQFDKSPEKFLMYQHDPRTFEKLE
VIMKQYANEKNPLAKYHEETGEYLTKYSKKNNGPIVKSLKYIGNKLGSHLDVTHQFKSSTKKLVKLSIKP
YRFDVYLTDKGYKFITISYLDVLKKDNYYYIPEQKYDKLKLGKAIDKNAKFIASFYKNDLIKLDGEIYKI
IGVNSDTRNMIELDLPDIRYKEYCELNNIKGEPRIKKTIGKKVNSIEKLTTDVLGNVFTNTQYTKPQLLF
KRGN
>Cas9_H_halophilus
MKQFENNYTLGLDIGIGSVGWGLVDEDQNIIDSGVRLFPEADVNNNTGRRGFRGARRLLRRRRHRLERIK
MLLSNANLPTNQDKANAEETPYHIRVKGLTEKLSEEELSQALLHLGKRRGIHNVEVAEDETGGNELSTKD
QLNQNAKALKNQYVCEVQLNRLENEGEVRGHRNRFKTSDYVAEARQLLSIQQKYHSKVTDEFIDQYLELI
EKRREYYEGPGFGSEYGWEQDRQKWYEQMMGRCSYYPEELRSVKEAYSAQLFNVLNDLNNLVLTRDEDHK
LSTEEKEELVEKVFKKYKSPKLNKIAKVLELKEDDIKGYRVTSKGTAEFTPLKIYHDLLGITDKKEVLED
EDALDEIAEILTIYQTPSDIKEELEKLDLPLNKVDIESISELSSYSQTHSLSLKLIHQVIPDLWATPKNQ
MQLFTENGIKPKKIKLEGKKYIPFHHLDEWILSPVVKRSFKQSIRIVNEIRKQYGEPKEIVIELARENSS
DDKKNFLKELNKKNRAVNEAVMEKLESKDLEPKKGMFNKLRLWHIQDGLCMYSLKPIQIEDLLSNPTNYE
IDHILPRSVSFDDSQKNKVLVHTEENQKKGNETPYQYLSSGKGHVSYEKYKSHVLQLAKSRDKMPKKKVE
YLLEERDINKYDIQKEFINRNLVDTRYATRGLLTLLTTFFSENNKDVKVKAINGAFTDFLRKTWDFKKDR
GADFKHHAEDALIVAMAGYLFQHQRELKKHNILLTEGKNGEEKTIDKETGEILEEKTFVNSFTERMDKVK
AIKNYPNYKYSHKVDMKPNRQLMNDTLYSTRKVDDKEFVIEKIKDLYDKDQDKLVKQIKKDPTKLLMYHH
DPQTYKKIERAIEQYSDAKNPLHKMYEETGEYLRKYSKKGNGPIIKSVKYYGNSLKEHKDVSHKFNTKDK
KVVNLSLKPFRMDVYEDDGVYKFVTVSYKDLIEEKENYRINNDVYLQKLINKKIANKDGFVFSLYKNDVC
KINGEYFRLIGVNHDEGNRIEMNKINYHYKDYAERNEIKQNRIYKGISKNTNEFIKIHTDVLGNVYFNSV
EKFKSMYQK
>Cas9_S_thermophilus
etc...
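Before submitting a hand-assembled FASTA file, it can help to check that every record parses and has a plausible length. The following is a minimal sketch, not part of the original exercise: the function name `parse_fasta` is hypothetical, and the example input contains only the first 20 residues of two of the sequences listed above.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line  # join wrapped sequence lines
    return records


# Truncated excerpts (first 20 residues only), wrapped over two lines each.
example = """>Cas9_S_lugdunensis
MNQKFILGLD
IGITSVGYGL
>Cas9_H_halophilus
MKQFENNYTL
GLDIGIGSVG
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```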

8. Clustal Omega output:

CLUSTAL O(1.2.1) multiple sequence alignment

Cas9_S_pyogenes ---MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDS-- 55
Cas9_S_thermophilus ----MSDLVLGLDIGIGSVGVGILNKVTGEIIH KNSRIFPAAQ 39
Cas9_H_halophilus MKQFENNYTLGLDIGIGSVGWGLVDE-DQNIID SGVRLFPEAD 42
Cas9_S_lugdunensis ---MNQKFILGLDIGITSVGYGLIDYETKNIID AGVRLFPEAN 40
Cas9_S_aureus ---MKRNYILGLDIGITSVGYGIIDYETRDVID AGVRLFKEAN 40
. :***** ***.::. :*

Cas9_S_pyogenes GETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHER 115
Cas9_S_thermophilus AENNLVRRTNRQGRRLTRRKKHRIVRLNRLFEESGLITDFT
Cas9_H_halophilus VNNNTGRRGFRGARRLLRRRRHRLERIKMLLSNANLPTNQD
Cas9_S_lugdunensis VENNEGRRSKRGSRRLKRRRIHRLERVKKLLEDYNLLDQSQ
Cas9_S_aureus VENNEGRRSKRGARRLKRRRRHRIQRVKKLLFDYNLLTDHS
:. * *.** *:.*: :: :: : :

Cas9_S_pyogenes HPIFGNIVDEVAYHEKYPTIYHLRKKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLN 175
Cas9_S_thermophilus KISINLNPYQLRVK--GLTDELSNEELFIALKNMVKHRGISYLDDAS- 125
Cas9_H_halophilus KANAEETPYHIRVK--GLTEKLSEEELSQALLHLGKRRGIHNVEVAE- 128
Cas9_S_lugdunensis IPQSTNPYAIRVK--GLSEALSKDELVIALLHIAKRRGIHKIDVIDS 126
Cas9_S_aureus ELSGINPYEARVK--GLSQKLSEEEFSAALLHLAKRRGVHNVNEV
* * * ::. : **.: * ** ::

Cas9_S_pyogenes PDNSDVDKLFIQLVQTYNQLFEENPINASGVDAKAILSARLSKSRRLENLIAQLPGEKKN 235
Cas9_S_thermophilus -DDGN------SSVGDYAQIVKENSK QLE---TKTPGQIQL 156
Cas9_H_halophilus DETGG------NELST-KDQLNQNAK ALK---NQYVCEVQL 159
Cas9_S_lugdunensis NDDVG------NELST-KEQLNKNSK LLK---DKFVCQIQL 157
Cas9_S_aureus EEDTG------NELST-KEQISRNSK ALE---EKYVAELQL 155
:. : :...* *: : : :

... (complete alignment not shown)
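The longest fully conserved stretch corresponds to the longest run of `*` in the Clustal consensus line. It can also be computed directly from the aligned rows. The sketch below is an illustration only: the function name `longest_conserved_run` is hypothetical, and the input is restricted to the first 20 alignment columns shown above, so its result applies to that excerpt and not necessarily to the complete alignment.

```python
def longest_conserved_run(rows):
    """Return the longest stretch of columns where all rows share one residue."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    best = current = ""
    for column in zip(*rows):
        residue = column[0]
        # A column counts as conserved only if it is gap-free and identical.
        if residue != "-" and all(c == residue for c in column):
            current += residue
            if len(current) > len(best):
                best = current
        else:
            current = ""
    return best


# First 20 columns of the alignment excerpt (one row per species).
excerpt = [
    "---MDKKYSIGLDIGTNSVG",  # Cas9_S_pyogenes
    "----MSDLVLGLDIGIGSVG",  # Cas9_S_thermophilus
    "MKQFENNYTLGLDIGIGSVG",  # Cas9_H_halophilus
    "---MNQKFILGLDIGITSVG",  # Cas9_S_lugdunensis
    "---MKRNYILGLDIGITSVG",  # Cas9_S_aureus
]

motif = longest_conserved_run(excerpt)
print(motif, len(motif))
```

To answer the exercise, the same function would have to be applied to the full-length aligned rows exported from Clustal Omega.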

9. Using the database of protein profiles PROSITE, determine whether the amino acid sequence vkpklplipghegvgvieevgpgvt contains some consensus pattern. Give the motif description of this consensus pattern.
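PROSITE consensus patterns use a compact syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats) that maps closely onto regular expressions. The sketch below shows how such a pattern can be scanned against the query locally. The helper `prosite_to_regex` is hypothetical, it does not handle the `<` and `>` anchors, and the pattern `G-H-E-x(2)-G-{AP}` is a toy pattern invented for illustration; it is not claimed to be an actual PROSITE entry. The real hit and its description must be taken from the PROSITE scan.

```python
import re

# One pattern element: x, a single residue, [allowed] or {excluded},
# optionally followed by a repeat count (n) or (n,m).
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (without anchors) into a regex string."""
    pieces = []
    for element in pattern.split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        pieces.append(core)
    return "".join(pieces)


query = "vkpklplipghegvgvieevgpgvt".upper()
toy_pattern = "G-H-E-x(2)-G-{AP}"   # hypothetical pattern, for illustration only
regex = prosite_to_regex(toy_pattern)
hit = re.search(regex, query)
start, end = hit.start() + 1, hit.end()   # 1-based, inclusive coordinates
print(regex, start, end, hit.group())
```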

10. Input sequence (FASTA):

>9G8
MSRYGRYGGETKVYVGNLGTGAGKGELERAFSYYGPLRTVWIARNPPGFAFVEFEDPRDAEDAVRGLDGK
VICGSRVRVELSTGMPRRSRFDRPPARRPFDPNDRCYECGEKGHYAYDCHRYSRRRRSRSRSRSHSRSRG
RRYSRSRSRSRGRRSRSASPRRSRSISLRRSRSASLRRSRSGSIKGSRYFQSPSRSRSRSRSISRPRSSR
SKSRSPSPKRSRSPSGSPRRSASPERMD

10. In the datafields of NP_ :

CDS
/gene="SRSF7"
/gene_synonym="9G8; AAG3; SFRS7"
/coded_by="NM_ : "
/note="isoform 1 is encoded by transcript variant 1"
/db_xref="CCDS:CCDS "
/db_xref="GeneID:6432"

(Scroll down)

In the datafields of NM_ : features:
exon
CDS

10. Arithmetic of coding sequences:

RRM motif: x10 = x84 = 490 => coded by exons 2 & 3
ZF_CCHC: x104= x119= 595 => coded by exon 3

Or BLAST 2 sequences:

RRM motif: tblastn NP_ pos vs. NM_ : pos : exons 2 & 3
ZF_CCHC: tblastn NP_ pos vs. NM_ : pos : exon 3

In the datafields of NM_ : features:
exon
CDS
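The arithmetic above converts amino acid positions of a motif into nucleotide positions on the mRNA, which are then compared with the exon boundaries. Part of the calculation was lost in transcription, so the sketch below rests on an assumption: that the conversion has the form offset + 3 x (amino acid position). Under that assumption, both surviving results (490 for position 84 and 595 for position 119) imply the same offset, which is then applied to the multipliers 10 and 104. The offset and the two start coordinates are inferred here, not quoted from the source, and `aa_to_nt` is a hypothetical helper name.

```python
def aa_to_nt(offset, aa_position):
    """Map an amino acid position to an mRNA coordinate (assumed linear form)."""
    return offset + 3 * aa_position


# Infer the offset from the two results that survive in the transcript.
offset_from_rrm = 490 - 3 * 84    # RRM motif, multiplier x84
offset_from_zf = 595 - 3 * 119    # ZF_CCHC motif, multiplier x119
assert offset_from_rrm == offset_from_zf  # both surviving results agree
offset = offset_from_rrm

# Apply the inferred offset to the other multipliers shown (x10, x104).
rrm_span = (aa_to_nt(offset, 10), aa_to_nt(offset, 84))
zf_span = (aa_to_nt(offset, 104), aa_to_nt(offset, 119))
print("offset:", offset)
print("RRM motif nucleotides:", rrm_span)
print("ZF_CCHC nucleotides:", zf_span)
```

Assigning these spans to exons still requires the exon coordinates from the feature table of the NM_ record, which are not reproduced in the transcript.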

11. Gene -> mRNA and Proteins -> NP_ (1226 amino acids) is the longest, NP_ (931 aa) is the shortest.

Scan Prosite with 931 aa:

PS50139 DRADA_REPEAT DRADA repeat profile :
1-65: score =
MAEIKEKICDYLFNVSDSSALNLAKNIGLTK-ARDINAVLIDMERQGDVYRQGTTPP
IWHLTDKKR

PS50137 DS_RBD Double stranded RNA-binding domain (dsrbd) profile :
: score =
NPISGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQDAA
MKAMTILLE
: score =
SPVTTLLECMHKLGNSCEFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAA
EEAMKALHG
: score =
NPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFPAVCAHSKKQGKQEAA
DAALRVLIG

PS50141 A_DEAMIN_EDITASE Adenosine to inosine editase domain profile :
: score =
SLGTGNRCVKGDSLSLKGETVNDCHAEIISRRGFIRFLYSELMKYNSQTAKDSIFEPAKG
GEKLQIKKTVSFHLYISTAPCGDGALFDksCSDRAMESTESRHYPVFENPKQGKLRTKVE
NGEGTIPVESSDIVPTWDGIRLGERLRTMSCSDKILRWNVLGLQGALLTHFLQPIYLKSV
TLGYLFSQGHLTRAICCRVTRdgsAFEDGLRHPFIVNHPKVGRVSIYDSKRQSGKTKETS
VNWCLADGyDLEILDGTRGTVDGPRNELSRVSKKNIFLLFKKLCSFRYRRDLLRLSYGEA
KKAARDYETAKNYFKKGLKDMGYGNWISKPQEEKNF

Scan Prosite with 1226 aa:

PS50139 DRADA_REPEAT DRADA repeat profile :
: score =
LSIYQDQEQRILKFLEELGEgKATTAHDLSGKLGTPKKEINRVLYSLAKKGKLQKEAGTP
PLWKIAVSTQ
: score =
FLDMAEIKEKICDYLFNVSDSSALNLAKNIGLTK-ARDINAVLIDMERQGDVYRQGTTPP
IWHLTDKKR

PS50137 DS_RBD Double stranded RNA-binding domain (dsrbd) profile :
: score =
NPISGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQDAA
MKAMTILLE
: score =
SPVTTLLECMHKLGNSCEFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAA
EEAMKALHG
: score =
NPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFPAVCAHSKKQGKQEAA
DAALRVLIG

PS50141 A_DEAMIN_EDITASE Adenosine to inosine editase domain profile :
: score =
SLGTGNRCVKGDSLSLKGETVNDCHAEIISRRGFIRFLYSELMKYNSQTAKDSIFEPAKG
GEKLQIKKTVSFHLYISTAPCGDGALFDksCSDRAMESTESRHYPVFENPKQGKLRTKVE
NGEGTIPVESSDIVPTWDGIRLGERLRTMSCSDKILRWNVLGLQGALLTHFLQPIYLKSV
TLGYLFSQGHLTRAICCRVTRdgsAFEDGLRHPFIVNHPKVGRVSIYDSKRQSGKTKETS
VNWCLADGyDLEILDGTRGTVDGPRNELSRVSKKNIFLLFKKLCSFRYRRDLLRLSYGEA
KKAARDYETAKNYFKKGLKDMGYGNWISKPQEEKNF

From Prosite description: DRADA repeats are exclusively present in the double stranded-specific adenosine deaminase (DRADA) family. This enzyme deaminates multiple adenosines to inosines by a hydrolytic deamination reaction only on double-stranded RNA...

12. Three missing exons are seen in the graphic output:

Alignment (e.g. BLAST 2 sequences) NM_ vs. NM_ : nucleotide positions of the missing exons are identified. Parts of the alignment:

Query 2281 CCAGCGACTCTGCCAGCTCAAGGGACCCACCTGCCAGTACAG 2322 (NM_ )
Sbjct 2281 CCAGCGACTCTGCCAGCTCAAGGGACCCACCTGCCAGTACAG 2322 (NM_ )

Query 2538 AGTCTCAGGGGCCAGCCAAGCCCCCACCCCCAAGGAAGCCACTGCCTGCCGACCCCCAGG 2597
Sbjct 2321 AGTCTCAGGGGCCAGCCAAGCCCCCACCCCCAAGGAAGCCACTGCCTGCCGACCCCCAGG 2380

In the annotation of NM_ : features:
exon => 70 nucleotides
exon => 75 nucleotides
exon => 72 nucleotides

The 70-nt exon deletion causes a translation frameshift (70 ≠ 3n).
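The frameshift argument is a divisibility check: skipping an exon keeps the reading frame only if its length is a multiple of 3. The short sketch below applies that check to the three exon lengths given above, and also confirms that their sum equals the coordinate shift between the two transcripts in the second alignment block (Query 2538 vs. Sbjct 2321). The variable names are illustrative only.

```python
# Lengths of the three skipped exons, as listed in the annotation above.
exon_lengths = [70, 75, 72]

# An exon whose length is not a multiple of 3 shifts the reading frame when skipped.
frameshifting = [n for n in exon_lengths if n % 3 != 0]

# Coordinate shift between the transcripts downstream of the skipped region,
# taken from the start positions of the second alignment block.
coordinate_shift = 2538 - 2321
total_skipped = sum(exon_lengths)

print("frameshifting exon lengths:", frameshifting)
print("total skipped:", total_skipped, "coordinate shift:", coordinate_shift)
```

Because the total of 217 nt is itself not a multiple of 3, the frame stays shifted downstream of the skipped exons, which is why the two proteins end in different C-terminal sequences.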

12. Proteins encoded by long and short isoforms:

NP_ aa
1 mrlallwalg llgagsplps wplpniggte eqqaesekap replepqvlq ddlpislkkv
61 lqtslpeplr ikleldgdsh ilellqnrel vpgrptlvwy qpdgtrvvse ghtlenccyq
121 grvrgyagsw vsictcsglr glvvltpers ytleqgpgdl qgppiisriq dlhlpghtca
181 lswresvhtq kppehplgqr hirrrrdvvt etktvelviv adhseaqkyr dfqhllnrtl
241 evallldtff rplnvrvalv gleawtqrdl veispnpavt lenflhwrra hllprlphds
301 aqlvtgtsfs gptvgmaiqn sicspdfsgg vnmdhstsil gvassiahel ghslgldhdl
361 pgnscpcpgp apaktcimea stdflpglnf sncsrralek alldgmgscl ferlpslppm
421 aafcgnmfve pgeqcdcgfl ddcvdpccds ltcqlrpgaq casdgpccqn cqlrpsgwqc
481 rptrgdcdlp efcpgdssqc ppdvslgdge pcaggqavcm hgrcasyaqq cqslwgpgaq
541 paaplclqta ntrgnafgsc grnpsgsyvs ctprdaicgq lqcqtgrtqp llgsirdllw
601 etidvngtel ncswvhldlg sdvaqplltl pgtacgpglv cidhrcqrvd llgaqecrsk
661 chghgvcdsn rhcyceegwa ppdcttqlka tsslttglll sllvllvlvm lgasywyrar
721 lhqrlcqlkg ptcqyslrgq psphpqgshc lptpragahr vtcpaqgles rp
=> 37 amino acids

NP_ aa
1 mrlallwalg llgagsplps wplpniggte eqqaesekap replepqvlq ddlpislkkv
61 lqtslpeplr ikleldgdsh ilellqnrel vpgrptlvwy qpdgtrvvse ghtlenccyq
121 grvrgyagsw vsictcsglr glvvltpers ytleqgpgdl qgppiisriq dlhlpghtca
181 lswresvhtq kppehplgqr hirrrrdvvt etktvelviv adhseaqkyr dfqhllnrtl
241 evallldtff rplnvrvalv gleawtqrdl veispnpavt lenflhwrra hllprlphds
301 aqlvtgtsfs gptvgmaiqn sicspdfsgg vnmdhstsil gvassiahel ghslgldhdl
361 pgnscpcpgp apaktcimea stdflpglnf sncsrralek alldgmgscl ferlpslppm
421 aafcgnmfve pgeqcdcgfl ddcvdpccds ltcqlrpgaq casdgpccqn cqlrpsgwqc
481 rptrgdcdlp efcpgdssqc ppdvslgdge pcaggqavcm hgrcasyaqq cqslwgpgaq
541 paaplclqta ntrgnafgsc grnpsgsyvs ctprdaicgq lqcqtgrtqp llgsirdllw
601 etidvngtel ncswvhldlg sdvaqplltl pgtacgpglv cidhrcqrvd llgaqecrsk
661 chghgvcdsn rhcyceegwa ppdcttqlka tsslttglll sllvllvlvm lgasywyrar
721 lhqrlcqlkg ptcqyraaqs gpserpgppq rallargtkq asalsfpapp srplppdpvs
781 krlqaeladr pnpptrplpa dpvvrspksq gpakpppprk plpadpqgrc psgdlpgpga
841 gipplvvpsr papppptvss lyl
=> 128 amino acids
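The two tail lengths can be checked by finding where the protein sequences first differ. The sketch below is a rough illustration that uses only the residues from position 721 onward, copied from the listings above, on the assumption (taken from those listings) that residues 1-720 are identical in both proteins. The helper name `common_prefix_length` is hypothetical.

```python
def common_prefix_length(a, b):
    """Return the number of leading positions at which a and b are identical."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Residues 721 to the end of each protein; spaces are GenBank-style block separators.
short_tail = "lhqrlcqlkg ptcqyslrgq psphpqgshc lptpragahr vtcpaqgles rp".replace(" ", "")
long_tail = (
    "lhqrlcqlkg ptcqyraaqs gpserpgppq rallargtkq asalsfpapp srplppdpvs "
    "krlqaeladr pnpptrplpa dpvvrspksq gpakpppprk plpadpqgrc psgdlpgpga "
    "gipplvvpsr papppptvss lyl"
).replace(" ", "")

shared = common_prefix_length(short_tail, long_tail)
last_common_position = 720 + shared          # position in the full-length proteins
short_distinct = len(short_tail) - shared    # distinct C-terminus, shorter protein
long_distinct = len(long_tail) - shared      # distinct C-terminus, longer protein

print(last_common_position, short_distinct, long_distinct)
```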

12. Patterns:

Longer C-terminus (128 aa)
Shorter C-terminus (37 aa)