Exercises (Multiple sequence alignment, profile search)

8. Using the Clustal Omega program, available among the tools at the EBI website (http://www.ebi.ac.uk/tools/msa/clustalo/), calculate a multiple alignment of five homologous Cas9 proteins from the following organisms: Streptococcus pyogenes (Accession NP_269215), Streptococcus thermophilus (WP_011225725), Staphylococcus aureus (CCK74173), Staphylococcus lugdunensis (WP_002460848) and Halalkalibacillus halophilus (WP_035512507). What is the length of the longest stretch of amino acid residues completely conserved in all 5 sequences? Give this motif in single letter code. (NB. The easiest input for Clustal Omega is a FASTA file of sequences, prepared in advance. For this assignment, the default parameters of Clustal Omega are sufficient.)

9. Using the database of protein profiles PROSITE (http://prosite.expasy.org/), determine whether the amino acid sequence vkpklplipghegvgvieevgpgvt contains a consensus pattern. Give the motif description of this consensus pattern.

10. Using the database of protein profiles PROSITE (http://prosite.expasy.org/), determine the positions of conserved motifs (profiles) in one of the human proteins involved in RNA splicing, 9G8 (Accession NP_001026854). Using the resources of the ENTREZ system, determine how many exons are in the 9G8 gene, and which of these exons contain the sequences encoding the motifs found.

11. Using the Gene database of ENTREZ, retrieve the sequences of the shortest and longest isoforms of the human enzyme ADAR adenosine deaminase (GeneID 103). What are the lengths of the amino acid sequences of these isoforms? Using the database of protein profiles PROSITE (http://prosite.expasy.org/), determine the positions and the function of a conserved motif contained in the longest isoform, but absent in the shortest one.

12. Human gene ADAM15 (Gene ID:8751) contains 24 exons. Several protein isoforms are produced by alternative splicing that may lead to a shift in the translation frame. For instance, a variant encoded by transcript NM_207191.2 is annotated as lacking three exons in the coding region compared to the longest isoform (transcript NM_207197.2). Identify the nucleotide positions of the exons in the longest isoform that are skipped in the shorter one. Which of these three exons causes a translation frameshift? What are the lengths of the C-terminal amino acid sequences that differ in the encoded proteins? Using the database PROSITE (http://prosite.expasy.org/), determine what patterns can be located in these distinct C-termini (NB. in the PROSITE search, in contrast to the default option, do not exclude patterns with a high probability of occurrence).

8. FASTA file:

>Cas9_S_lugdunensis
MNQKFILGLDIGITSVGYGLIDYETKNIIDAGVRLFPEANVENNEGRRSKRGSRRLKRRRIHRLERVKKL
LEDYNLLDQSQIPQSTNPYAIRVKGLSEALSKDELVIALLHIAKRRGIHKIDVIDSNDDVGNELSTKEQL
NKNSKLLKDKFVCQIQLERMNEGQVRGEKNRFKTADIIKEIIQLLNVQKNFHQLDENFINKYIELVEMRR
EYFEGPGKGSPYGWEGDPKAWYETLMGHCTYFPDELRSVKYAYSADLFNALNDLNNLVIQRDGLSKLEYH
EKYHIIENVFKQKKKPTLKQIANEINVNPEDIKGYRITKSGKPQFTEFKLYHDLKSVLFDQSILENEDVL
DQIAEILTIYQDKDSIKSKLTELDILLNEEDKENIAQLTGYTGTHRLSLKCIRLVLEEQWYSSRNQMEIF
THLNIKPKKINLTAANKIPKAMIDEFILSPVVKRTFGQAINLINKIIEKYGVPEDIIIELARENNSKDKQ
KFINEMQKKNENTRKRINEIIGKYGNQNAKRLVEKIRLHDEQEGKCLYSLESIPLEDLLNNPNHYEVDHI
IPRSVSFDNSYHNKVLVKQSENSKKSNLTPYQYFNSGKSKLSYNQFKQHILNLSKSQDRISKKKKEYLLE
ERDINKFEVQKEFINRNLVDTRYATRELTNYLKAYFSANNMNVKVKTINGSFTDYLRKVWKFKKERNHGY
KHHAEDALIIANADFLFKENKKLKAVNSVLEKPEIESKQLDIQVDSEDNYSEMFIIPKQVQDIKDFRNFK
YSHRVDKKPNRQLINDTLYSTRKKDNSTYIVQTIKDIYAKDNTTLKKQFDKSPEKFLMYQHDPRTFEKLE
VIMKQYANEKNPLAKYHEETGEYLTKYSKKNNGPIVKSLKYIGNKLGSHLDVTHQFKSSTKKLVKLSIKP
YRFDVYLTDKGYKFITISYLDVLKKDNYYYIPEQKYDKLKLGKAIDKNAKFIASFYKNDLIKLDGEIYKI
IGVNSDTRNMIELDLPDIRYKEYCELNNIKGEPRIKKTIGKKVNSIEKLTTDVLGNVFTNTQYTKPQLLF
KRGN
>Cas9_H_halophilus
MKQFENNYTLGLDIGIGSVGWGLVDEDQNIIDSGVRLFPEADVNNNTGRRGFRGARRLLRRRRHRLERIK
MLLSNANLPTNQDKANAEETPYHIRVKGLTEKLSEEELSQALLHLGKRRGIHNVEVAEDETGGNELSTKD
QLNQNAKALKNQYVCEVQLNRLENEGEVRGHRNRFKTSDYVAEARQLLSIQQKYHSKVTDEFIDQYLELI
EKRREYYEGPGFGSEYGWEQDRQKWYEQMMGRCSYYPEELRSVKEAYSAQLFNVLNDLNNLVLTRDEDHK
LSTEEKEELVEKVFKKYKSPKLNKIAKVLELKEDDIKGYRVTSKGTAEFTPLKIYHDLLGITDKKEVLED
EDALDEIAEILTIYQTPSDIKEELEKLDLPLNKVDIESISELSSYSQTHSLSLKLIHQVIPDLWATPKNQ
MQLFTENGIKPKKIKLEGKKYIPFHHLDEWILSPVVKRSFKQSIRIVNEIRKQYGEPKEIVIELARENSS
DDKKNFLKELNKKNRAVNEAVMEKLESKDLEPKKGMFNKLRLWHIQDGLCMYSLKPIQIEDLLSNPTNYE
IDHILPRSVSFDDSQKNKVLVHTEENQKKGNETPYQYLSSGKGHVSYEKYKSHVLQLAKSRDKMPKKKVE
YLLEERDINKYDIQKEFINRNLVDTRYATRGLLTLLTTFFSENNKDVKVKAINGAFTDFLRKTWDFKKDR
GADFKHHAEDALIVAMAGYLFQHQRELKKHNILLTEGKNGEEKTIDKETGEILEEKTFVNSFTERMDKVK
AIKNYPNYKYSHKVDMKPNRQLMNDTLYSTRKVDDKEFVIEKIKDLYDKDQDKLVKQIKKDPTKLLMYHH
DPQTYKKIERAIEQYSDAKNPLHKMYEETGEYLRKYSKKGNGPIIKSVKYYGNSLKEHKDVSHKFNTKDK
KVVNLSLKPFRMDVYEDDGVYKFVTVSYKDLIEEKENYRINNDVYLQKLINKKIANKDGFVFSLYKNDVC
KINGEYFRLIGVNHDEGNRIEMNKINYHYKDYAERNEIKQNRIYKGISKNTNEFIKIHTDVLGNVYFNSV
EKFKSMYQK
>Cas9_S_thermophilus
etc...
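The FASTA input can also be prepared programmatically. The following is a minimal Python sketch, assuming the five accessions from exercise 8 are fetched through the NCBI E-utilities `efetch` endpoint; the helper names `efetch_fasta_url` and `download_fasta` and the output path are hypothetical, and the sequences can equally well be copied by hand from the NCBI Protein pages.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities endpoint for fetching records (assumption: used here
# instead of copying sequences manually from the Protein database pages).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accessions listed in exercise 8.
ACCESSIONS = ["NP_269215", "WP_011225725", "CCK74173",
              "WP_002460848", "WP_035512507"]


def efetch_fasta_url(accessions):
    """Build an efetch URL that returns the given protein records as FASTA text."""
    query = urlencode({
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"


def download_fasta(accessions, path):
    """Download the records and write them to a FASTA file (needs network access)."""
    with urlopen(efetch_fasta_url(accessions)) as response, open(path, "wb") as out:
        out.write(response.read())


url = efetch_fasta_url(ACCESSIONS)
print(url)
```

Calling `download_fasta(ACCESSIONS, "cas9.fasta")` would write a file that can be uploaded to Clustal Omega; the FASTA headers will then be the NCBI definition lines rather than the short names used above.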

8. CLUSTAL O(1.2.1) multiple sequence alignment

Cas9_S_pyogenes      ---MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDS-- 55
Cas9_S_thermophilus  ----MSDLVLGLDIGIGSVGVGILNKVTGEIIH-----------------KNSRIFPAAQ 39
Cas9_H_halophilus    MKQFENNYTLGLDIGIGSVGWGLVDE-DQNIID-----------------SGVRLFPEAD 42
Cas9_S_lugdunensis   ---MNQKFILGLDIGITSVGYGLIDYETKNIID-----------------AGVRLFPEAN 40
Cas9_S_aureus        ---MKRNYILGLDIGITSVGYGIIDYETRDVID-----------------AGVRLFKEAN 40
                     . :***** ***.::. :*

Cas9_S_pyogenes      GETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHER 115
Cas9_S_thermophilus  AENNLVRRTNRQGRRLTRRKKHRIVRLNRLFEESGLITDFT------------------- 80
Cas9_H_halophilus    VNNNTGRRGFRGARRLLRRRRHRLERIKMLLSNANLPTNQD------------------- 83
Cas9_S_lugdunensis   VENNEGRRSKRGSRRLKRRRIHRLERVKKLLEDYNLLDQSQ------------------- 81
Cas9_S_aureus        VENNEGRRSKRGARRLKRRRRHRIQRVKKLLFDYNLLTDHS------------------- 81
                     :. * *.** *:.*: :: :: : :

Cas9_S_pyogenes      HPIFGNIVDEVAYHEKYPTIYHLRKKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLN 175
Cas9_S_thermophilus  ------------KISINLNPYQLRVK--GLTDELSNEELFIALKNMVKHRGISYLDDAS- 125
Cas9_H_halophilus    ------------KANAEETPYHIRVK--GLTEKLSEEELSQALLHLGKRRGIHNVEVAE- 128
Cas9_S_lugdunensis   -------------IPQSTNPYAIRVK--GLSEALSKDELVIALLHIAKRRGIHKIDVIDS 126
Cas9_S_aureus        -------------ELSGINPYEARVK--GLSQKLSEEEFSAALLHLAKRRGVHNVNEV-- 124
                     . * * * ::. : **.: * ** ::

Cas9_S_pyogenes      PDNSDVDKLFIQLVQTYNQLFEENPINASGVDAKAILSARLSKSRRLENLIAQLPGEKKN 235
Cas9_S_thermophilus  -DDGN------SSVGDYAQIVKENSK-------------------QLE---TKTPGQIQL 156
Cas9_H_halophilus    DETGG------NELST-KDQLNQNAK-------------------ALK---NQYVCEVQL 159
Cas9_S_lugdunensis   NDDVG------NELST-KEQLNKNSK-------------------LLK---DKFVCQIQL 157
Cas9_S_aureus        EEDTG------NELST-KEQISRNSK-------------------ALE---EKYVAELQL 155
                     :. : :...* *: : : :...

(complete alignment not shown)
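Once the alignment is available, the longest completely conserved stretch can be found by scanning the alignment column by column. Below is a minimal Python sketch, assuming the aligned rows are available as equal-length strings; the function name `longest_conserved_run` is hypothetical. The input is only the first 25 columns of the first alignment block shown above, so it illustrates the method and does not settle the answer for the complete alignment, which may contain a longer stretch.

```python
def longest_conserved_run(rows):
    """Return (1-based start column, motif) of the longest run of columns
    that are gap-free and identical in every aligned row."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for col, residues in enumerate(zip(*rows)):
        conserved = residues[0] != "-" and len(set(residues)) == 1
        if conserved:
            if run_len == 0:
                run_start = col          # a new conserved run begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0                  # run interrupted by a mismatch or gap
    return best_start + 1, rows[0][best_start:best_start + best_len]


# First 25 alignment columns of the first block (order: S. pyogenes,
# S. thermophilus, H. halophilus, S. lugdunensis, S. aureus).
excerpt = [
    "---MDKKYSIGLDIGTNSVGWAVIT",
    "----MSDLVLGLDIGIGSVGVGILN",
    "MKQFENNYTLGLDIGIGSVGWGLVD",
    "---MNQKFILGLDIGITSVGYGLID",
    "---MKRNYILGLDIGITSVGYGIID",
]

start, motif = longest_conserved_run(excerpt)
print(start, motif, len(motif))
```

For the full exercise, the same function would be applied to the complete aligned sequences, for example after joining the blocks of each sequence from the Clustal Omega output.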

10. Protein sequence of 9G8 (NP_001026854):

>9G8
MSRYGRYGGETKVYVGNLGTGAGKGELERAFSYYGPLRTVWIARNPPGFAFVEFEDPRDAEDAVRGLDGK
VICGSRVRVELSTGMPRRSRFDRPPARRPFDPNDRCYECGEKGHYAYDCHRYSRRRRSRSRSRSHSRSRG
RRYSRSRSRSRGRRSRSASPRRSRSISLRRSRSASLRRSRSGSIKGSRYFQSPSRSRSRSRSISRPRSSR
SKSRSPSPKRSRSPSGSPRRSASPERMD

10. In the datafields of NP_001026854:

CDS     1..238
        /gene="SRSF7"
        /gene_synonym="9G8; AAG3; SFRS7"
        /coded_by="NM_001031684.2:239..955"
        /note="isoform 1 is encoded by transcript variant 1"
        /db_xref="CCDS:CCDS33183.1"
        /db_xref="GeneID:6432"

(Scroll down) In the datafields of NM_001031684, features:

exon    position
1       1..266
2       267..447
3       448..624
4       625..699
5       700..810
6       811..864
7       865..900
8       901..2489

CDS     239..955

10. Arithmetic of coding sequences (the CDS starts at nucleotide 239 of NM_001031684, so residue n is encoded by nucleotides 239 + 3(n-1) to 238 + 3n):

RRM motif (residues 11-84): 239 + 3x10 = 269; 238 + 3x84 = 490 => coded by exons 2 & 3
ZF_CCHC (residues 105-119): 239 + 3x104 = 551; 238 + 3x119 = 595 => coded by exon 3

Or BLAST 2 sequences:

RRM motif: tblastn NP_001026854 pos. 11-84 vs. NM_001031684: pos. 269-490: exons 2 & 3
ZF_CCHC: tblastn NP_001026854 pos. 105-119 vs. NM_001031684: pos. 551-595: exon 3
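The coordinate arithmetic for exercise 10 can be checked with a short script. This is a minimal Python sketch using the exon boundaries and CDS start quoted from the NM_001031684 annotation in the solution; the function names `residues_to_nucleotides` and `exons_covering` are hypothetical helpers introduced for illustration.

```python
# Exon boundaries (1-based, inclusive) and CDS start of NM_001031684,
# as quoted in the solution to exercise 10.
EXONS = {
    1: (1, 266), 2: (267, 447), 3: (448, 624), 4: (625, 699),
    5: (700, 810), 6: (811, 864), 7: (865, 900), 8: (901, 2489),
}
CDS_START = 239


def residues_to_nucleotides(first_aa, last_aa, cds_start=CDS_START):
    """Map a protein residue range to mRNA coordinates.

    Residue n occupies nucleotides cds_start + 3(n-1) .. cds_start - 1 + 3n.
    """
    start = cds_start + 3 * (first_aa - 1)
    end = cds_start - 1 + 3 * last_aa
    return start, end


def exons_covering(start, end, exons=EXONS):
    """Return the numbers of all exons that overlap the range start..end."""
    return [n for n, (lo, hi) in exons.items() if lo <= end and start <= hi]


rrm = residues_to_nucleotides(11, 84)       # RRM motif
zf = residues_to_nucleotides(105, 119)      # ZF_CCHC motif

print("RRM:", rrm, "exons", exons_covering(*rrm))
print("ZF_CCHC:", zf, "exons", exons_covering(*zf))
```

The results agree with both the hand calculation and the tblastn coordinates: the RRM motif spans the exon 2/3 boundary, while ZF_CCHC lies entirely within exon 3.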

11. Gene -> mRNA and Proteins -> NP_001102.2 (1226 amino acids) is the longest, NP_001020278.1 (931 aa) is the shortest.

Scan Prosite with 931 aa:

PS50139 DRADA_REPEAT DRADA repeat profile:
1-65: score = 26.922
---MAEIKEKICDYLFNVSDSSALNLAKNIGLTK-ARDINAVLIDMERQGDVYRQGTTPP
IWHLTDKKR

PS50137 DS_RBD Double stranded RNA-binding domain (dsrbd) profile:
208-276: score = 21.190
NPISGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQDAA
MKAMTILLE
319-387: score = 19.224
SPVTTLLECMHKLGNSCEFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAA
EEAMKALHG
431-499: score = 19.564
NPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFPAVCAHSKKQGKQEAA
DAALRVLIG

PS50141 A_DEAMIN_EDITASE Adenosine to inosine editase domain profile:
591-926: score = 128.079
SLGTGNRCVKGDSLSLKGETVNDCHAEIISRRGFIRFLYSELMKYNSQTAKDSIFEPAKG
GEKLQIKKTVSFHLYISTAPCGDGALFDksCSDRAMESTESRHYPVFENPKQGKLRTKVE
NGEGTIPVESSDIVPTWDGIRLGERLRTMSCSDKILRWNVLGLQGALLTHFLQPIYLKSV
TLGYLFSQGHLTRAICCRVTRdgsAFEDGLRHPFIVNHPKVGRVSIYDSKRQSGKTKETS
VNWCLADGyDLEILDGTRGTVDGPRNELSRVSKKNIFLLFKKLCSFRYRRDLLRLSYGEA
KKAARDYETAKNYFKKGLKDMGYGNWISKPQEEKNF

Scan Prosite with 1226 aa:

PS50139 DRADA_REPEAT DRADA repeat profile:
133-202: score = 27.774
LSIYQDQEQRILKFLEELGEgKATTAHDLSGKLGTPKKEINRVLYSLAKKGKLQKEAGTP
PLWKIAVSTQ
293-360: score = 27.929
FLDMAEIKEKICDYLFNVSDSSALNLAKNIGLTK-ARDINAVLIDMERQGDVYRQGTTPP
IWHLTDKKR

PS50137 DS_RBD Double stranded RNA-binding domain (dsrbd) profile:
503-571: score = 21.190
NPISGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQDAA
MKAMTILLE
614-682: score = 19.224
SPVTTLLECMHKLGNSCEFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAA
EEAMKALHG
726-794: score = 19.564
NPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFPAVCAHSKKQGKQEAA
DAALRVLIG

PS50141 A_DEAMIN_EDITASE Adenosine to inosine editase domain profile:
886-1221: score = 128.079
SLGTGNRCVKGDSLSLKGETVNDCHAEIISRRGFIRFLYSELMKYNSQTAKDSIFEPAKG
GEKLQIKKTVSFHLYISTAPCGDGALFDksCSDRAMESTESRHYPVFENPKQGKLRTKVE
NGEGTIPVESSDIVPTWDGIRLGERLRTMSCSDKILRWNVLGLQGALLTHFLQPIYLKSV
TLGYLFSQGHLTRAICCRVTRdgsAFEDGLRHPFIVNHPKVGRVSIYDSKRQSGKTKETS
VNWCLADGyDLEILDGTRGTVDGPRNELSRVSKKNIFLLFKKLCSFRYRRDLLRLSYGEA
KKAARDYETAKNYFKKGLKDMGYGNWISKPQEEKNF

The longest isoform contains two DRADA repeats (133-202 and 293-360); the shortest isoform retains only one (1-65) and lacks the repeat at positions 133-202 of the longest isoform.

From Prosite description: DRADA repeats are exclusively present in the double stranded-specific adenosine deaminase (DRADA) family. This enzyme deaminates multiple adenosines to inosines by a hydrolytic deamination reaction only on double-stranded RNA...
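The comparison of the two ScanProsite results can be made explicit by shifting the coordinates of the short isoform. This is a minimal Python sketch, assuming that the two isoforms differ only at the N-terminus, so that the offset equals the length difference (1226 - 931 = 295); that assumption is suggested by the hit coordinates listed above but is not stated in the exercise, and the function name `hits_missing_from_short` is hypothetical.

```python
LONG_LEN, SHORT_LEN = 1226, 931

# (profile, start, end) hits copied from the ScanProsite results above.
short_hits = [
    ("DRADA_REPEAT", 1, 65),
    ("DS_RBD", 208, 276), ("DS_RBD", 319, 387), ("DS_RBD", 431, 499),
    ("A_DEAMIN_EDITASE", 591, 926),
]
long_hits = [
    ("DRADA_REPEAT", 133, 202), ("DRADA_REPEAT", 293, 360),
    ("DS_RBD", 503, 571), ("DS_RBD", 614, 682), ("DS_RBD", 726, 794),
    ("A_DEAMIN_EDITASE", 886, 1221),
]


def hits_missing_from_short(long_hits, short_hits, offset):
    """Return long-isoform hits with no counterpart in the short isoform.

    Hits are matched on profile name and end coordinate, because the start
    of a match truncated at the N-terminus can differ by a few residues.
    """
    shifted_ends = {(name, end + offset) for name, _, end in short_hits}
    return [hit for hit in long_hits if (hit[0], hit[2]) not in shifted_ends]


offset = LONG_LEN - SHORT_LEN   # assumption: pure N-terminal truncation
missing = hits_missing_from_short(long_hits, short_hits, offset)
print(offset, missing)
```

Under this assumption, every hit of the short isoform maps onto a hit of the long isoform, and the only unmatched hit is the DRADA repeat at positions 133-202.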

12. Three missing exons are seen in the graphic output. Alignment (e.g. BLAST 2 sequences) of NM_207197.2 vs. NM_207191.2: nucleotide positions of the missing exons are identified.

Parts of the alignment (Query = NM_207197.2, Sbjct = NM_207191.2):

Query  2281  CCAGCGACTCTGCCAGCTCAAGGGACCCACCTGCCAGTACAG  2322
Sbjct  2281  CCAGCGACTCTGCCAGCTCAAGGGACCCACCTGCCAGTACAG  2322

Query  2538  AGTCTCAGGGGCCAGCCAAGCCCCCACCCCCAAGGAAGCCACTGCCTGCCGACCCCCAGG  2597
Sbjct  2321  AGTCTCAGGGGCCAGCCAAGCCCCCACCCCCAAGGAAGCCACTGCCTGCCGACCCCCAGG  2380

In the annotation of NM_207197.2, features:

exon 2323..2392 => 70 nucleotides
exon 2393..2467 => 75 nucleotides
exon 2468..2539 => 72 nucleotides

The 70-nt exon deletion causes a translation frameshift (70 ≠ 3n).
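The frameshift argument reduces to checking which exon length is not a multiple of three. A minimal Python sketch, using the exon coordinates quoted from the NM_207197.2 annotation in the solution; the function names are hypothetical.

```python
# Skipped exons in NM_207197.2 (1-based, inclusive coordinates).
SKIPPED_EXONS = [(2323, 2392), (2393, 2467), (2468, 2539)]


def exon_length(start, end):
    """Length of an exon given inclusive coordinates."""
    return end - start + 1


def shifts_frame(start, end):
    """Skipping an exon shifts the reading frame unless its length is 3n."""
    return exon_length(start, end) % 3 != 0


lengths = [exon_length(s, e) for s, e in SKIPPED_EXONS]
frameshifting = [(s, e) for s, e in SKIPPED_EXONS if shifts_frame(s, e)]

print(lengths)         # lengths of the three skipped exons
print(frameshifting)   # exons whose removal alone changes the frame
print(sum(lengths) % 3)  # net frame change when all three are skipped
```

The total skipped length is 217 nucleotides, which matches the jump in the alignment (Query 2538 pairs with Sbjct 2321) and leaves a net shift of one nucleotide.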

12. Proteins encoded by long and short isoforms:

NP_997074 772 aa
  1 mrlallwalg llgagsplps wplpniggte eqqaesekap replepqvlq ddlpislkkv
 61 lqtslpeplr ikleldgdsh ilellqnrel vpgrptlvwy qpdgtrvvse ghtlenccyq
121 grvrgyagsw vsictcsglr glvvltpers ytleqgpgdl qgppiisriq dlhlpghtca
181 lswresvhtq kppehplgqr hirrrrdvvt etktvelviv adhseaqkyr dfqhllnrtl
241 evallldtff rplnvrvalv gleawtqrdl veispnpavt lenflhwrra hllprlphds
301 aqlvtgtsfs gptvgmaiqn sicspdfsgg vnmdhstsil gvassiahel ghslgldhdl
361 pgnscpcpgp apaktcimea stdflpglnf sncsrralek alldgmgscl ferlpslppm
421 aafcgnmfve pgeqcdcgfl ddcvdpccds ltcqlrpgaq casdgpccqn cqlrpsgwqc
481 rptrgdcdlp efcpgdssqc ppdvslgdge pcaggqavcm hgrcasyaqq cqslwgpgaq
541 paaplclqta ntrgnafgsc grnpsgsyvs ctprdaicgq lqcqtgrtqp llgsirdllw
601 etidvngtel ncswvhldlg sdvaqplltl pgtacgpglv cidhrcqrvd llgaqecrsk
661 chghgvcdsn rhcyceegwa ppdcttqlka tsslttglll sllvllvlvm lgasywyrar
721 lhqrlcqlkg ptcqyslrgq psphpqgshc lptpragahr vtcpaqgles rp

736-772 => 37 amino acids

NP_997080 863 aa
  1 mrlallwalg llgagsplps wplpniggte eqqaesekap replepqvlq ddlpislkkv
 61 lqtslpeplr ikleldgdsh ilellqnrel vpgrptlvwy qpdgtrvvse ghtlenccyq
121 grvrgyagsw vsictcsglr glvvltpers ytleqgpgdl qgppiisriq dlhlpghtca
181 lswresvhtq kppehplgqr hirrrrdvvt etktvelviv adhseaqkyr dfqhllnrtl
241 evallldtff rplnvrvalv gleawtqrdl veispnpavt lenflhwrra hllprlphds
301 aqlvtgtsfs gptvgmaiqn sicspdfsgg vnmdhstsil gvassiahel ghslgldhdl
361 pgnscpcpgp apaktcimea stdflpglnf sncsrralek alldgmgscl ferlpslppm
421 aafcgnmfve pgeqcdcgfl ddcvdpccds ltcqlrpgaq casdgpccqn cqlrpsgwqc
481 rptrgdcdlp efcpgdssqc ppdvslgdge pcaggqavcm hgrcasyaqq cqslwgpgaq
541 paaplclqta ntrgnafgsc grnpsgsyvs ctprdaicgq lqcqtgrtqp llgsirdllw
601 etidvngtel ncswvhldlg sdvaqplltl pgtacgpglv cidhrcqrvd llgaqecrsk
661 chghgvcdsn rhcyceegwa ppdcttqlka tsslttglll sllvllvlvm lgasywyrar
721 lhqrlcqlkg ptcqyraaqs gpserpgppq rallargtkq asalsfpapp srplppdpvs
781 krlqaeladr pnpptrplpa dpvvrspksq gpakpppprk plpadpqgrc psgdlpgpga
841 gipplvvpsr papppptvss lyl

736-863 => 128 amino acids
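The position at which the two proteins diverge, and hence the lengths of the distinct C-termini, can be found by a direct comparison. A minimal Python sketch follows; to keep it short, only the residues from position 721 onward are compared (both proteins are identical before that point in the listings above), and the function name `first_difference` is hypothetical.

```python
# Residues 721..end of each protein, copied from the listings above.
SHORT_TAIL = "lhqrlcqlkgptcqyslrgqpsphpqgshclptpragahrvtcpaqglesrp"   # NP_997074
LONG_TAIL = (
    "lhqrlcqlkgptcqyraaqsgpserpgppqrallargtkqasalsfpappsrplppdpvs"
    "krlqaeladrpnpptrplpadpvvrspksqgpakpppprkplpadpqgrcpsgdlpgpga"
    "gipplvvpsrpapppptvsslyl"
)                                                                      # NP_997080
TAIL_START = 721  # protein position of the first residue in each tail


def first_difference(a, b, offset=1):
    """Return the position (counted from offset) of the first mismatch
    between two sequences, or the position just past the shorter one."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset + i
    return offset + min(len(a), len(b))


diverge = first_difference(SHORT_TAIL, LONG_TAIL, offset=TAIL_START)
short_len = TAIL_START + len(SHORT_TAIL) - diverge   # distinct residues, short protein
long_len = TAIL_START + len(LONG_TAIL) - diverge     # distinct residues, long protein

print(diverge, short_len, long_len)
```

The distinct C-terminal segments obtained this way are the sequences to submit to ScanProsite in the last step of the exercise.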

12. Patterns:

Longer C-terminus (128 aa)
Shorter C-terminus (37 aa)