Host : Dr. Nobuyuki Nukina Tutor : Dr. Fumitaka Oyama

Method to assign the coding regions of ESTs
Céline Becquet
Summer Program 2002
Structural Neuropathology Lab, Molecular Neuropathology Group
RIKEN Brain Science Institute
Host: Dr. Nobuyuki Nukina
Tutor: Dr. Fumitaka Oyama


Abstract

The present study investigates deregulated genes in Huntington disease, an autosomal dominant neurodegenerative disorder caused by polyglutamine expansion in the disease protein, huntingtin. A GeneChip experiment was previously performed to compare mRNAs extracted from cerebrum cells of wild type (WT) mice and of HD mouse models expressing the pathological form of the huntingtin protein. We identified several ESTs that may be involved in the pathogenesis of Huntington disease. To find more information about these ESTs, we developed a bioinformatical method that finds the 5'-ends of the ESTs and predicts the hypothetical Coding Sequences (CDSs) and exons of the mRNAs corresponding to the ESTs. Using this method, we were able to show homology between the mouse ESTs TC and TC and the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492) and a Human cDNA (AK092285). We built several hypotheses about these two mouse mRNAs, which remain to be confirmed. We also showed that the mouse protein phosphatase (EST TC478977) is an incomplete mRNA sequence: the 9th exon of its mRNA is the 5'-end of the complete 9th exon, whose 3'-end is the mouse EST TC. Confirmation of our hypotheses is now in progress.

Contents

Abstract
Contents
Introduction
Bioinformatical Methodology
Blast search using Tigr or NCBI databases
Tigr Database
Making a search
Mouse Gene Index Report
NCBI Database
Non-redundant database
EST-Mouse database
Chromosome location and genomic sequence selection
Chromosome location
Genomic sequence selection
Selection process
Use of the genomic sequence
Exons prediction using genomic sequence
Process of prediction
Prediction analysis
Search in RIKEN 5'-ends sequences database
Process of the search
Which sequence for which result
Hypothetical mRNA, primers
Mouse mRNA
Hypothetical mRNA
Confirmation by RT-PCR
Results & Hypotheses
ESTs F and C
Description
EST F: TC
EST C: TC
Extension in 5' direction
Homology
Homology to a mRNA similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492)
Homology to Human cDNA (AK092285)
Views on the two genomes (Figure 2)
5'-ends and Confirmation of the predicted exons and conserved regions
Hypotheses F and C on the same mRNA
Hypothesis FC1: 1 exon of 5 kbp
Hypothesis FC2: Promoter + 4 or 5 exons: >4 kbp
Hypothesis FC3: known 5'-end + 5 or 6 exons: >4.7 kbp
Hypotheses F4 and C4 on different mRNAs: F4 = Promoter + 4 exons: >3.5 kbp; C4 = Known 5'-end + 5 exons: >3.5 kbp
Primers designed
Hypotheses confirmation
Hypothetical Exons confirmation
EST B: TC
Homologous to the human protein phosphatase 1, regulatory subunit 16B (PPP1R16B) (XM_028840)
Hypothesis B: TC laps+B EST = 6412 bp
B primer
Conclusion
Acknowledgments
Supplements
1- F EST
2- Extended F EST
3- Prediction 1
4- Prediction 2

Introduction

Through the Summer Program 2002, the RIKEN Brain Science Institute gave me the opportunity to work in the Structural Neuropathology laboratory directed by Dr. Nukina. In this laboratory, researchers are attempting to find the genes that are deregulated in Huntington disease. This disease is an autosomal dominant neurodegenerative disorder caused by polyglutamine expansion in the disease protein, huntingtin. Some of these genes of interest are novel and of unknown function, so much of the researchers' effort goes into finding information about them.

My supervisor, Dr. Oyama, performed a GeneChip experiment to compare the mRNAs extracted from cerebrum cells of wild type (WT) mice and of HD model mice. The HD model mice express the pathological form of the huntingtin protein. On a GeneChip, many of the nucleotide targets are EST sequences. An EST is a part of an mRNA sequence whose complete sequence, function and constitution are unknown. As a result, these target sequences hybridize with mRNA probes about which we have almost no information.

The results of the GeneChip analysis show that some ESTs are strongly down regulated in the HD mice. The down regulation of these genes was confirmed by northern blotting of the transcriptome of the cerebrum cells of the WT and HD mice. This northern blot also allowed the size of the mRNAs corresponding to the ESTs of interest to be estimated. The ESTs whose down regulation is confirmed by in situ hybridization in different HD mouse models may be involved in the pathogenesis of Huntington disease.

The first step in the investigation of these ESTs is the amplification of the corresponding mRNAs by Reverse Transcriptase-PCR (RT-PCR). To do so, we need the 5'-end sequence of the mRNAs. However, this information is unknown for the ESTs. It is in this context that I was asked to develop a bioinformatical method to find the 5'-ends of the ESTs implicated in Huntington disease. This method uses only tools that are freely available on the Internet. It also allows the hypothetical Coding Sequences (CDSs) and exons of the mRNAs corresponding to these ESTs to be predicted.

In the following report, I first explain my methodology. I then present the results and the hypotheses I made about 3 ESTs that are particularly interesting because of their strong and confirmed down regulation in different HD mouse models.

Bioinformatical Methodology

Using the GeneChip, the mouse mRNAs corresponding to the ESTs of interest could be identified. It is necessary to find the 5'-end of these mRNAs and their constitution in terms of exons and introns. In addition, some idea of the function of these genes would be invaluable. The methodology described below presents the different stages in the search for information on the ESTs.

1 Blast search using Tigr or NCBI databases

To begin any work we need the EST sequence. The GeneChip manufacturer provides the TC number associated with the nucleotide target for each spot of the GeneChip.

1.1 Tigr Database

To find the information associated with a TC number, we search the database of The Institute for Genomic Research (Tigr), which is available on the Tigr web site.

1.1.1 Making a search

To search the Tigr database with the TC number, we go to the Gene Indices page.

BLAST algorithm

On the Gene Indices page, the link BLAST search displays a Query page. On that Query page, we can choose to work only on the mouse database or on both the human and mouse databases. The BLAST algorithm finds nucleotide sequences from the Tigr databases that have some similarity with the input sequence. For each similar sequence, a link provides the Mouse Gene Index Report (MGI Report, cf. part 1.1.2 below). This algorithm will also be useful later to confirm the predicted exons (cf. Exons prediction part 3 below).

Search index by identifier

Some links allow the selection of the genome of a specific organism. The Mouse link opens the Mouse Gene Index page. The Tigr Mouse Gene Index page provides several tools to search the Tigr mouse databases using the sequence of the ESTs. Selecting the link Search Index by Identifier (TC, ET, EST, GB) displays the MGI Report page, where it is possible to enter the TC number of the GeneChip target and retrieve the EST information corresponding to this identifier.

1.1.2 Mouse Gene Index Report

The MGI Report gives the mRNA sequence corresponding to the TC number. The size of the sequence and the predicted Open Reading Frames are provided. There are also alignments of all the ESTs that recognize this mRNA sequence. For each of these ESTs, a link leads to the sequence and ID numbers of the EST.

For some ESTs, the MGI Report displays the link Expression summary. This page provides information about the expression of the mRNA recognized by the EST in different cell types.

For some TC numbers, the MGI Report also provides the Tentative Ortholog Group. This link leads to a page where similar ESTs of different organisms are aligned. Thus, it is possible to find genes homologous to the query mRNA in other organisms.

For most of the TC numbers, the position of the EST on the mouse genome is provided in the MGI Report. A link leads to the Genomic Context of the EST sequence. Here, the alignment of some homologous ESTs or genes in the genomes of other organisms can be found.

1.2 NCBI Database

The web site of the National Center for Biotechnology Information (NCBI) provides all the tools necessary to search its databases. We can find information about any biological entity in the different available databases. To do the search, we need to select the database (PubMed, Protein, Nucleotide, Structure, Genome, etc.). It is possible to search with an ID number, a keyword, or even an author name.

The BLAST link displays tools for aligning a protein or nucleotide sequence with the sequences of the available databases (RefSeq models, GenBank, EMBL, etc.). In the method, we often use the EST-Mouse and the Non-redundant databases. To search these two databases, it is enough to select the link Standard nucleotide-nucleotide BLAST [blastn], which displays the Query page, where the query sequence can be entered. We then select the database we want to work with in the corresponding field of the Query page.

1.2.1 Non-redundant database

A search in the non-redundant database finds all of the sequences similar to the query sequence. The default options display 100 sequences, irrespective of the score of the alignment between the 2 sequences. It is important to be critical about the results provided at this point. When 2 "similar" sequences longer than 200 nucleotides have only one hit and the identity is less than 25 nucleotides, the term similar means almost nothing. What we can say is that a small region is similar, so that, for example, a motif may be conserved between the two sequences.
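The caution above about weak hits can be turned into a simple filter. The following is a minimal sketch, assuming the hits have already been copied from a BLAST result page into a small record; the `BlastHit` class, its field names and the helper functions are hypothetical and are not part of any BLAST tool. The thresholds (200 nucleotides, 25 identical nucleotides) are the ones quoted in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlastHit:
    """One database sequence returned by a BLAST search (hypothetical record)."""
    subject_id: str
    query_length: int       # length of the query sequence, in nucleotides
    subject_length: int     # length of the database sequence, in nucleotides
    identities: List[int]   # identical nucleotides in each aligned region (hit)


def is_weak_similarity(hit: BlastHit, min_length: int = 200,
                       min_identities: int = 25) -> bool:
    """True when both sequences are longer than min_length but share a single
    aligned region with fewer than min_identities identical nucleotides."""
    both_long = hit.query_length > min_length and hit.subject_length > min_length
    single_short_region = (len(hit.identities) == 1
                           and hit.identities[0] < min_identities)
    return both_long and single_short_region


def keep_informative(hits: List[BlastHit]) -> List[BlastHit]:
    """Drop weak similarities and sort the rest by total identical nucleotides."""
    kept = [h for h in hits if not is_weak_similarity(h)]
    return sorted(kept, key=lambda h: sum(h.identities), reverse=True)
```

Such a filter only removes the hits that the text already calls uninformative; the remaining hits still have to be examined one by one, as described in the following paragraphs.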

Moreover, the identifiers change depending on the type of sequence found. Caution is especially necessary with XM sequences (mRNA sequences produced by the NCBI's Genome Annotation Project). These represent the known or potential transcripts of a gene. Such a sequence is not as reliable as a sequence determined by experiment.

To find a human homologous sequence, or to find the mRNA sequence corresponding to an EST sequence, we have to study the first, best-scored sequences that the blast proposes. For each of the similar sequences found, a link provides the Information Sheet of the sequence. It tells us whether the sequence has been predicted or is an experimentally determined mRNA, and whether the provided sequence covers only the coding regions or the total mRNA.

1.2.2 EST-Mouse database

If the search in the non-redundant database does not provide the total mRNA corresponding to the EST sequence, we have to find mouse ESTs that could extend the sequence we have. We can use the EST-Mouse database to blast the EST sequence directly. In that case, however, we usually find only ESTs that are shorter than the sequence we have. This is because the sequence provided by the Mouse Gene Index Report is a merge of all the ESTs that recognize the same part of an mRNA. Other ESTs may nevertheless recognize other parts of the mRNA. Therefore, blasting the EST sequence against the EST-Mouse database may reveal a sequence that shares a short similar region with our EST sequence and that can extend it in one direction or the other.

Generally, we use this database by blasting a part of the genomic sequence near the position of our EST of interest (cf. Genomic sequence selection part 2.2 below). If we find ESTs similar to these genomic regions, we can assume that these regions belong to the mRNA we are searching for.

We can also blast a predicted mRNA against this database (cf. Exons prediction part 3 below). If we find similar EST sequences, we can have confidence in the predicted exons that fit these sequences.

2 Chromosome location and genomic sequence selection

The searches above can generate a large number of similar sequences. It is interesting to see their positions relative to the original EST sequence along the mouse chromosome. It is also interesting to find the positions of homologous genes or of conserved regions in another organism. We are mostly interested in the human homologue because the human genome is well annotated.

2.1 Chromosome location

To do so, we blast the different sequences against the Human Genome and Mouse Genome databases that the NCBI web site provides.

If the sequence we blast is shorter than 100 nucleotides, it is better to change the default options (Expect at 10, and filter at none). The BLAST result page displays the alignments and the positions along the contigs in which the similar regions are located. For each alignment, a link leads to a Genome View displaying the hits of the query sequence along the genome of the organism. By selecting the chromosome of interest on the Genome View, it is possible to see the alignment of the similar regions on this chromosome. On this Chromosome View, the positions given are relative to the whole chromosome and therefore differ from the positions on the contig given by the BLAST result page.

2.2 Genomic sequence selection

2.2.1 Selection process

On the Chromosome View, two fields allow the positions of the view to be changed. We can choose between a zoom on the hits and a global view around the hits. To save the selected contig sequence we open the Download/View Sequence/Evidence page. This page reports the positions on the chromosome and on the contig. The links Display and Save to Disk allow the genomic sequence to be saved. The link View Evidence displays all the RefSeq models, GenBank mRNAs, annotated known or potential transcripts, and ESTs that align to the area of interest.

2.2.2 Use of the genomic sequence

Zoomed sequence

By blasting a homologous mRNA of another organism (e.g. human) against the Mouse Genome, we can find conserved regions on the mouse chromosome. We can then select the zoomed sequences of the similar regions of the mouse chromosome and blast them against the EST-Mouse database (cf. part 1.2 above) or against the Tigr Mouse database (cf. BLAST algorithm in part 1.1.1 above). If we find an EST similar to a genomic region, we can predict with some confidence that this region is a conserved coding region and belongs to an exon of the mouse mRNA we are searching for.

Big genomic sequence

Some exon prediction algorithms (cf. Exons prediction part 3 below) can analyze a big genomic sequence around the hit of the original EST sequence. Moreover, this sequence could contain all or part of the gene we are searching for. By blasting this mouse gene sequence against the Human Genome, we can find the conserved regions. If the human homologue is known and well annotated, and if these regions coincide with exons of the homologous gene, we can consider these regions as exons or coding regions in the mouse gene.
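Part 2.1 notes that the Chromosome View reports positions relative to the whole chromosome, whereas the BLAST result page reports positions on the contig. A minimal sketch of the conversion between the two follows, assuming 1-based inclusive positions and assuming that the start of the contig on the chromosome and the orientation of the contig are known (for example from the Download/View Sequence/Evidence page). The function and parameter names are hypothetical.

```python
def contig_to_chromosome(contig_pos: int, contig_start: int,
                         contig_length: int, reverse: bool = False) -> int:
    """Convert a 1-based position on a contig into a 1-based chromosome position.

    contig_start is the chromosome position of the first base of the contig
    region as placed on the chromosome; reverse=True means the contig is
    placed on the chromosome in the opposite orientation.
    """
    if not 1 <= contig_pos <= contig_length:
        raise ValueError("position lies outside the contig")
    if reverse:
        # The last base of the contig sits at contig_start, the first at the far end.
        return contig_start + contig_length - contig_pos
    return contig_start + contig_pos - 1
```

Keeping all hits in chromosome coordinates makes it possible to compare the positions of ESTs, predicted exons and conserved regions that were obtained from different contigs or different searches.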

3 Exons prediction using genomic sequence

If we did not find any homologue or any mouse mRNA sequence using the techniques described above, exon prediction may provide information about the constitution of the mRNA. Furthermore, in cases where the sequence of the total CDSs has been found, it is still of interest to define the non-coding regions of the mRNA.

3.1 Process of prediction

The algorithms we used are Genscan and Grail. Their options allow the kind of organism we work on to be selected. Grail predictions may be verified against the nucleotide and EST databases, and Grail can sometimes predict the promoter of a predicted gene.

The process involves analyzing, with these algorithms, mouse genomic sequences of different sizes covering the region around the position of the EST. The algorithms often give different results, but some exons are very well predicted (with a good score) by both methods and with different sizes of genomic sequence.

3.2 Prediction analysis

To have a clear view of their positions on the mouse chromosome, it is useful to blast the different predicted mRNA sequences against the Mouse Genome (cf. Chromosome location part 2.1 above). A comparison between the positions of the predicted exons and the hits of the homologous human gene (if known) along the mouse chromosome allows the prediction to be confirmed.

We can also blast these sequences against the Human Genome (cf. Chromosome location part 2.1 above). If the homologue is well annotated, the hits of the predicted mRNA that are similar to exons of the human gene confirm that the corresponding regions are exons of the mouse mRNA we are searching for.

It is also interesting to blast the predicted mRNAs against the EST-Mouse database (cf. part 1.2 above) and the Tigr Mouse database (cf. BLAST algorithm in part 1.1.1 above). In this manner, ESTs may be found that confirm that the predicted exons are part of the mRNA we are searching for.

For each EST that appears to correspond to our mRNA, we must confirm that it derives from the same chromosome as our mRNA. Because of the default options of the algorithms, some short EST sequences appear in the results even if their scores are bad. If the MGI Report (cf. part 1.1.2 above) does not give the location of the EST, we have to blast the EST sequence against the Mouse Genome. If the EST sequence is too short, we have to change the default options (cf. Chromosome location part 2.1 above). Then, by comparing the Chromosome Views of the ESTs, the predicted mRNAs, and the regions conserved between mouse and human, we can check which exons are confirmed.
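The comparison of Chromosome Views described in part 3.2 amounts to checking which predicted exons overlap an EST hit or a conserved region on the same chromosome. The following minimal sketch performs that check, assuming all positions have been collected by hand as inclusive (start, end) pairs in chromosome coordinates. The names `overlap_length` and `confirm_exons` are hypothetical and belong to no existing tool.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) on the chromosome, inclusive, start <= end


def overlap_length(a: Span, b: Span) -> int:
    """Number of chromosome positions shared by two spans (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def confirm_exons(predicted: Dict[str, Span], evidence: Dict[str, Span],
                  min_overlap: int = 1) -> Dict[str, List[str]]:
    """For each predicted exon, list the evidence (EST hits, conserved regions)
    that overlaps it by at least min_overlap positions."""
    return {
        exon: [name for name, span in evidence.items()
               if overlap_length(exon_span, span) >= min_overlap]
        for exon, exon_span in predicted.items()
    }
```

An exon with an empty list is supported only by the prediction algorithm; an exon supported by both an EST and a human hit is the strongest candidate. The `min_overlap` parameter is an illustrative choice for ignoring very short overlaps.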

4 Search in RIKEN 5'-ends sequences database

The most interesting information we are searching for is the 5'-end of the mRNA. If this sequence is known, it is possible to design a primer and amplify the mRNA of interest by RT-PCR.

4.1 Process of the search

The Gene Science Laboratory of the Genome Exploration Research Group of RIKEN works on the mouse full-length cDNA encyclopedia project. This project involves collecting data on most of the mouse full-length cDNAs, their primary structures and their expression sites. It builds databases of mouse 5'-end and 3'-end sequences and of full-length cDNA sequences. These databases are available on the project web site.

To search these databases, we select the link Search RIKEN Mouse cDNA Encyclopedia on the Home page, or we select the link Our Activities. The Our Activities page displays tools to work on this Encyclopedia. The link Homology search on Our database displays the page of the RIKEN Mouse Encyclopedia Index, where the link Homology search leads to a Query page. On this page, we can enter a nucleotide sequence and blast it against the RIKEN databases. A field allows the search to be restricted to one or two of the databases. The result page gives the ID numbers and links of the cDNA sequences that align with the input sequence. For each cDNA sequence, a link leads to an Information Sheet where we can find the nucleotide sequence and other information about it.

4.2 Which sequence for which result

When we have the mRNA or predicted mRNA corresponding to our EST, we can blast it against the RIKEN 5'-ends sequences database. If we find a 5'-end sequence corresponding to the 5'-end of our mRNA, we have enough information to define a primer.

If we do not find a corresponding 5'-end, the prediction together with the hits of the human homologue suggests the position of the 5'-end on the mouse genome. A blast of a part of the genomic sequence around this position (cf. Genomic sequence selection part 2.2 above) against the RIKEN 5'-end cDNA sequences database may allow the 5'-end sequence to be identified.

Sometimes, the Information Sheet displays a 3'-end sequence associated with the 5'-end sequence we found. In that case, we have to blast these two sequences against the Mouse Genome. With this information, we can check whether our EST corresponds to this 3'-end sequence. We then know whether we have predicted the correct gene and not the following gene on the mouse chromosome.

5 Hypothetical mRNA, primers

From the data we managed to collect for our mouse EST, we can design hypothetical mRNAs for this EST. It is then possible to select specific sequences that could serve as primers to confirm the hypotheses.

5.1 Mouse mRNA

Sometimes it is possible to find the mouse mRNA sequence that corresponds to our EST. In that case, the primer can be the 5'-end of this mRNA sequence. If the 5'-end has been confirmed by a blast against the RIKEN 5'-end cDNA sequences database, we can use the RIKEN sequence to design the primer.

5.2 Hypothetical mRNA

If we do not have an experimentally determined mouse mRNA sequence, but only the predicted mRNA, several different predictions for the constitution of the mRNA may be equally compatible with the data. These hypotheses are built by considering which exon corresponds to a hit of the human homologue on the mouse chromosome, which exon corresponds to a mouse EST, and whether a RIKEN 5'-end has been found.

If we have confirmed the 5'-end of the predicted mRNA, we can use the RIKEN 5'-end sequence to design a primer. If we do not have confirmation of the 5'-end of the predicted mRNA, we can use the first predicted exon to design a primer. If the subsequent amplification does not work, the 5'-end of the human homologous mRNA (if known) may also be tried.

If we want to confirm our hypotheses about the constitution of the mRNA, we can use each of the predicted exon sequences, or the EST sequences that confirm the predicted exons, to design the primers.

6 Confirmation by RT-PCR

The next step consists in confirming the hypotheses. To do so, we perform RT-PCR using the different primers designed above. The northern blot of the RT-PCR product shows whether the primer belongs to our mRNA or not. It also gives the size of the RT-PCR product. If the size of the RT-PCR product is similar to the estimated size (cf. Introduction above), it is highly probable that the primer we used is the 5'-end of the mRNA we want to study. The RT-PCR products are all sequenced, and an analysis of the sequences (blast against the Mouse Genome, cf. Chromosome location part 2.1 above) will reveal the real exons. By comparing the Chromosome Views, we will confirm or reject the hypothetical exons defined by the method.
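Part 5 proposes taking the primer from the 5'-end of the mRNA, of the RIKEN sequence or of the first predicted exon. As an illustration only, the sketch below extracts a candidate forward primer from the start of such a sequence and reports two common rough indicators: the GC fraction and the melting temperature given by the Wallace rule, Tm = 2(A+T) + 4(G+C) in °C, which is a rule of thumb for short oligonucleotides. The primer length of 20 nucleotides and the function name are assumptions; the report does not state how the primers were actually chosen.

```python
def candidate_forward_primer(five_prime_end: str, length: int = 20) -> dict:
    """Take the first `length` bases of a 5'-end sequence as a candidate primer."""
    seq = five_prime_end.upper().replace(" ", "").replace("\n", "")
    if length <= 0 or len(seq) < length:
        raise ValueError("5'-end sequence is shorter than the requested primer")
    primer = seq[:length]
    if set(primer) - set("ACGT"):
        raise ValueError("primer region contains ambiguous or invalid bases")
    gc = primer.count("G") + primer.count("C")
    at = length - gc
    return {
        "primer": primer,
        "gc_fraction": gc / length,
        # Wallace rule: 2 degrees per A/T and 4 degrees per G/C.
        "wallace_tm_celsius": 2 * at + 4 * gc,
    }
```

A real primer choice would also consider specificity against the mouse genome and the pairing with the reverse primer, which this sketch does not address.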

Results & Hypotheses

We now present the information and data we found, using the method described above, about 3 ESTs that are particularly down regulated in HD mice.

1 ESTs F and C

These 2 ESTs seem to have a strong impact on the pathogenesis of Huntington disease. They are both strongly down regulated in the HD weeks mouse model. This down regulation was confirmed in the C01 16 weeks mouse model, as can clearly be seen on the northern blots (cf. Figure 1 below). ESTs F and C are studied together because, as shown below, they are linked to each other.

Figure 1: Northern blots of ESTs C and F between wild type and HD mice (labels: 28S; 4.6 kb; 5 kb).

1.1 Description

1.1.1 EST F: TC

The size of the F mRNA has been estimated by northern blot at 5 kb (cf. Figure 1 above). The sequence of the F EST we have is 2729 bp (cf. Sequence in Supplement 1: F EST). This sequence is made up of a sequence of 2413 bp sequenced in the lab (and not yet submitted to GenBank) and the 3'-end of the sequence we found in the MGI Report TC. Its position on mouse chromosome 9 is

1.1.2 EST C: TC

The sequence C we have is 568 bp (this sequence is available in the Tigr Database, cf. Methodology, part 1.1.2). The size of the C mRNA estimated by northern blot is about 4.6 kbp (cf. Figure 1 above). The position of this sequence on mouse chromosome 9 is

1.2 Extension in 5' direction

We found the 5' sequence of mouse cDNA BI (918 bp) that extends sequence F (identity between positions of sequence BI and the 5'-end (positions 1 to 313) of sequence F). We merged the F EST sequence of 2729 bp with the first 366 nucleotides of BI. Thus, the 3'-end of this EST was deleted. We obtained an extended F EST of 3095 bp (cf. Sequence in Supplement 2: Extended F EST). Its positions on mouse chromosome 9 are and This extension will have to be confirmed.

1.3 Homology

We took the mouse genomic sequence Mouse Contig 1 from positions to on mouse chromosome 9 (= positions on the mouse contig NW_ cf. Methodology, Genomic sequence selection part 2.2) to check for conserved parts of mouse chromosome 9 in the human genome (cf. Methodology, Chromosome location part 2.1). We found that mouse chromosome 9 is similar to some part of human chromosome 11.

1.3.1 Homology to a mRNA similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492)

The view of human chromosome 11 below shows that most of the hits of the mouse genomic sequence Mouse Contig 1 lie in the gene of a human mRNA similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492). However, most of these hits do not coincide with an exon of this mRNA. This could be because this mRNA has been predicted (the predicted exons could be too short and the similar regions could be extensions of these predicted exons). Alternatively, our observation could be explained by another gene being nested within the Human Precursor gene. The predicted Human Precursor mRNA (XM_171492) is only 1182 bp. Its position on human chromosome 11 is (cf. view, part 1.3.3 below)

1.3.2 Homology to Human cDNA (AK092285)

We found that EST C is similar to the 3'-end of Human cDNA (AK092285), whose function is unknown. Its size is 2766 bp. Its location on human chromosome 11 is following the 3'-end of the Human mRNA (XM_171492) defined above.

To see the positions of the conserved regions on mouse chromosome 9, we also took a human genomic sequence, Human Contig 1, from the similar region on human chromosome 11. We used the positions of human chromosome 11 (= positions on the human contig NT_ cf. Methodology, Genomic sequence selection part 2.2)

1.3.3 Views on the two genomes (Figure 2)

Figure 2 shows the conserved parts of the mouse genomic sequence and the hits of the Human cDNA (AK092285) along human chromosome 11. It also shows the conserved parts of human chromosome 11 on mouse chromosome 9. The annotated gene LOC corresponds to the mRNA similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492).

The labels in green show the similar regions in both genomes. As the two homologous regions are not oriented in the same direction in the two genomes, the numbers of the hits are reversed. However, hit 1 of the Mouse Contig 1 on the human chromosome is exactly the sequence where hit 1 of the Human Contig 1 blasts on mouse chromosome 9. The green labels are used to map similar parts between mouse chromosome 9 and human chromosome 11.

We also report the positions of the similar regions between the exons of the mRNA of the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492) and mouse chromosome 9. The labels in black show the positions of the exons constituting this Human mRNA (XM_171492) and its hits in the mouse genome.

Figure 2: View on human chromosome 11 and mouse chromosome 9 of the Human cDNA AK092285, of the Mouse Contig 1 and of the Human Contig 1. Tracks: Human cDNA (AK092285), Mouse Contig 1, Human Contig 1.

1.4 5'-ends and Confirmation of the predicted exons and conserved regions

We made 2 predictions using different sizes of mouse contig (cf. Methodology, Genomic sequence selection part 2.2 and Exons prediction part 3.1, and Sequence in Supplement 3: Prediction 1 and Supplement 4: Prediction 2). We then attempted to confirm the predicted exons (cf. Methodology, Prediction analysis part 3.2). We found some ESTs that correspond to the hits of Prediction 2 and of Human Contig 1 along mouse chromosome 9.

Hit 1 of the Human Contig 1 is confirmed by the mouse ESTs TC (678 bp) and TC (623 bp).

Hit 1 of Prediction 2 and hit 2 of the Human Contig 1 coincide with a mouse 5'-end EST BB (650 bp). However, the associated 3'-end (BB305394, bp) in the RIKEN database is oriented in the wrong direction.

Hit 2 of Prediction 2 and hit 3 of the alignment of the Human Contig 1 along mouse chromosome 9 correspond to the 5'-end EST BB (228 bp). Hit 3 of the alignment of the Human Contig 1 along mouse chromosome 9 is also confirmed by the 5'-end EST TC (520 bp).

All these EST sequences can be found by the search by identifier explained in part 1.1 of the Methodology.

View on mouse chromosome 9

Figure 3 shows the positions of the different ESTs and their sizes. It also shows the sizes of the hits of Prediction 2 and of the Human Contig 1 on mouse chromosome 9. The sizes we will use for the different hypothetical exons and conserved regions in the hypothetical mRNAs below are shown in bold. The labels in gray will be used in the following figures to provide information about the hits not considered to be hypothetical exons. The labels in green show the similar regions in both the mouse and human genomes. The hits of the Human Precursor mRNA (XM_171492) along mouse chromosome 9 are shown in black.

Figure 3: View on mouse chromosome 9 of Prediction 2, of the Human Contig 1 and of the ESTs confirming the predicted exons and hits. Tracks: Prediction 2, Human Contig 1, All Confirmations.

1.5 Hypotheses F and C on the same mRNA

We can consider the F and C ESTs as belonging to the same mRNA, because the estimated sizes for these two ESTs are similar and the 2 sequences are located close to one another on mouse chromosome 9.

1.5.1 Hypothesis FC1: 1 exon of 5 kbp

The genomic sequence between the 5'-end position and the 3'-end of the C EST is 5 kb. F and C are proposed to recognize the same mRNA, and this mRNA is proposed to consist of only 1 exon of 5 kbp. This could be the homologous gene of the Human cDNA (AK092285).

View on mouse chromosome 9

In Figure 4, the hypothetical FC1 mRNA of 5 kbp is shown in red. The hits of the Human Precursor mRNA (XM_171492) on mouse chromosome 9 are shown in black. The regions of similarity between the C EST and the Human cDNA (AK092285) are shown in purple.

Figure 4: View on mouse chromosome 9 of the hypothetical mRNA FC1. Tracks: Extended F and C ESTs; mRNA FC1 (5 kb).

1.5.2 Hypothesis FC2: Promoter + 4 or 5 exons: >4 kbp

We also consider the possibility that the FC mRNA contains several exons. A predicted promoter in the 5' direction of the F and C ESTs was found with Prediction 1 (cf. Sequence in Supplement 3: Prediction 1). This could be the promoter of the FC2 mRNA. This FC2 mRNA seems to be homologous to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492) because the predicted exons all coincide with regions of similarity between this Human Precursor mRNA and mouse chromosome 9 (cf. exon positions, part 1.4 above).

If we consider the 4 exons of the hypothetical mRNA in Figure 5, we obtain 4159 bp. It is noteworthy, however, that predicted exons are usually shorter than the real exons. For this reason, we estimate that the real exons are longer than the predicted ones. We can consider F and C as belonging to the same exon of 3890 bp. In that case we have an mRNA of 4 exons of about 4440 bp. This size is quite similar to the estimated size (cf. Introduction).

View on mouse chromosome 9

The hits of the Human Precursor mRNA (XM_171492) on mouse chromosome 9 are shown in black. The hypothetical FC2 mRNA is shown in red.

Figure 5: View on mouse chromosome 9 of the hypothetical mRNA FC2. Tracks: Extended F and C ESTs; Prediction 1 (with predicted promoter); exons of the FC2 mRNA.

1.5.3 Hypothesis FC3: known 5'-end + 5 or 6 exons: >4.7 Kbp

We can consider that the 5'-end of TC (520 bp) is the 5'-end of the FC3 mRNA and that this hypothetical mRNA is homologous to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492). We define an FC3 mRNA made up of 6 exons, with a size of 4907 bp (cf. Figure 6 below). We can also consider C and F as lying on the same exon of 3980 bp, in which case we have a hypothetical mRNA of 6 exons with a new size of 5278 bp.

View on mouse chromosome 9

The hits of the Human Precursor mRNA on mouse chromosome 9 are shown in black. The size given for hit 3 of Prediction 2 is that of hit 5 of Human Contig 1 in this region, which is labeled in green (cf. Sizes of similar regions, part 1.1.1). The hypothetical FC3 mRNA is shown in red.

[Figure 6: View on mouse chromosome 9 of the hypothetical mRNA FC3. Labels: Extended F and C ESTs; TC (520 bp); BB (228 bp); Prediction 2; exon 1 = 5'-end of FC3 mRNA; Hit 2 = 284 bp; Hit 3 = exon 2 (183 bp); Hit 4 = exon 3 (228 bp); hit of the 5'-end of Extended F (139 bp) = exon 4; hit of the F EST (2959 bp) = exon 5; hits of the C EST (650 bp) = exon 5 bis or exon 6 (3'-end of FC3 mRNA).]

1.5.4 Hypotheses F4 and C4 on different mRNAs: F4 = promoter + 4 exons: >3.5 Kbp; C4 = known 5'-end + 5 exons: >3.5 Kbp

We can consider the possibility that gene F is similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492), while C is homologous to the Human cDNA (AK092285). In that case, the Human cDNA (AK092285) sequence we have is not the complete mRNA of that human gene. It seems that the exons of the C gene and of its human homologue (AK092285) are interspersed with those of the Sodium Channel Beta-2 Subunit Precursor gene in both organisms.

View on mouse chromosome 9

We consider that all of the regions of similarity between human chromosome 11 and mouse chromosome 9 that do not belong to a Human Precursor exon constitute some part of the C4 mRNA. In Figure 7, the hypothetical C4 mRNA of 3566 bp, consisting of 5 exons, is shown in red. The 4 hypothetical exons of the F4 mRNA of 3509 bp are shown in pink. The promoter of the F4 gene is known. Since we estimated the size of the hypothetical exons only from the size of the ESTs or of the hits in the regions of similarity between human and mouse, we consider that the real exons of the C4 and F4 mRNAs are longer.

[Figure 7: View on mouse chromosome 9 of the hypothetical mRNAs F4 and C4. Tracks: Extended F and C ESTs; Prediction 2; All Confirmations; Human Contig 1. C4 exons as labelled: exon 1 (5'-end of C mRNA), 1301 bp; exon 2, 650 bp; exon 3, 748 bp; exon 4, 217 bp; exon 5 (3'-end of C mRNA), 650 bp. F4 exons as labelled: exon 1 (5'-end of F mRNA), 183 bp; exon 2, 228 bp; exon 3, 139 bp; exon 4 (3'-end of F4 mRNA). Other labels: promoter of the F gene; Hit 6 = 3'-end of Human Contig 1.]
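The two mRNA sizes quoted for hypothesis F4/C4 can be cross-checked against the exon labels of Figure 7. This is a minimal sketch, not part of the original method. The assignment of each size label to an exon is our reading of the figure, and exon 4 of F4 is assumed to be the 2959 bp hit of the F EST shown in Figures 4 to 6.

```python
# Exon sizes in bp, as read from the labels of Figure 7
# (assumption: the label-to-exon assignment below is our interpretation).
c4_exons = [1301, 650, 748, 217, 650]

# Assumption: exon 4 of F4 corresponds to the 2959 bp hit of the F EST.
f4_exons = [183, 228, 139, 2959]

c4_size = sum(c4_exons)
f4_size = sum(f4_exons)

print(c4_size)  # 3566
print(f4_size)  # 3509
```

With this reading, the sums match the 3566 bp and 3509 bp given in the text for the C4 and F4 mRNAs.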

1.6 Primers designed

1.6.1 Hypotheses confirmation

The first thing is to confirm whether or not F and C are part of the same mRNA. To test this, an RT-PCR experiment needs to be performed with the 3'-end of F as the 5' primer and the 5'-end of C as the 3' primer (cf. Sequence F in Supplement 1: EST F; for Sequence C, cf. Methodology, part 1.1). If there is amplification, hypothesis F4/C4 is false. If not, hypotheses FC1, FC2 and FC3 are false.

Hypothesis FC1: 1 exon of 5 Kbp

Here, the extended F EST and the C EST are considered to belong to the same mRNA and to constitute 1 exon. As primers, we should therefore use the 5'-end of the EST BI (cf. Sequence in Supplement 2: Extended F EST) and the 3'-end of the C EST (Sequence C cf. Methodology, part 1.1). If the size of the RT-PCR product is more than 4.5 Kbp, it confirms hypothesis FC1.

Hypothesis FC2: Promoter + 4 or 5 exons: >4 Kbp

The 5'-end is hit 1 of Prediction 1, so the 5' primer could be designed from a part of this predicted exon (cf. sequence positions 1 to 156 in Supplement 3: Prediction 1). The 3' primer will be the 3'-end of the C EST (Sequence C cf. Methodology, part 1.1). If the size of the RT-PCR product is around 4.5 Kbp, we confirm hypothesis FC2.

Hypothesis FC3: known 5'-end + 5 or 6 exons: >4.7 Kbp

The 5'-end is the EST TC504903, so we can take its sequence to design the 5' primer. The 3' primer is still the 3'-end of the C EST (Sequences cf. Methodology, part 1.1). If the size of the RT-PCR product is around 4.5 Kbp, we confirm hypothesis FC3.

Hypotheses F4 and C4 on different mRNAs: F4 = promoter + 4 exons: >3.5 Kbp; C4 = known 5'-end + 5 exons: >3.5 Kbp

The EST TC is considered to be the 5'-end of the C mRNA. Its sequence can be used to design the 5' primer, with C as the 3' primer of the RT-PCR. If the size of the RT-PCR product is around 4.5 Kbp, we confirm hypothesis C4.

The 5'-end of the F4 mRNA is the first hit of Prediction 1, so this exon can be used as the 5' primer, and the 3' primer should be the 5'-end of the EST F (5' primer: cf. sequence positions 1 to 156 in Supplement 3: Prediction 1; EST F sequence: cf. Supplement 1: EST F). If the RT-PCR product is about 2 Kbp, it confirms hypothesis F4.

1.6.2 Hypothetical exons confirmation

If none of the previous RT-PCRs reveals the 5'-end of the mRNA(s), we can try the different ESTs TC and BB as 5' primers. We should then have an idea of the exon composition of the mRNA(s), and so be able to confirm (or rule out) the existence of most of the hypothetical exons. It is still possible that the 5'-end will not have been found. In that case the 5'-end has been neither predicted nor sequenced yet, or may lie further in the 5' direction along mouse chromosome 9. More predictions on a larger genomic sequence, or a walk along mouse chromosome 9, will then be required to find the 5'-end.
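The decision rules above can be written down compactly. The following sketch only encodes the logic stated in the text for the first RT-PCR test (3'-end of F as 5' primer, 5'-end of C as 3' primer). The helper `is_around` and its 15 % tolerance are our own assumptions for what "around 4.5 Kbp" might mean on a gel; the text does not define a tolerance.

```python
SINGLE_MRNA_HYPOTHESES = ["FC1", "FC2", "FC3"]   # F and C on the same mRNA
SEPARATE_MRNA_HYPOTHESES = ["F4/C4"]             # F and C on different mRNAs


def hypotheses_still_standing(f_to_c_product_observed):
    """Apply the first test: does the F-to-C RT-PCR give a product?"""
    if f_to_c_product_observed:
        # Amplification means F and C are joined: F4/C4 is ruled out.
        return list(SINGLE_MRNA_HYPOTHESES)
    # No amplification: the single-mRNA hypotheses are ruled out.
    return list(SEPARATE_MRNA_HYPOTHESES)


def is_around(observed_kbp, expected_kbp, tolerance=0.15):
    """Hypothetical helper: True if observed is within +/- tolerance of expected."""
    return abs(observed_kbp - expected_kbp) <= tolerance * expected_kbp


print(hypotheses_still_standing(True))    # ['FC1', 'FC2', 'FC3']
print(hypotheses_still_standing(False))   # ['F4/C4']
print(is_around(4.4, 4.5))                # True
```

The size criteria for the individual hypotheses (more than 4.5 Kbp for FC1, around 4.5 Kbp for FC2, FC3 and C4, about 2 Kbp for F4) would then be applied only to the hypotheses that survive the first test.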

2 EST B: TC

The EST B is 1489 bp long. The size of the B mRNA estimated by Northern blot is about 6 Kbp. This sequence is located on mouse chromosome 2 at position . This EST has been shown to be down-regulated in some HD mouse models.

2.1 Homologous to the human protein phosphatase 1, regulatory subunit 16B (PPP1R16B) (XM_028840)

The predicted mouse mRNA XM_ (949 bp) is similar to the B EST. Figure 8 shows that it aligns by BLAST with the 3'-end of the human protein phosphatase 1, regulatory subunit 16B (PPP1R16B) (XM_028840) (6162 bp). The size of this human mRNA is similar to the expected size of the B mRNA. The B gene we are looking for therefore seems to be the homologue of this human gene.

[Figure 8: View on human gene XM_ of the mouse mRNA XM_ (human chromosome 20). Labels: mouse mRNA XM_; human mRNA XM_; exons 1 to 8, nucleotides 1 to 1246 (1246 bp); exon 9, nucleotides 1247 to 6113 (4866 bp).]

2.2 Hypothesis B: TC laps + B EST = 6412 bp

The sequence TC (2279 bp) is the mRNA of the mouse protein phosphatase 1 regulatory subunit 16B. This mRNA is incomplete, because the part recognized by the EST B and by the sequence TC is not present in this sequence. However, the 5'-end of the mRNA sequence TC is the real 5'-end of the mouse phosphatase 1 regulatory subunit 16B, because we found the similar 5'-end EST sequence BB (657 bp). We note, however, that the total of mRNA TC (2279 bp) + B EST (1489 bp) is 3768 bp, which is shorter than the estimated size.

The view of human chromosome 20 (cf. Figure 8 above) shows the complete Human XM_ gene. We reported the positions of the 9th exon along the mRNA sequence, and its size. We did the same (cf. Figure 9 below) for the 9th exon of the mouse phosphatase mRNA. The first 8 exons contain almost the same number of nucleotides in the two homologues. We also note that the mouse genomic sequence between the 5'-end of the 9th mouse exon and the 3'-end of the B EST is of a similar size to the 9th exon of the human mRNA (>4.8 Kbp, cf. positions in Figure 9 below). The hypothetical mRNA of 6412 bp was therefore defined as the first 8 exons of the mouse mRNA XM_ together with a 9th exon of 4985 bp (in red in Figure 9 below).
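The size argument behind hypothesis B is simple arithmetic, sketched below with the figures quoted in the text. The Northern blot estimate of "about 6 Kbp" is rounded to 6000 bp here, which is our simplification. The variable names are illustrative only.

```python
# Sizes in bp, taken from the text.
TC_MRNA = 2279               # known mouse phosphatase mRNA (sequence TC)
B_EST = 1489                 # EST B
NORTHERN_ESTIMATE = 6000     # "about 6 Kbp" by Northern blot (rounded, assumption)
HYPOTHETICAL_EXON_9 = 4985   # proposed complete 9th exon
HYPOTHETICAL_TOTAL = 6412    # proposed complete B mRNA

# Simply joining the two known sequences falls well short of the estimate.
joined = TC_MRNA + B_EST
shortfall = NORTHERN_ESTIMATE - joined

# Size left for exons 1 to 8 once the hypothetical 9th exon is subtracted.
exons_1_to_8 = HYPOTHETICAL_TOTAL - HYPOTHETICAL_EXON_9

print(joined)        # 3768
print(shortfall)     # 2232
print(exons_1_to_8)  # 1427
```

The 1427 bp obtained for exons 1 to 8 agrees with the value labelled in Figure 9, and the hypothetical total of 6412 bp is much closer to the Northern blot estimate than the 3768 bp obtained by joining the two known sequences.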

[Figure 9: View on mouse chromosome 2 of the hypothetical B mRNA. Labels: EST B; TC; exons 1 to 8 of the B gene, nucleotides 1 to 1427 (1427 bp); exon 9, nucleotides 1424 to 2255 (833 bp); hypothetical exon 9 of the B gene (4985 bp).]

2.3 B primer

To confirm this hypothesis, it is enough to use the 3'-end of the mouse mRNA TC (to obtain the sequence, cf. Methodology, part 1.1.2). The 5' primer should be taken from the 5' part of the sequence starting at nucleotide 1424 (exon 9 in Figure 9, labelled as nucleotides 1424 to 2255). The 3' primer should be the 5'-end of the B EST. If the PCR product is between 2.5 and 3 kb, our hypothesis is confirmed and we will have found the complete B mRNA we are looking for.
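As an illustration of how such a primer pair could be extracted from two sequences, here is a minimal sketch. It is not the procedure used in the lab. The primer length of 20 nt is an assumption, the two sequences are made-up toy strings rather than the TC mRNA or the B EST, and no check of melting temperature or specificity is attempted. The only point shown is that the 5' primer is read directly from the sense strand, while the 3' primer is the reverse complement of the chosen downstream region.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def forward_primer(sense_seq, start, length=20):
    """5' primer: `length` bases of the sense strand from 1-based position `start`."""
    begin = start - 1
    return sense_seq[begin:begin + length].upper()


def reverse_primer(downstream_sense_seq, length=20):
    """3' primer: reverse complement of the first `length` bases of the downstream sequence."""
    return reverse_complement(downstream_sense_seq[:length])


# Toy sequences, invented for this example only.
toy_upstream = "ATGGCCATTGTAATGGGCC"
toy_downstream = "ATGGCCTTTAAACCC"

print(forward_primer(toy_upstream, start=4, length=6))   # GCCATT
print(reverse_primer(toy_downstream, length=6))          # GGCCAT
```

For the B primer pair, `forward_primer` would be called on the TC mRNA sequence with `start=1424`, and `reverse_primer` on the B EST sequence.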

Acknowledgments

First, I would like to thank Dr. Nobuyuki Nukina for his invitation to work in his laboratory of Structural Neuropathology. He gave me the wonderful opportunity to come to work in Japan, and especially in the prestigious Brain Science Institute of RIKEN. Throughout the internship, he was always available to discuss my results and hypotheses.

I also thank Dr. Fumitaka Oyama for all the explanations he provided about my data. Each time I had a problem with my results, he was available to help me solve it. He also provided me with the guidance I needed to organize my work during the 2 months of my internship.

Thanks also to the secretary of the Structural Neuropathology group, Miss Harumi Taniguchi, who was always ready to provide immediate help in solving the many technical problems my colleague and I had during the training period. I thank in particular this colleague, Katrin Lindenberg, for our discussions about our results and for our many expeditions of discovery and shopping in Tokyo.

I also thank the whole team of the Structural Neuropathology Lab for its kindness. I particularly thank David Chapmon for his help in finding medical care, for his translation during the consultation and for correcting my English pronunciation.

I thank all the summer students and the many foreign researchers at RIKEN for helping me have a nice time in Japan by showing me the entertaining parts of Tokyo and by advising me about the Japanese way of life.

I also thank Jean-Michel Fayard, Guillaume Beslon, Hedi Soula and all my teachers for the help and advice they gave me before my departure and during the internship.

28 F EST >EST F: 2413 bp (sequenced in the lab, not submitted to GenBank yet)+ 316 bp of TC454157= 2729 bp TCTCTCCCCAGCCAGGGCTTCCTAGGGACAAGGGTTGGTTGACTGGGGGAGGAAGCCTACAGG AGATTGAAGACAGGGAAGGGAGGGGCTGGAGTGGTGTGGAAGGTTGGTTCCCGGATCCTGGGC ACGTGGGGTCTCCTTTAGATTTTCCCCTCTGTGAAGCCTTGTTTTCTCCTCAGTTTTCCTTCTGAT CTTTCACCAGGAAATCGGGGTGACCAGTGAGGGCTGCTTCCAAAGCTGGGGTTTGGAGATGGGT AGAGGGTGACCGCTTCAGAAGCTGGGAATGCACAAGAAGTCTAGAATGGTGTCTTCTGGGGGGG GGGGCAGTTGTGAGAGGCAAGCTGGGCTCTGAAGAATATCAGGCTTCTGGAAGTTCCTTTAGAG AGGACTTCTCTTTCCCTTACCCTAGAACACCTGCCCACACTGTCCTGGCTCCCCGACCAGCCTCC TCCTGCTGCCTGCCTAGTCTGTCTTTGCTCTCTGGGCTGCAGCTGCTGAGGAGGCTTGTGGGGA GGGGGCAGCCTCCACTCTCCTGGAGCACTGGGGTGCTATTTGCAGCTATACTGGCTTTGCTCTTT GGGTTTCAGAGGCAGGAGAACAGTGCCCCTGGTCTCCTAGCCTTTGGAATGTCTACCCCAGCCC TACAAGACTGACAGCCCTTGTCCTTGGCATGGCAGGACCATGCCACCCTGGCACTTCCGGAGCT CAGTTTTTCACTCTTCTTCCCTTCCCTTGAAACAGCTGGCATTGCCACCTTCCCTGAGGGATGCTT TCCTAGGACTTGTCATCTCATACCTTTGCTCCTTCTGTGTCCATCCAGCATGCCTGGCCTTCCCCT GCTCCTGGCCCCCCAGCTCTGGGTCTGCCTTTGCCTCAGGGACCCTTGTTTCCAGATGAGAAGG CCCTTGGCTTTTCCAGCTTCTTTTTTGCCCAGCTGGGCTGACTCCTCGCCTAGCCTGAGGCTGAG GAGGAGCTGGGAGAAGGTACTCACACCTTCTCTTGACTTCTGGCAGAGCCGGCTTGCACACCCC CTGAGTGTGGGGCTAGATTGTGCCTTAGTTCCTCGAGTCCTGGTTCTGAGCCCCTTTTCTTTCGG CTCACACTCCCTGAATTAATTGCACAGCTTGGTGTGACTTTGGCGGGGCTCCCCAGCTCCTTACC CCAAAGCCATGGAAGAGACCATGAAGCCGGGGTTGGTGGCAACCTTGATGACACCTGAGGGCA CCCTTTCTTGTCCCTGACATGGAGATAGGATGGCATTTGATGTGGGACCTTCAGATGGGTTTGAC CGTGTACAAACCGTAGTGCTAGCTAGGGTTTCTGTGATGTATGAAATGGGATACCCAAAGTCCCT CTTCCTCATCAGATTTCTGATACCCTTAATGTCAGAAGATGGAGATTAGTCCTCTTTTCAGGGGGG TGTAAGGACTGCTACAGGCTCTGCCCAGGAGTAGCTGAAGGTTCCCCCCCCAAATGGAAGTTGG GGGAGACTAAGGCACAGTAGGATCTGTAGGTGACTGTGGCTTTGGCTAGTGTCTGTTGCCCAAG CCAAGGGGCTCTTGGGGTTGCCTCTACTCTTCCCATTCTTCTTTACCCAGAACTCATTGTGAGCT GGGTAAAAATTGCCCATCTCCTGCTTTTTAAATATTTATTTGAGCAGAGTCTCATGTGTGGCCCAG GCGGGCCTCCACCTCTCTATGTAGCCAAGACTGGCCTTGAACTCCCAATCTCCTGCCTCCATTGC CACAGTGCTGGTATGACAGGTGTGAGCCCACACCCTGCTTAGAGTAACCTTGCTCTGAGAACCAA 
CATGGCACCCGAGCCTCCAGCCATTCAGGAAACTTCCAGCTGCCTTCATGTAAAACTGCTTTCTC CCCCAACACTGGAAGAGGCCAAGTGTTGGGGGTTCTTCTTGCTTTCCTGAGAGGAAGCCAAGGC ATAGAGCAGAAGAGAGGGAGGGACTCTCCCTTCCCAGCTTCCTGCTCATTGTCAGCTTATAGGCA GCCCTTGCAGCTTCTCCCATCTACCCAAAGGGTGAAATAATACCTACCTCACAGGACTGCAGTGA GGCTTGGTGAGATTTTTGTGTTTTTTGTTTTTTTGGCCTGGCTTGGAAAGGCACTGGGAAACAAG GCTAATAACCAGCGAGAATGTTCCACATCTATCCTGTCCTCATCTCTGGTTTGCATCCCAATAATA TGCATATGCCTCATTCTTCTTCCTTTAGCAACCTTAGGCATCATGACTCAGATGCTTAAAGCATCTT TGTCCCCGGTTCTTTTTTTTTTTTTTTTTTTTTTTTGATGGAGGTACCTGGGACTATGGGAGTACTT TTTTATATTGTTGTTGCCCCAATGCCTGTGATAAATACTAGCGTTTAATGGATAGGGATTAAGAGC ACAAATCTCAGTCC TCTTAACAAAGAATGTCTGGCCTAGTGCTAGCGGCATGCCTGTGCAGGCATTACCACGGATTGTG TTAGAATGTATATTTGCAAAGCCATTTTCTCTAGCCAGACCCTCTGACAGGCAAGTCTTCAAATAG CGATCTCAGGGTTGCTGAGGTTGGTCCCGGTGCCAGTGGGCTACAGCACCTCTCATACGGTTGA CTTTGGGGAAACCTGGACCCATGCAGTTGTGTTGACCTTGATGTCAGTGAGACCAAAGACAAAGC ACAAGTACCTTACTCTTGACTTCCAAATAAACTTCTGCCCTTGAGGGCTCAGAAAA Supplement 1

29 Extended F EST >Extended F EST : 366 bp of 5 -end of BI bp (sequenced in the lab, not submitted to GenBank yet)+ 316 bp of TC454157= 3095 bp ATTGGAAAAAGTGGACAACACGGTGACTCTCATCATCCTGGCTGTGGTGGGCGGGGTCAT TGGACTTCTTGTGTGCATCCTTCTGCTGAAGAAGCTCATCACCTTCATCCTGAAGAAGAC CCGAGAGAAGAAGAAGGAGTGTCTCGATGAGTTCCTCTGGGAATGACAACACAGAGAACG GGTTGCCTGGCTCCAAGGCAGAAGAGAAGCCACCCACAAAAGTGTGAGGCCCTGCTCGGGCCAAGCAGGG CAGGGAGCCTCGCTTTCTGATGGTGATCCTGATGCCAAGTCCTATCTGAG ATGTGTGCTGCTTGGCCCAAACTGTTCTTTCTGAGCAGGAAGGACCTGGCCCTGCCCAGC TGCCGT TCTCTCCCCAGCCAGGGCTTCCTAGGGACAAGGGTTGGTTGACTGGGGGAGGAAGCCTACAGGAGATTGAA GACAGGGAAGGGAGGGGCTGGAGTGGTGTGGAAGGTTGGTTCCCGGATCCTGGGCACGTGGGGTCTCCTT TAGATTTTCCCCTCTGTGAAGCCTTGTTTTCTCCTCAGTTTTCCTTCTGATCTTTCACCAGGAAATCGGGGTGA CCAGTGAGGGCTGCTTCCAAAGCTGGGGTTTGGAGATGGGTAGAGGGTGACCGCTTCAGAAGCTGGGAATG CACAAGAAGTCTAGAATGGTGTCTTCTGGGGGGGGGGGCAGTTGTGAGAGGCAAGCTGGGCTCTGAAGAAT ATCAGGCTTCTGGAAGTTCCTTTAGAGAGGACTTCTCTTTCCCTTACCCTAGAACACCTGCCCACACTGTCCT GGCTCCCCGACCAGCCTCCTCCTGCTGCCTGCCTAGTCTGTCTTTGCTCTCTGGGCTGCAGCTGCTGAGGA GGCTTGTGGGGAGGGGGCAGCCTCCACTCTCCTGGAGCACTGGGGTGCTATTTGCAGCTATACTGGCTTTG CTCTTTGGGTTTCAGAGGCAGGAGAACAGTGCCCCTGGTCTCCTAGCCTTTGGAATGTCTACCCCAGCCCTA CAAGACTGACAGCCCTTGTCCTTGGCATGGCAGGACCATGCCACCCTGGCACTTCCGGAGCTCAGTTTTTCA CTCTTCTTCCCTTCCCTTGAAACAGCTGGCATTGCCACCTTCCCTGAGGGATGCTTTCCTAGGACTTGTCATC TCATACCTTTGCTCCTTCTGTGTCCATCCAGCATGCCTGGCCTTCCCCTGCTCCTGGCCCCCCAGCTCTGGG TCTGCCTTTGCCTCAGGGACCCTTGTTTCCAGATGAGAAGGCCCTTGGCTTTTCCAGCTTCTTTTTTGCCCAG CTGGGCTGACTCCTCGCCTAGCCTGAGGCTGAGGAGGAGCTGGGAGAAGGTACTCACACCTTCTCTTGACT TCTGGCAGAGCCGGCTTGCACACCCCCTGAGTGTGGGGCTAGATTGTGCCTTAGTTCCTCGAGTCCTGGTT CTGAGCCCCTTTTCTTTCGGCTCACACTCCCTGAATTAATTGCACAGCTTGGTGTGACTTTGGCGGGGCTCC CCAGCTCCTTACCCCAAAGCCATGGAAGAGACCATGAAGCCGGGGTTGGTGGCAACCTTGATGACACCTGA GGGCACCCTTTCTTGTCCCTGACATGGAGATAGGATGGCATTTGATGTGGGACCTTCAGATGGGTTTGACCG TGTACAAACCGTAGTGCTAGCTAGGGTTTCTGTGATGTATGAAATGGGATACCCAAAGTCCCTCTTCCTCATC AGATTTCTGATACCCTTAATGTCAGAAGATGGAGATTAGTCCTCTTTTCAGGGGGGTGTAAGGACTGCTACAG 
GCTCTGCCCAGGAGTAGCTGAAGGTTCCCCCCCCAAATGGAAGTTGGGGGAGACTAAGGCACAGTAGGATC TGTAGGTGACTGTGGCTTTGGCTAGTGTCTGTTGCCCAAGCCAAGGGGCTCTTGGGGTTGCCTCTACTCTTC CCATTCTTCTTTACCCAGAACTCATTGTGAGCTGGGTAAAAATTGCCCATCTCCTGCTTTTTAAATATTTATTTG AGCAGAGTCTCATGTGTGGCCCAGGCGGGCCTCCACCTCTCTATGTAGCCAAGACTGGCCTTGAACTCCCA ATCTCCTGCCTCCATTGCCACAGTGCTGGTATGACAGGTGTGAGCCCACACCCTGCTTAGAGTAACCTTGCT CTGAGAACCAACATGGCACCCGAGCCTCCAGCCATTCAGGAAACTTCCAGCTGCCTTCATGTAAAACTGCTT TCTCCCCCAACACTGGAAGAGGCCAAGTGTTGGGGGTTCTTCTTGCTTTCCTGAGAGGAAGCCAAGGCATAG AGCAGAAGAGAGGGAGGGACTCTCCCTTCCCAGCTTCCTGCTCATTGTCAGCTTATAGGCAGCCCTTGCAG CTTCTCCCATCTACCCAAAGGGTGAAATAATACCTACCTCACAGGACTGCAGTGAGGCTTGGTGAGATTTTTG TGTTTTTTGTTTTTTTGGCCTGGCTTGGAAAGGCACTGGGAAACAAGGCTAATAACCAGCGAGAATGTTCCAC ATCTATCCTGTCCTCATCTCTGGTTTGCATCCCAATAATATGCATATGCCTCATTCTTCTTCCTTTAGCAACCTT AGGCATCATGACTCAGATGCTTAAAGCATCTTTGTCCCCGGTTCTTTTTTTTTTTTTTTTTTTTTTTTGATGGAG GTACCTGGGACTATGGGAGTACTTTTTTATATTGTTGTTGCCCCAATGCCTGTGATAAATACTAGCGTTTAATG GATAGGGATTAAGAGCACAAATCTCAGTCC TCTTAACAAAGAATGTCTGGCCTAGTGCTAGCGGCATGCCTGTGCAGGCATTACCACGGATTGTGTTAGAAT GTATATTTGCAAAGCCATTTTCTCTAGCCAGACCCTCTGACAGGCAAGTCTTCAAATAGCGATCTCAGGGTTG CTGAGGTTGGTCCCGGTGCCAGTGGGCTACAGCACCTCTCATACGGTTGACTTTGGGGAAACCTGGACCCA TGCAGTTGTGTTGACCTTGATGTCAGTGAGACCAAAGACAAAGCACAAGTACCTTACTCTTGACTTCCAAATA AACTTCTGCCCTTGAGGGCTCAGAAAA Supplement 2


More information
