Host : Dr. Nobuyuki Nukina Tutor : Dr. Fumitaka Oyama



Method to assign the coding regions of ESTs

Céline Becquet
Summer Program 2002
Structural Neuropathology Lab, Molecular Neuropathology Group, RIKEN Brain Science Institute
Host: Dr. Nobuyuki Nukina
Tutor: Dr. Fumitaka Oyama

Abstract

The present study investigates deregulated genes in Huntington disease, an autosomal dominant neurodegenerative disorder caused by polyglutamine expansion in the disease protein, huntingtin. A GeneChip experiment was previously performed to compare mRNAs extracted from cerebrum cells of wild-type (WT) mice and of HD mouse models expressing the pathological form of the huntingtin protein. We identified several ESTs that may be involved in the pathogenesis of Huntington disease. To find more information about these ESTs, we developed a bioinformatical method that finds the 5'-ends of the ESTs and predicts the hypothetical coding sequences (CDSs) and exons of the corresponding mRNAs. Using this method, we were able to show homology between the mouse ESTs TC and TC and the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492) and a Human cDNA (AK092285). We built several hypotheses about these two mouse mRNAs, which are about to be tested. We also showed that the mouse protein phosphatase (EST TC478977) is an incomplete mRNA sequence: the 9th exon of its mRNA is the 5'-end of the complete 9th exon, whose 3'-end is the mouse EST TC. Confirmation of our hypotheses is now in progress.

Contents

Abstract
Contents
Introduction
Bioinformatical Methodology
  Blast search using Tigr or NCBI databases
    Tigr Database
      Making a search
      Mouse Gene Index Report
    NCBI Database
      Non-redundant database
      EST-Mouse database
  Chromosome location and genomic sequence selection
    Chromosome location
    Genomic sequence selection
      Selection process
      Use of the genomic sequence
  Exons prediction using genomic sequence
    Process of prediction
    Prediction analysis
  Search in RIKEN 5'-ends sequences database
    Process of the search
    Which sequence for which result
  Hypothetical mRNA, primers
    Mouse mRNA
    Hypothetical mRNA
  Confirmation by RT-PCR
Results & Hypotheses
  ESTs F and C
    Description
      EST F: TC
      EST C: TC
    Extension in 5' direction
    Homology
      Homology to an mRNA similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492)
      Homology to Human cDNA (AK092285)
      Views on the two genomes (Figure 2)
    5'-ends and confirmation of the predicted exons and conserved regions
    Hypotheses F and C on the same mRNA
      Hypothesis FC1: 1 exon of 5 Kbp
      Hypothesis FC2: Promoter + 4 or 5 exons: >4 Kbp
      Hypothesis FC3: known 5'-end + 5 or 6 exons: >4,7 Kbp
      Hypotheses F4 and C4 on different mRNAs: F4 = Promoter + 4 exons: >3,5 Kbp; C4 = Known 5'-end + 5 exons: >3,5 Kbp
    Primers designed
    Hypotheses confirmation
    Hypothetical exons confirmation
  EST B: TC
    Homologous to the human protein phosphatase 1, regulatory subunit 16B (PPP1R16B) (XM_028840)
    Hypothesis B: TC laps + B EST = 6412 bp
    B primer
Conclusion
Acknowledgments
Supplements
  1- F EST
  2- Extended F EST
  3- Prediction 1
  4- Prediction 2

Introduction

Through the Summer Program 2002, the Brain Science Institute of RIKEN gave me the opportunity to work in the Laboratory for Structural Neuropathology directed by Dr. Nukina. In this laboratory, researchers are attempting to find the genes that are deregulated in Huntington disease. This disease is an autosomal dominant neurodegenerative disorder caused by polyglutamine expansion in the disease protein, huntingtin. Some of these genes of interest are novel and of unknown function, so much of the researchers' effort goes into finding information about them.

My supervisor, Dr. Oyama, performed a GeneChip experiment to compare the mRNAs extracted from cerebrum cells of wild-type (WT) mice and of HD mouse models. The HD model mice express the pathological form of the huntingtin protein. On a GeneChip, many of the nucleotide targets are EST sequences. An EST is a part of an mRNA sequence whose complete sequence, function and constitution are unknown. As a result, these target sequences hybridize with mRNA probes about which we have almost no information.

The results of the GeneChip analysis show that some ESTs are strongly down-regulated in the HD mice. The down-regulation of these genes was confirmed by northern blotting of the transcriptome of cerebrum cells of the WT and HD mice. This northern blot also allowed the size of the mRNAs corresponding to the ESTs of interest to be estimated. The ESTs whose down-regulation was confirmed by in situ hybridization in different HD mouse models may be involved in the pathogenesis of Huntington disease.

The first step in the investigation of these ESTs of interest is the amplification of the corresponding mRNAs by Reverse Transcriptase-PCR (RT-PCR). To do so, we need the 5'-end sequence of the mRNAs. However, this information is unknown for the ESTs. It is in this context that I was asked to develop a bioinformatical method to find the 5'-ends of the ESTs implicated in Huntington disease. This method uses only tools that are freely available on the Internet. It also predicts the hypothetical coding sequences (CDSs) and exons of the mRNAs corresponding to these ESTs. In the following report, I explain my methodology. I then present the results and the hypotheses I made about 3 ESTs that are particularly interesting because of their strong and confirmed down-regulation in different HD mouse models.

Bioinformatical Methodology

Using the GeneChip, the mouse mRNAs corresponding to the ESTs of interest could be identified. It is necessary to find the 5'-end of these mRNAs and to determine their constitution in terms of exons and introns. In addition, some idea of the function of these genes would be invaluable. The methodology described below presents the different stages in the search for information on the ESTs.

1 Blast search using Tigr or NCBI databases

To begin any work we need the EST sequence. The GeneChip manufacturer provides the TC number associated with the nucleotide target for each spot of the GeneChip.

1.1 Tigr Database

To find the information associated with a TC number, we search the database of The Institute for Genomic Research (Tigr), which is available on the Tigr web site.

1.1.1 Making a search

To search the Tigr database with a TC number, we go to the Gene Indices page.

BLAST algorithm

On the Gene Indices page, the link BLAST search displays a Query page. On that Query page, we can choose to work on the mouse database only or on both the human and mouse databases. The BLAST algorithm finds nucleotide sequences in the Tigr databases that are similar to the input sequence. For each similar sequence, a link provides the Mouse Gene Index Report (MGI Report, cf. Mouse Gene Index Report part below). This algorithm is also useful later to confirm the predicted exons (cf. Exons prediction part 3 below).

Search index by identifier

Some links allow the genome of a specific organism to be selected. The Mouse link opens the Mouse Gene Index page. The Tigr Mouse Gene Index page provides several tools to search the Tigr mouse databases using the sequence of the ESTs. Selecting the link Search Index by Identifier (TC, ET, EST, GB) displays the MGI Report page, where it is possible to enter the TC number of the GeneChip target and retrieve the information on the EST corresponding to this identifier.

1.1.2 Mouse Gene Index Report

The MGI Report gives the mRNA sequence corresponding to the TC number. The size of the sequence and the predicted open reading frames are provided. There are also alignments of all the ESTs that match this mRNA sequence. For each of these ESTs, a link leads to its sequence and ID numbers.

For some ESTs, the MGI Report displays the link Expression summary. This page provides information on the expression of the mRNA matched by the EST in different cell types.

For some TC numbers, the MGI Report also provides the Tentative Ortholog Group. This link leads to a page where similar ESTs of different organisms are aligned. Thus, it is possible to find genes homologous to the query mRNA in other organisms.

For most of the TC numbers, the position of the EST on the mouse genome is provided in the MGI Report. A link leads to the Genomic Context of the EST's sequence. Here, the alignment of some homologous ESTs or genes in other organisms' genomes can be found.

1.2 NCBI Database

The web site of the National Center for Biotechnology Information (NCBI) provides all the tools necessary to search its databases. We can find information about any biological entity in the different available databases. To do the search, we select the database (PubMed, Protein, Nucleotide, Structure, Genome, etc.). The search can be made with an ID number, a keyword, or even an author name.

The BLAST link displays tools for aligning a protein or nucleotide sequence with the sequences of the available databases (RefSeq models, GenBank, EMBL, etc.). In the method, we often use the EST-Mouse and the non-redundant databases. To search these two databases, we select the link Standard nucleotide-nucleotide BLAST [blastn], which displays the Query page, where the query sequence can be entered. We then select the database we want to work with in the corresponding field of the Query page.

1.2.1 Non-redundant database

A search in the non-redundant database finds all of the sequences similar to the query sequence. With the default options, 100 sequences are displayed, irrespective of the score of the alignment between the 2 sequences. It is important to be critical about the results provided at this point. When 2 sequences longer than 200 nucleotides share only one hit and the identical stretch is shorter than 25 nucleotides, the term similar means almost nothing. All we can say is that a small region is similar, so that, for example, a motif may be conserved between the two sequences.
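The caution above about weak hits can be sketched as a simple filter. This is only an illustration, not part of the original method: the `Hit` record, its field names and the function names are hypothetical, and the thresholds (200 nucleotides, 25 identical nucleotides) are taken directly from the rule of thumb stated in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical summary of one BLAST subject sequence (not a real BLAST API)."""
    subject_id: str
    query_len: int
    subject_len: int
    # Number of identical nucleotides in each aligned segment (one entry per hit)
    hsp_identities: list = field(default_factory=list)


def is_weak_similarity(hit, min_seq_len=200, min_identities=25):
    """True when two long sequences share a single, very short identical stretch."""
    both_long = hit.query_len > min_seq_len and hit.subject_len > min_seq_len
    single_short_hit = (
        len(hit.hsp_identities) == 1 and hit.hsp_identities[0] < min_identities
    )
    return both_long and single_short_hit


def keep_meaningful(hits):
    """Drop the subjects for which 'similar' means almost nothing."""
    return [h for h in hits if not is_weak_similarity(h)]
```

A subject with several short hits is kept here on purpose, since the text only warns about the single-hit case; such results still deserve a critical look.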

Moreover, the identifiers differ depending on the type of sequence found. Caution is especially necessary with XM sequences (mRNA sequences produced by the NCBI's Genome Annotation Project). These represent the known or potential transcripts of a gene. Such a sequence is not as reliable as a sequence annotated by experiment.

To find a human homologous sequence, or to find the mRNA sequence corresponding to an EST sequence, we have to study the first, best-scoring sequences that BLAST proposes. For each of the similar sequences found, a link provides the Information Sheet of the sequence. There we can see whether the sequence has been predicted or is an experimental mRNA, and whether the provided sequence covers only the coding regions or the total mRNA.

1.2.2 EST-Mouse database

If the search in the non-redundant database does not provide the total mRNA corresponding to the EST sequence, we have to find mouse ESTs that could extend the sequence we have. We can blast the EST sequence directly against the EST-Mouse database. In that case, however, we usually find only ESTs shorter than our sequence, because the sequence provided by the Mouse Gene Index Report is a merge of all the ESTs that match the same part of an mRNA. Other ESTs may nevertheless match other parts of the mRNA. Therefore, blasting the EST sequence against the EST-Mouse database may occasionally find a sequence that shares a short similar region with our EST sequence and can extend it in one direction or the other.

Generally, we use this database by blasting a part of the genomic sequence near the position of our EST of interest (cf. Genomic sequence selection part 2.2 below). If we find ESTs similar to these genomic regions, we can assume that these regions belong to the mRNA we are searching for. We can also blast a predicted mRNA against this database (cf. Exons prediction part 3 below). If we find similar EST sequences, we can have confidence in the predicted exons that match these sequences.

2 Chromosome location and genomic sequence selection

The searches above can generate a large number of similar sequences. It is useful to see their positions relative to the original EST sequence along the mouse chromosome. It is also useful to find the positions of homologous genes or of conserved regions in another organism. We are mostly interested in the human homologue because the human genome is well annotated.

2.1 Chromosome location

To do so, we blast the different sequences against the Human Genome and Mouse Genome databases that the NCBI web site provides.

If the sequence we blast is shorter than 100 nucleotides, it is better to change the default options (Expect set to 10 and filter set to none). The BLAST result page displays the alignments and the positions along the contigs in which the similar regions are located. For each alignment, a link leads to a Genome View displaying the hits of the query sequence along the organism's genome. By selecting the chromosome of interest on the Genome View, it is possible to see the alignment of the similar regions on this chromosome. On this Chromosome View, the positions given are relative to the whole chromosome and therefore differ from the positions on the contig given by the BLAST result page.

2.2 Genomic sequence selection

2.2.1 Selection process

On the Chromosome View, two fields allow the positions of the view to be changed. We can choose between a zoom on the hits and a global view around the hits. To save the selected contig sequence, we open the Download/View Sequence/Evidence page. This page reports the positions on the chromosome and on the contig. The links Display and Save to Disk allow the genomic sequence to be saved. The link View Evidence displays all the RefSeq models, GenBank mRNAs, annotated known or potential transcripts, and ESTs that align to the area of interest.

2.2.2 Use of the genomic sequence

Zoomed sequence

By blasting a homologous mRNA of another organism (e.g. human) against the Mouse Genome, we can find conserved regions on the mouse chromosome. We can then select the zoomed sequences of the similar regions of the mouse chromosome and blast them against the EST-Mouse database (cf. part 1.2 above) or against the Tigr Mouse database (cf. BLAST search part above). If we find an EST similar to a genomic region, we can predict with some confidence that this region is a conserved coding region and belongs to an exon of the mouse mRNA we are searching for.

Big genomic sequence

Some exon prediction algorithms (cf. Exons prediction part 3 below) can analyze a big genomic sequence around the hit of the original EST's sequence. Moreover, this sequence could contain all or part of the gene we are searching for. By blasting this mouse genomic sequence against the Human Genome, we can find the conserved regions. If the human homologue is known and well annotated, and if these regions coincide with exons of the homologous gene, we can consider these regions as exons or coding regions of the mouse gene.
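The distinction made in part 2.1 between positions on a contig and positions on the whole chromosome can be illustrated with a small conversion helper. This is a minimal sketch under stated assumptions: coordinates are 1-based, the placement of the contig on the chromosome (its start and orientation) is known, and the function and parameter names are hypothetical rather than taken from any NCBI tool.

```python
def contig_to_chromosome(pos_on_contig, contig_start_on_chromosome,
                         strand=1, contig_length=None):
    """Convert a 1-based position on a contig into a 1-based chromosome position.

    strand=1  : the contig lies in the same orientation as the chromosome.
    strand=-1 : the contig is reversed; contig_length is then required.
    """
    if strand == 1:
        return contig_start_on_chromosome + pos_on_contig - 1
    if strand == -1:
        if contig_length is None:
            raise ValueError("contig_length is required for a reversed contig")
        # Position 1 of a reversed contig maps to the last chromosome base it covers
        return contig_start_on_chromosome + contig_length - pos_on_contig
    raise ValueError("strand must be 1 or -1")
```

Converting all hits to chromosome coordinates first makes it possible to compare hits that were reported on different contigs.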

3 Exons prediction using genomic sequence

If we did not find any homologue or any mouse mRNA sequence using the techniques described above, exon prediction may provide information about the mRNA's constitution. Furthermore, even when the sequence of the complete CDS has been found, it is still of interest to define the non-coding regions of the mRNA.

3.1 Process of prediction

The algorithms we used are Genscan and Grail. Their options allow the kind of organism we work on to be selected. Grail predictions can be checked against the nucleotide and EST databases, and Grail can sometimes predict the promoter of a predicted gene. The process consists of analyzing, with these algorithms, mouse genomic sequences of different sizes covering the region around the EST's position. The algorithms often give different results, but some exons are predicted with a good score by both methods and with different sizes of genomic sequence.

3.2 Prediction analysis

To have a clear view of their positions on the mouse chromosome, it is useful to blast the sequences of the predicted mRNAs against the Mouse Genome (cf. Chromosome location part 2.1 above). Comparing the positions of the predicted exons with the hits of the homologous human gene (if known) along the mouse chromosome allows the prediction to be confirmed. We can also blast these sequences against the Human Genome (cf. Chromosome location part 2.1 above). If the homologue is well annotated, the hits of the predicted mRNA that are similar to exons of the human gene confirm that the corresponding regions are exons of the mouse mRNA we are searching for.

It is also useful to blast the predicted mRNAs against the EST-Mouse database (cf. part 1.2 above) and the Tigr Mouse database (cf. BLAST search part above). In this manner, ESTs may be found that confirm that the predicted exons are part of the mRNA we are searching for. For each EST that appears to correspond to our mRNA, we must confirm that it comes from the same chromosome as our mRNA, because with the algorithms' default options some short EST sequences appear in the results even when their scores are poor. If the MGI Report does not give the location (cf. Mouse Gene Index Report part 1.1.2 above), we have to blast the EST's sequence against the Mouse Genome. If the EST's sequence is too short, we have to change the default options (cf. Chromosome location part 2.1 above). Then, by comparing the Chromosome Views of the ESTs, the predicted mRNAs, and the regions conserved between mouse and human, we can check which exons are confirmed.
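The comparison of Chromosome Views described above amounts to asking which predicted exons overlap a piece of independent evidence (an EST hit or a human-homologue hit) on the same chromosome. In the report this was done by eye; the sketch below shows the same idea as an interval-overlap test. The function names, the dictionary layout and the `min_overlap` parameter are assumptions made for illustration.

```python
def overlap(a, b):
    """Number of positions shared by two inclusive intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def confirmed_exons(predicted_exons, evidence_hits, min_overlap=1):
    """Map each predicted exon to the evidence hits that overlap it.

    predicted_exons, evidence_hits: dicts of name -> (start, end), all expressed
    in coordinates of the same chromosome. Exons without support are omitted.
    """
    confirmed = {}
    for exon_name, exon in predicted_exons.items():
        support = [
            label for label, hit in evidence_hits.items()
            if overlap(exon, hit) >= min_overlap
        ]
        if support:
            confirmed[exon_name] = support
    return confirmed
```

Raising `min_overlap` gives a stricter notion of confirmation, which may be preferable when very short hits are present in the results.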

4 Search in RIKEN 5'-ends sequences database

The most valuable piece of information we are searching for is the 5'-end of the mRNA. If this sequence is known, it is possible to design a primer and amplify the mRNA of interest by RT-PCR.

4.1 Process of the search

The Gene Science Laboratory of the Genome Exploration Research Group of RIKEN works on the mouse full-length cDNA encyclopedia project. This project involves collecting data on most of the mouse full-length cDNAs, their primary structures and expression sites. It builds databases of mouse 5'- and 3'-ends and of full-length cDNA sequences. These databases are available on the project's web site. To search these databases, we select the link Search RIKEN Mouse cDNA Encyclopedia on the Home page, or the link Our Activities. The Our Activities page displays tools to work on this Encyclopedia. The link Homology search on Our database displays the page of the RIKEN Mouse Encyclopedia Index, where the link Homology search leads to a Query page. On this page, we can enter a nucleotide sequence and blast it against the RIKEN databases. A field allows the search to be restricted to one or two of the databases. The result page gives the ID numbers and links of the cDNA sequences that align with the input sequence. For each cDNA sequence, a link leads to an Information Sheet where we can find the nucleotide sequence and other information about it.

4.2 Which sequence for which result

When we have the mRNA or predicted mRNA corresponding to our EST, we can blast it against the RIKEN 5'-ends sequences database. If we find a 5'-end sequence corresponding to the 5'-end of our mRNA, we have enough information to define a primer. If we do not find a corresponding 5'-end, the prediction together with the hits of the human homologue suggests the position of the 5'-end on the mouse genome. Blasting a part of the genomic sequence around this position (cf. Genomic sequence selection part 2.2 above) against the RIKEN 5'-end cDNA sequences database may then identify the 5'-end sequence.

Sometimes, the Information Sheet displays a 3'-end sequence associated with the 5'-end sequence we found. In that case, we blast these two sequences against the Mouse Genome. With this information, we can check whether our EST corresponds to this 3'-end sequence, and hence whether we have predicted the correct gene rather than the following gene on the mouse chromosome.

5 Hypothetical mRNA, primers

From the data we managed to collect for our mouse EST, we can design hypothetical mRNAs for this EST. It is then possible to select specific sequences to serve as primers for confirming the hypotheses.

5.1 Mouse mRNA

Sometimes it is possible to find the mouse mRNA sequence that corresponds to our EST. In that case, the primer can be taken from the 5'-end of this mRNA sequence. If the 5'-end has been confirmed by a blast against the RIKEN 5'-end cDNA sequences database, we can use the RIKEN sequence to design the primer.

5.2 Hypothetical mRNA

If we do not have an experimentally determined mouse mRNA sequence but only a predicted mRNA, several different predictions of the mRNA's constitution may be equally plausible. These hypotheses are built by considering which exons correspond to a hit of the human homologue on the mouse chromosome, which exons correspond to a mouse EST, and whether a RIKEN 5'-end has been found.

If we have confirmed the 5'-end of the predicted mRNA, we can use the RIKEN 5'-end sequence to design a primer. If we do not have confirmation of the 5'-end of the predicted mRNA, we can use the first predicted exon to design a primer. If the subsequent amplification does not work, the 5'-end of the human homologous mRNA (if known) may also be tried. To confirm our hypotheses concerning the mRNA's constitution, we can design primers from each of the predicted exon sequences or from the EST sequences that confirm the predicted exons.

6 Confirmation by RT-PCR

The next step consists in confirming the hypotheses. To do so, we perform RT-PCR using the different primers designed above. The northern blot of the RT-PCR product shows whether or not the primer belongs to our mRNA. It also gives the size of the RT-PCR product. If this size is similar to the estimated size (cf. Introduction above), there is a high probability that the primer we used is the 5'-end of the mRNA we want to study. The RT-PCR products are all sequenced, and an analysis of the sequences (blast against the Mouse Genome, cf. Chromosome location part 2.1 above) will reveal the real exons. By comparing the Chromosome Views, we will confirm or reject the hypothetical exons defined by the method.
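As an illustration of how a candidate primer might be taken from a 5'-end sequence (part 5), the sketch below extracts the first nucleotides of the sense strand and reports two common quick checks. The report does not say how its primers were evaluated: the primer length of 20, the GC fraction and the Wallace rule of thumb (Tm = 2(A+T) + 4(G+C), intended for short oligonucleotides) are assumptions added here, and the function names are hypothetical.

```python
def forward_primer(five_prime_seq, length=20):
    """Return the first `length` nucleotides of the 5'-end as a candidate primer."""
    seq = five_prime_seq.upper().replace("U", "T")
    if len(seq) < length:
        raise ValueError("sequence is shorter than the requested primer length")
    return seq[:length]


def gc_fraction(primer):
    """Fraction of G and C nucleotides in the primer."""
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees Celsius (Wallace rule, short oligos only)."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc
```

A real design would also check specificity, for example by blasting the candidate against the Mouse Genome as described in part 2.1.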



Results & Hypotheses

We now present the information and data that we found, using the method described above, about 3 ESTs that are particularly down-regulated in HD mice.

1 ESTs F and C

These 2 ESTs seem to play an important role in the pathogenesis of Huntington disease. They are both strongly down-regulated in the HD 150-1 8 weeks mouse model. This down-regulation was confirmed in the C01 16 weeks mouse model, as can clearly be seen on the northern blots (cf. Figure 1 below). ESTs F and C are studied together because, as we will see below, they are linked to each other.

[Figure 1: Northern blots of ESTs C and F in wild-type and HD mice. Labels on the blots: 28S; 4,6 kb; 5 kb.]

1.1 Description

1.1.1 EST F: TC

The size of the F mRNA has been estimated by northern blot at 5 kb (cf. Figure 1 above). The F EST sequence we have is 2729 bp long (cf. Supplement 1: F EST). It consists of a sequence of 2413 bp sequenced in the lab (not yet submitted to GenBank) and the 3'-end of the sequence we found in the MGI Report for TC. Its position on mouse chromosome n°9 is

1.1.2 EST C: TC

The C sequence we have is 568 bp long (this sequence is available in the Tigr Database, cf. Methodology, part 1.1.2). The size of the C mRNA estimated by northern blot is about 4,6 Kbp (cf. Figure 1 above). The position of this sequence on mouse chromosome n°9 is

1.2 Extension in 5' direction

We found the 5' sequence of mouse cDNA BI (918 bp), which extends sequence F (identity between positions of sequence BI and the 5'-end (positions 1-313) of sequence F). We merged the F EST sequence of 2729 bp with the first 366 nucleotides of BI; the 3'-end of this EST was thus discarded. We obtained an extended F EST of 3095 bp (cf. Supplement 2: Extended F EST). Its positions on mouse chromosome n°9 are and This extension will have to be confirmed.

1.3 Homology

We took the mouse genomic sequence Mouse Contig 1, from positions to on mouse chromosome n°9 (= positions on the mouse contig NW_, cf. Methodology, Genomic sequence selection part 2.2), to check for parts of mouse chromosome n°9 that are conserved in the human genome (cf. Methodology, Chromosome location part 2.1). We found that this part of mouse chromosome n°9 is similar to a part of human chromosome n°11.

1.3.1 Homology to an mRNA similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492)

The view of human chromosome n°11 shows that most of the hits of the mouse genomic sequence Mouse Contig 1 fall within the gene of a human mRNA similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492). However, most of these hits do not coincide with an exon of this mRNA. This could be because this mRNA has been predicted: the predicted exons could be too short, and the similar regions could be extensions of these predicted exons. Alternatively, another gene could be interleaved with the Human Precursor gene. The predicted Human Precursor mRNA (XM_171492) is only 1182 bp long. Its position on human chromosome n°11 is (cf. view part below)

1.3.2 Homology to Human cDNA (AK092285)

We found that EST C is similar to the 3'-end of the Human cDNA (AK092285), whose function is unknown. Its size is 2766 bp. Its location on human chromosome n°11 follows the 3'-end of the Human mRNA (XM_171492) described above. To see the positions of the conserved regions on mouse chromosome n°9, we also took a human genomic sequence, Human Contig 1, from the similar region of human chromosome n°11. We used the positions of human chromosome n°11 (= positions on the human contig NT_, cf. Methodology, Genomic sequence selection part 2.2).

1.3.3 Views on the two genomes (Figure 2)

Figure 2 shows the conserved parts of the mouse genomic sequence and the hits of the Human cDNA (AK092285) along human chromosome n°11. It also shows the conserved parts of human chromosome n°11 on mouse chromosome n°9. The annotated gene LOC corresponds to the mRNA similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492).

The labels in green show the regions that are similar in both genomes. As the two homologous regions are not oriented in the same direction in the two genomes, the numbering of the hits is reversed. However, hit 1 of the Mouse Contig 1 on the human chromosome is exactly the sequence where hit 1 of the Human Contig 1 blasts on mouse chromosome n°9. The green labels are used to map similar parts between mouse chromosome n°9 and human chromosome n°11.

We also report the positions of the regions of similarity between the exons of the mRNA of the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492) and mouse chromosome n°9. The labels in black show the positions of the exons constituting this Human mRNA (XM_171492) and its hits in the mouse genome.
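The 5' extension of EST F described in part 1.2 above (366 upstream nucleotides of the cDNA prepended to the 2729 bp F EST, giving 3095 bp) can be sketched as a simple merge on a shared stretch. This is an illustration only: the report located the overlap with BLAST, whereas the sketch assumes an exact match of the first `min_overlap` nucleotides of the EST inside the upstream sequence, and the function name and parameter are hypothetical.

```python
def extend_five_prime(est, upstream, min_overlap=20):
    """Prepend to `est` the part of `upstream` that precedes their overlap.

    The overlap is located by searching `upstream` for the first `min_overlap`
    nucleotides of `est`. Whatever follows the overlap in `upstream` is
    discarded, as in the report, where the 3'-end of the extending cDNA
    was dropped.
    """
    est, upstream = est.upper(), upstream.upper()
    seed = est[:min_overlap]
    start = upstream.find(seed)
    if start < 0:
        raise ValueError("no exact overlap found between the two sequences")
    # Length of the result is start + len(est); in the report, 366 + 2729 = 3095 bp
    return upstream[:start] + est
```

Because only an exact seed match is accepted, sequencing errors near the 5'-end of the EST would make this sketch fail where a BLAST alignment would still succeed.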

[Figure 2: View on human chromosome n°11 and mouse chromosome n°9 of the Human cDNA AK092285, of the Mouse Contig 1 and of the Human Contig 1.]

1.4 5'-ends and confirmation of the predicted exons and conserved regions

We made 2 predictions using different sizes of mouse contig (cf. Methodology, Genomic sequence selection part 2.2 and Exons prediction part 3.1, and Supplement 3: Prediction 1 and Supplement 4: Prediction 2). We then attempted to confirm the predicted exons (cf. Methodology, Prediction analysis part 3.2).

We found some ESTs that correspond to the hits of Prediction 2 and of Human Contig 1 along mouse chromosome n°9. Hit 1 of the Human Contig 1 is confirmed by the mouse ESTs TC (678 bp) and TC (623 bp). Hit 1 of Prediction 2 and hit 2 of the Human Contig 1 coincide with a mouse 5'-end EST BB (650 bp). However, the associated 3'-end (BB305394) in the RIKEN database is oriented in the wrong direction. Hit 2 of Prediction 2 and hit 3 of the alignment of the Human Contig 1 along mouse chromosome n°9 correspond to the 5'-end EST BB (228 bp). Hit 3 of the alignment of the Human Contig 1 along mouse chromosome n°9 is also confirmed by the 5'-end EST TC (520 bp). All these EST sequences can be found by the search by identifier explained in part 1.1 of the Methodology.

View on mouse chromosome n°9

Figure 3 shows the positions of the different ESTs and their sizes. It also shows the sizes of the hits of Prediction 2 and of the Human Contig 1 on mouse chromosome n°9. The sizes that we use below for the hypothetical exons and conserved regions of the hypothetical mRNAs are shown in bold. The labels in gray are used in the following figures to provide information about the hits that are not considered to be hypothetical exons. The labels in green show the regions that are similar in the mouse and human genomes. The hits of the Human Precursor mRNA (XM_171492) along mouse chromosome n°9 are shown in black.

[Figure 3: View on mouse chromosome n°9 of the Prediction 2, of the Human Contig 1 and of the ESTs confirming the predicted exons and hits. Panels: Prediction 2; Human Contig 1; All Confirmations.]

1.5 Hypotheses F and C on the same mRNA

We can consider the F and C ESTs as belonging to the same mRNA, because the sizes estimated for these two ESTs are similar and the 2 sequences are located close to each other on mouse chromosome n°9.

1.5.1 Hypothesis FC1: 1 exon of 5 Kbp

The genomic sequence between the 5'-end position and the 3'-end of the C EST is 5 kb long. Under this hypothesis, F and C match the same mRNA, and this mRNA consists of only 1 exon of 5 Kbp. It could be the homologue of the Human cDNA (AK092285).

View on mouse chromosome n°9

In Figure 4, the hypothetical FC1 mRNA of 5 Kbp is shown in red. The hits of the Human Precursor mRNA (XM_171492) on mouse chromosome n°9 are shown in black. The regions of similarity between the C EST and the Human cDNA (AK092285) are shown in purple.

[Figure 4: View on mouse chromosome n°9 of the hypothetical mRNA FC1. Panel: Extended F and C ESTs.]

1.5.2 Hypothesis FC2: Promoter + 4 or 5 exons: >4 Kbp

We also consider the possibility that the FC mRNA contains several exons. Prediction 1 found a predicted promoter in the 5' direction of the F and C ESTs (cf. Supplement 3: Prediction 1). This could be the promoter of the FC2 mRNA. This FC2 mRNA seems to be homologous to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492), because the predicted exons all coincide with regions of similarity between this Human Precursor mRNA and mouse chromosome n°9 (cf. exons positions part 1.4 above).

If we consider the 4 exons of the hypothetical mRNA in Figure 5, we obtain 4159 bp. However, predicted exons are usually shorter than real exons, so we expect the real exons to be longer than predicted. We can also consider F and C as belonging to the same exon of 3890 bp. In that case, we have an mRNA of 4 exons of about 4440 bp. This size is quite similar to the estimated size (cf. Introduction).

View on mouse chromosome n°9

The hits of the Human Precursor mRNA (XM_171492) on mouse chromosome n°9 are shown in black. The hypothetical FC2 mRNA is shown in red.

[Figure 5: View on mouse chromosome n°9 of the hypothetical mRNA FC2. Panels: Extended F and C ESTs; Prediction 1.]

1.5.3 Hypothesis FC3: known 5'-end + 5 or 6 exons: >4.7 kbp

We can consider that the 5'-end of TC (520 bp) is the 5'-end of the FC3 mRNA and that this hypothetical mRNA is homologous to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492). We define an FC3 mRNA made up of 6 exons. The size of this hypothetical mRNA is 4907 bp (cf. Figure 6 below). We can also consider C and F as lying on the same exon of 3980 bp. In that case we have a hypothetical mRNA of 6 exons whose size is 5278 bp.

View on mouse chromosome 9

The hits of the Human Precursor mRNA on mouse chromosome 9 are shown in black. The size given for hit 3 of Prediction 2 is the size of hit 5 of Human Contig 1 on this region, labeled in green (cf. Sizes of similar regions, part 1.1.1). The hypothetical FC3 mRNA is shown in red.

Figure 6: View on mouse chromosome 9 of the hypothetical mRNA FC3. Labels: extended F and C ESTs; TC (520 bp) and BB (228 bp), exon 1 (5'-end of the FC3 mRNA); Prediction 2, hit 2 = 284 bp; hit 3, exon 2, 183 bp; hit 4, exon 3, 228 bp; hit of the 5'-end of the extended F EST, exon 4, 139 bp; hit of the F EST, exon 5, 2959 bp; hits of the C EST, exon 5 bis or exon 6 (3'-end of the FC3 mRNA), 650 bp.

1.5.4 Hypotheses F4 and C4 on different mRNAs: F4 = Promoter + 4 exons: >3.5 kbp; C4 = Known 5'-end + 5 exons: >3.5 kbp

We can consider the possibility that gene F is similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492), while C is homologous to the Human cDNA (AK092285). In that case, the Human cDNA (AK092285) sequence we have is not the complete mRNA of that human gene. It seems that the exons of the C gene and of its human homologue (AK092285) are interspersed with those of the Sodium Channel Beta-2 Subunit Precursor gene in both organisms.

View on mouse chromosome 9

We consider that all of the regions of similarity between human chromosome 11 and mouse chromosome 9 that do not belong to a Human Precursor exon constitute some part of the C4 mRNA. In Figure 7, the hypothetical C4 mRNA of 3566 bp, consisting of 5 exons, is shown in red. The 4 hypothetical exons of the F4 mRNA of 3509 bp are shown in pink. The promoter of the F4 gene is known. Since we estimated the size of the hypothetical exons only from the sizes of the ESTs or of the hits in the regions of similarity between human and mouse, we consider that the real exons of the C4 and F4 mRNAs are longer.

Figure 7: View on mouse chromosome 9 of the hypothetical mRNAs F4 and C4. Tracks: extended F and C; Prediction 2; All Confirmations; Human Contig 1. C4 labels: exon 1 (5'-end of the C mRNA), 1301 bp, followed by exons of 650 bp, 748 bp and 217 bp, and exon 5 (3'-end of the C mRNA), 650 bp. F4 labels: promoter of the F gene; exon 1 (5'-end of the F mRNA), 183 bp; exon 2, 228 bp; exon 3, 139 bp; exon 4 (3'-end of the F4 mRNA).
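The sizes quoted for the hypothetical F4 and C4 mRNAs are simply the sums of the exon (or hit) sizes shown on the figure. As a quick arithmetic check, a minimal Python sketch is given here. The assignment of each size to a given exon is our reading of the figure labels (the 2959 bp value is the size of the F EST hit), and the name `mrna_size` is purely illustrative.

```python
# Exon/hit sizes in bp, as read from the figure labels.
# Assumption: the last F4 exon is the F EST hit of 2959 bp.
HYPOTHESES = {
    "F4": [183, 228, 139, 2959],
    "C4": [1301, 650, 748, 217, 650],
}


def mrna_size(exon_sizes):
    """Return the mRNA length (bp) implied by a list of exon sizes."""
    return sum(exon_sizes)


for name, exons in HYPOTHESES.items():
    print(f"{name}: {len(exons)} exons, {mrna_size(exons)} bp")
```

Because the real exons are expected to be longer than the hits, these totals should be read as lower bounds rather than as predictions of the exact mRNA length.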

1.6 Primers designed

1.6.1 Hypotheses confirmation

The first step is to confirm whether or not F and C are part of the same mRNA. To test this, an RT-PCR experiment needs to be performed with the 3'-end of F as the 5' primer and the 5'-end of C as the 3' primer (cf. sequence F in Supplement 1: EST F; for sequence C, cf. Methodology, part 1.1). If there is amplification, hypothesis F4/C4 is false. If not, hypotheses FC1, FC2 and FC3 are false.

Hypothesis FC1: 1 exon of 5 kbp

Here, the extended F EST and the C EST are considered to belong to the same mRNA and to constitute 1 exon. As primers, we should therefore use the 5'-end of the EST BI (cf. sequence in Supplement 2: Extended F EST) and the 3'-end of the C EST (sequence C, cf. Methodology, part 1.1). If the size of the RT-PCR product is more than 4.5 kbp, hypothesis FC1 is confirmed.

Hypothesis FC2: Promoter + 4 or 5 exons: >4 kbp

The 5'-end is hit 1 of Prediction 1, so the 5' primer could be designed from a part of this predicted exon (cf. sequence positions 1 to 156 in Supplement 3: Prediction 1). The 3' primer will be the 3'-end of the C EST (sequence C, cf. Methodology, part 1.1). If the size of the RT-PCR product is around 4.5 kbp, hypothesis FC2 is confirmed.

Hypothesis FC3: known 5'-end + 5 or 6 exons: >4.7 kbp

The 5'-end is the EST TC504903, so we can take its sequence to design the 5' primer. The 3' primer is still the 3'-end of the C EST (sequences, cf. Methodology, part 1.1). If the size of the RT-PCR product is around 4.5 kbp, hypothesis FC3 is confirmed.

Hypotheses F4 and C4 on different mRNAs: F4 = Promoter + 4 exons: >3.5 kbp; C4 = Known 5'-end + 5 exons: >3.5 kbp

The EST TC is considered to be the 5'-end of the C mRNA. Its sequence can be used to design the 5' primer, with C as the 3' primer of the RT-PCR. If the size of the RT-PCR product is around 4.5 kbp, hypothesis C4 is confirmed.

The 5'-end of the F4 mRNA is the first hit of Prediction 1. This exon can therefore be used as the 5' primer, and the 3' primer should be the 5'-end of the EST F (5' primer: cf. sequence positions 1 to 156 in Supplement 3: Prediction 1; EST F sequence: cf. Supplement 1: EST F). If the RT-PCR product is about 2 kbp, hypothesis F4 is confirmed.

1.6.2 Hypothetical exons confirmation

If none of the previous RT-PCR experiments reveals the 5'-end of the mRNA(s), we can try the different ESTs TC and BB as 5' primers. We should then have an idea of the exon composition of the mRNA(s), and so be able to confirm the existence (or not) of most of the hypothetical exons. It is possible, however, that the 5'-end will still not have been found. In that case the 5'-end has been neither predicted nor sequenced yet, or may lie further in the 5' direction along mouse chromosome 9. More predictions on a larger genomic sequence, or a walk along mouse chromosome 9, will then be required to find the 5'-end.
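The decision rules of this part can be summarised in a short Python sketch: a first bridging RT-PCR separates the "same mRNA" hypotheses from F4/C4, and the product size of each follow-up RT-PCR is then compared with the size quoted in the text. The function names and the tolerance used to interpret "around" (0.5 kbp) are assumptions made for illustration; they are not part of the protocol.

```python
# Product sizes (kbp) quoted in the text for each follow-up RT-PCR.
# at_least=True encodes "more than"; False encodes "around".
EXPECTED = {
    "FC1": {"size_kbp": 4.5, "at_least": True},
    "FC2": {"size_kbp": 4.5, "at_least": False},
    "FC3": {"size_kbp": 4.5, "at_least": False},
    "C4": {"size_kbp": 4.5, "at_least": False},
    "F4": {"size_kbp": 2.0, "at_least": False},
}


def bridging_test(amplified):
    """Hypotheses that survive the F (5' primer) / C (3' primer) RT-PCR."""
    if amplified:
        # F and C lie on the same mRNA, so F4/C4 is ruled out.
        return ["FC1", "FC2", "FC3"]
    # No product: F and C are on different mRNAs.
    return ["F4", "C4"]


def size_supports(hypothesis, observed_kbp, tolerance_kbp=0.5):
    """True if the observed product size is compatible with the hypothesis.

    tolerance_kbp is an assumed reading of "around", not a value from the text.
    """
    expected = EXPECTED[hypothesis]
    if expected["at_least"]:
        return observed_kbp > expected["size_kbp"]
    return abs(observed_kbp - expected["size_kbp"]) <= tolerance_kbp


if __name__ == "__main__":
    print(bridging_test(amplified=True))
    print(size_supports("F4", observed_kbp=2.1))
```

Note that FC2 and FC3 share the same expected size, so product size alone cannot separate them; they are distinguished by which 5' primer gives a product.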

2 EST B: TC

The B EST is 1489 bp long. The size of the B mRNA estimated by Northern blot is about 6 kbp. This sequence is located on mouse chromosome 2 at position. This EST has been shown to be down-regulated in some HD mouse models.

2.1 Homologous to the human protein phosphatase 1, regulatory subunit 16B (PPP1R16B) (XM_028840)

The predicted mouse mRNA XM_ (949 bp) is similar to the B EST. Figure 8 shows that it aligns by BLAST with the 3'-end of the human protein phosphatase 1, regulatory subunit 16B (PPP1R16B) (XM_028840) (6162 bp). The size of this human mRNA is similar to the expected size of the B mRNA, so the B gene we are looking for appears to be the homologue of this human gene.

View on human chromosome 20

Figure 8: View on human gene XM_ of the mouse mRNA XM_. Labels: mouse mRNA XM_; human mRNA XM_; exons 1 to 8, nucleotides 1 to 1246 = 1246 bp; exon 9, nucleotides 1247 to 6113 = 4866 bp.

2.2 Hypothesis B: TC laps+B EST = 6412 bp

The sequence TC (2279 bp) is the mRNA of the mouse protein phosphatase 1 regulatory subunit 16B. This mRNA is incomplete, because the part recognized by the EST B and by the sequence TC is not present in this sequence. However, the 5'-end of the mRNA sequence TC is the real 5'-end of the mouse phosphatase 1 regulatory subunit 16B, because we found the similar 5'-end EST sequence BB (657 bp). We note, however, that the total TC mRNA (2279 bp) + B EST (1489 bp) = 3768 bp, which is shorter than the estimated size.

On the view of human chromosome 20 (cf. Figure 8 above), the complete Human XM_ gene can be observed. We reported the positions of the 9th exon along the mRNA sequence, and its size. We did the same (cf. Figure 9 below) for the 9th exon of the mouse phosphatase mRNA. We note that the first 8 exons contain almost the same number of nucleotides in the two homologues. We also note that the mouse genomic sequence between the 5'-end of the 9th mouse exon and the 3'-end of the B EST is of a similar size to the 9th exon of the human mRNA (>4.8 kbp, cf. position in Figure 9 below). Thus the hypothetical mRNA of 6412 bp was defined to consist of the first 8 exons of the mouse mRNA XM_ together with a 9th exon of 4985 bp (in red in Figure 9 below).
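The size argument of this part can be checked with a few lines of Python. This is only an arithmetic sketch using the lengths quoted in the report; the 6000 bp value is the approximate Northern blot estimate, and the variable names are illustrative.

```python
# Lengths in bp, as quoted in the report.
TC_MRNA_BP = 2279              # known (incomplete) mouse phosphatase mRNA
B_EST_BP = 1489                # B EST
EXONS_1_TO_8_BP = 1427         # first 8 exons of the mouse mRNA
HYPOTHETICAL_EXON_9_BP = 4985  # proposed complete 9th exon
NORTHERN_ESTIMATE_BP = 6000    # approximate size seen by Northern blot

# Simply joining the known mRNA and the B EST falls well short of ~6 kbp.
concatenated_bp = TC_MRNA_BP + B_EST_BP
shortfall_bp = NORTHERN_ESTIMATE_BP - concatenated_bp

# The hypothetical mRNA: exons 1 to 8 plus one long 9th exon.
hypothetical_bp = EXONS_1_TO_8_BP + HYPOTHETICAL_EXON_9_BP

print(f"TC mRNA + B EST: {concatenated_bp} bp (about {shortfall_bp} bp short)")
print(f"Exons 1-8 + hypothetical exon 9: {hypothetical_bp} bp")
```

The gap of more than 2 kbp between the concatenated sequences and the Northern blot estimate is what motivates the proposal of a single long 9th exon spanning both the end of the known mRNA and the B EST.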

View on mouse chromosome 2

Figure 9: View on mouse chromosome 2 of the hypothetical B mRNA. Labels: EST B; TC; exons 1 to 8 of the B gene, nucleotides 1 to 1427 = 1427 bp; exon 9, nucleotides 1424 to 2255 = 833 bp; hypothetical exon 9 of the B gene = 4985 bp.

2.3 B primer

To confirm this hypothesis it is enough to use the 3'-end of the mouse mRNA TC (to obtain the sequence, cf. Methodology, part 1.1.2). The 5' primer should be the 5' part of the sequence between nucleotides 1424 and. The 3' primer should be the 5'-end of the B EST. If the PCR product is between 2.5 and 3 kb, our hypothesis is confirmed and we will have found the complete B mRNA we are looking for.
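When a primer pair is taken from two regions of the same strand, as proposed here, the 3' primer has to be the reverse complement of the chosen region so that it anneals to the mRNA-sense strand. The following Python sketch illustrates this with toy sequences; the sequences, the function names and the primer length are hypothetical and are not taken from the B mRNA or the B EST.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_pair(upstream, downstream, length=20):
    """Pick a naive primer pair from two regions of the same strand.

    The 5' primer is the first `length` bases of the upstream region.
    The 3' primer is the reverse complement of the first `length` bases
    of the downstream region. No melting temperature or specificity
    check is done; `length` is an assumed value.
    """
    forward = upstream[:length]
    reverse = reverse_complement(downstream[:length])
    return forward, reverse


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    upstream = "ATGGCCATTGTAATGGGCCGCTGAAAGG"
    downstream = "TTTACGGATCCGATTACAGGCATGCAAT"
    print(primer_pair(upstream, downstream, length=8))
```

In practice the candidate primers would still need to be checked for melting temperature, self-complementarity and uniqueness in the mouse genome before ordering.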

Acknowledgments

First, I would like to thank Dr. Nobuyuki Nukina for his invitation to work in his Laboratory of Structural Neuropathology. He gave me the wonderful opportunity to come to work in Japan, and especially in the prestigious Brain Science Institute of RIKEN. Throughout the internship, he was always available to discuss my results and hypotheses.

I also thank Dr. Fumitaka Oyama for all the explanations he provided about my data. Each time I had a problem with my results, he was available to help me solve it. He also gave me the guidance I needed to organize my work during the 2 months of my internship.

Thanks also to the secretary of the Structural Neuropathology group, Miss Harumi Taniguchi, who was always ready to provide immediate help in solving the many technical problems my colleague and I had during the training period. I thank in particular this colleague, Katrin Lindenberg, for our discussions about our results and for our many expeditions of discovery and shopping in Tokyo.

I also thank the whole team of the Structural Neuropathology Lab for its kindness. I particularly thank David Chapmon for his help in finding medical care, for his translation during the consultation and for correcting my English pronunciation.

I thank all the summer students and the many foreign researchers at RIKEN for having helped me enjoy my time in Japan by showing me the entertaining parts of Tokyo and by advising me about the Japanese way of life.

I also thank Jean-Michel Fayard, Guillaume Beslon, Hedi Soula and all my teachers for the help and advice they gave me before my departure and during the internship.

28 F EST >EST F: 2413 bp (sequenced in the lab, not submitted to GenBank yet)+ 316 bp of TC454157= 2729 bp TCTCTCCCCAGCCAGGGCTTCCTAGGGACAAGGGTTGGTTGACTGGGGGAGGAAGCCTACAGG AGATTGAAGACAGGGAAGGGAGGGGCTGGAGTGGTGTGGAAGGTTGGTTCCCGGATCCTGGGC ACGTGGGGTCTCCTTTAGATTTTCCCCTCTGTGAAGCCTTGTTTTCTCCTCAGTTTTCCTTCTGAT CTTTCACCAGGAAATCGGGGTGACCAGTGAGGGCTGCTTCCAAAGCTGGGGTTTGGAGATGGGT AGAGGGTGACCGCTTCAGAAGCTGGGAATGCACAAGAAGTCTAGAATGGTGTCTTCTGGGGGGG GGGGCAGTTGTGAGAGGCAAGCTGGGCTCTGAAGAATATCAGGCTTCTGGAAGTTCCTTTAGAG AGGACTTCTCTTTCCCTTACCCTAGAACACCTGCCCACACTGTCCTGGCTCCCCGACCAGCCTCC TCCTGCTGCCTGCCTAGTCTGTCTTTGCTCTCTGGGCTGCAGCTGCTGAGGAGGCTTGTGGGGA GGGGGCAGCCTCCACTCTCCTGGAGCACTGGGGTGCTATTTGCAGCTATACTGGCTTTGCTCTTT GGGTTTCAGAGGCAGGAGAACAGTGCCCCTGGTCTCCTAGCCTTTGGAATGTCTACCCCAGCCC TACAAGACTGACAGCCCTTGTCCTTGGCATGGCAGGACCATGCCACCCTGGCACTTCCGGAGCT CAGTTTTTCACTCTTCTTCCCTTCCCTTGAAACAGCTGGCATTGCCACCTTCCCTGAGGGATGCTT TCCTAGGACTTGTCATCTCATACCTTTGCTCCTTCTGTGTCCATCCAGCATGCCTGGCCTTCCCCT GCTCCTGGCCCCCCAGCTCTGGGTCTGCCTTTGCCTCAGGGACCCTTGTTTCCAGATGAGAAGG CCCTTGGCTTTTCCAGCTTCTTTTTTGCCCAGCTGGGCTGACTCCTCGCCTAGCCTGAGGCTGAG GAGGAGCTGGGAGAAGGTACTCACACCTTCTCTTGACTTCTGGCAGAGCCGGCTTGCACACCCC CTGAGTGTGGGGCTAGATTGTGCCTTAGTTCCTCGAGTCCTGGTTCTGAGCCCCTTTTCTTTCGG CTCACACTCCCTGAATTAATTGCACAGCTTGGTGTGACTTTGGCGGGGCTCCCCAGCTCCTTACC CCAAAGCCATGGAAGAGACCATGAAGCCGGGGTTGGTGGCAACCTTGATGACACCTGAGGGCA CCCTTTCTTGTCCCTGACATGGAGATAGGATGGCATTTGATGTGGGACCTTCAGATGGGTTTGAC CGTGTACAAACCGTAGTGCTAGCTAGGGTTTCTGTGATGTATGAAATGGGATACCCAAAGTCCCT CTTCCTCATCAGATTTCTGATACCCTTAATGTCAGAAGATGGAGATTAGTCCTCTTTTCAGGGGGG TGTAAGGACTGCTACAGGCTCTGCCCAGGAGTAGCTGAAGGTTCCCCCCCCAAATGGAAGTTGG GGGAGACTAAGGCACAGTAGGATCTGTAGGTGACTGTGGCTTTGGCTAGTGTCTGTTGCCCAAG CCAAGGGGCTCTTGGGGTTGCCTCTACTCTTCCCATTCTTCTTTACCCAGAACTCATTGTGAGCT GGGTAAAAATTGCCCATCTCCTGCTTTTTAAATATTTATTTGAGCAGAGTCTCATGTGTGGCCCAG GCGGGCCTCCACCTCTCTATGTAGCCAAGACTGGCCTTGAACTCCCAATCTCCTGCCTCCATTGC CACAGTGCTGGTATGACAGGTGTGAGCCCACACCCTGCTTAGAGTAACCTTGCTCTGAGAACCAA 
CATGGCACCCGAGCCTCCAGCCATTCAGGAAACTTCCAGCTGCCTTCATGTAAAACTGCTTTCTC CCCCAACACTGGAAGAGGCCAAGTGTTGGGGGTTCTTCTTGCTTTCCTGAGAGGAAGCCAAGGC ATAGAGCAGAAGAGAGGGAGGGACTCTCCCTTCCCAGCTTCCTGCTCATTGTCAGCTTATAGGCA GCCCTTGCAGCTTCTCCCATCTACCCAAAGGGTGAAATAATACCTACCTCACAGGACTGCAGTGA GGCTTGGTGAGATTTTTGTGTTTTTTGTTTTTTTGGCCTGGCTTGGAAAGGCACTGGGAAACAAG GCTAATAACCAGCGAGAATGTTCCACATCTATCCTGTCCTCATCTCTGGTTTGCATCCCAATAATA TGCATATGCCTCATTCTTCTTCCTTTAGCAACCTTAGGCATCATGACTCAGATGCTTAAAGCATCTT TGTCCCCGGTTCTTTTTTTTTTTTTTTTTTTTTTTTGATGGAGGTACCTGGGACTATGGGAGTACTT TTTTATATTGTTGTTGCCCCAATGCCTGTGATAAATACTAGCGTTTAATGGATAGGGATTAAGAGC ACAAATCTCAGTCC TCTTAACAAAGAATGTCTGGCCTAGTGCTAGCGGCATGCCTGTGCAGGCATTACCACGGATTGTG TTAGAATGTATATTTGCAAAGCCATTTTCTCTAGCCAGACCCTCTGACAGGCAAGTCTTCAAATAG CGATCTCAGGGTTGCTGAGGTTGGTCCCGGTGCCAGTGGGCTACAGCACCTCTCATACGGTTGA CTTTGGGGAAACCTGGACCCATGCAGTTGTGTTGACCTTGATGTCAGTGAGACCAAAGACAAAGC ACAAGTACCTTACTCTTGACTTCCAAATAAACTTCTGCCCTTGAGGGCTCAGAAAA Supplement 1

29 Extended F EST >Extended F EST : 366 bp of 5 -end of BI bp (sequenced in the lab, not submitted to GenBank yet)+ 316 bp of TC454157= 3095 bp ATTGGAAAAAGTGGACAACACGGTGACTCTCATCATCCTGGCTGTGGTGGGCGGGGTCAT TGGACTTCTTGTGTGCATCCTTCTGCTGAAGAAGCTCATCACCTTCATCCTGAAGAAGAC CCGAGAGAAGAAGAAGGAGTGTCTCGATGAGTTCCTCTGGGAATGACAACACAGAGAACG GGTTGCCTGGCTCCAAGGCAGAAGAGAAGCCACCCACAAAAGTGTGAGGCCCTGCTCGGGCCAAGCAGGG CAGGGAGCCTCGCTTTCTGATGGTGATCCTGATGCCAAGTCCTATCTGAG ATGTGTGCTGCTTGGCCCAAACTGTTCTTTCTGAGCAGGAAGGACCTGGCCCTGCCCAGC TGCCGT TCTCTCCCCAGCCAGGGCTTCCTAGGGACAAGGGTTGGTTGACTGGGGGAGGAAGCCTACAGGAGATTGAA GACAGGGAAGGGAGGGGCTGGAGTGGTGTGGAAGGTTGGTTCCCGGATCCTGGGCACGTGGGGTCTCCTT TAGATTTTCCCCTCTGTGAAGCCTTGTTTTCTCCTCAGTTTTCCTTCTGATCTTTCACCAGGAAATCGGGGTGA CCAGTGAGGGCTGCTTCCAAAGCTGGGGTTTGGAGATGGGTAGAGGGTGACCGCTTCAGAAGCTGGGAATG CACAAGAAGTCTAGAATGGTGTCTTCTGGGGGGGGGGGCAGTTGTGAGAGGCAAGCTGGGCTCTGAAGAAT ATCAGGCTTCTGGAAGTTCCTTTAGAGAGGACTTCTCTTTCCCTTACCCTAGAACACCTGCCCACACTGTCCT GGCTCCCCGACCAGCCTCCTCCTGCTGCCTGCCTAGTCTGTCTTTGCTCTCTGGGCTGCAGCTGCTGAGGA GGCTTGTGGGGAGGGGGCAGCCTCCACTCTCCTGGAGCACTGGGGTGCTATTTGCAGCTATACTGGCTTTG CTCTTTGGGTTTCAGAGGCAGGAGAACAGTGCCCCTGGTCTCCTAGCCTTTGGAATGTCTACCCCAGCCCTA CAAGACTGACAGCCCTTGTCCTTGGCATGGCAGGACCATGCCACCCTGGCACTTCCGGAGCTCAGTTTTTCA CTCTTCTTCCCTTCCCTTGAAACAGCTGGCATTGCCACCTTCCCTGAGGGATGCTTTCCTAGGACTTGTCATC TCATACCTTTGCTCCTTCTGTGTCCATCCAGCATGCCTGGCCTTCCCCTGCTCCTGGCCCCCCAGCTCTGGG TCTGCCTTTGCCTCAGGGACCCTTGTTTCCAGATGAGAAGGCCCTTGGCTTTTCCAGCTTCTTTTTTGCCCAG CTGGGCTGACTCCTCGCCTAGCCTGAGGCTGAGGAGGAGCTGGGAGAAGGTACTCACACCTTCTCTTGACT TCTGGCAGAGCCGGCTTGCACACCCCCTGAGTGTGGGGCTAGATTGTGCCTTAGTTCCTCGAGTCCTGGTT CTGAGCCCCTTTTCTTTCGGCTCACACTCCCTGAATTAATTGCACAGCTTGGTGTGACTTTGGCGGGGCTCC CCAGCTCCTTACCCCAAAGCCATGGAAGAGACCATGAAGCCGGGGTTGGTGGCAACCTTGATGACACCTGA GGGCACCCTTTCTTGTCCCTGACATGGAGATAGGATGGCATTTGATGTGGGACCTTCAGATGGGTTTGACCG TGTACAAACCGTAGTGCTAGCTAGGGTTTCTGTGATGTATGAAATGGGATACCCAAAGTCCCTCTTCCTCATC AGATTTCTGATACCCTTAATGTCAGAAGATGGAGATTAGTCCTCTTTTCAGGGGGGTGTAAGGACTGCTACAG 
GCTCTGCCCAGGAGTAGCTGAAGGTTCCCCCCCCAAATGGAAGTTGGGGGAGACTAAGGCACAGTAGGATC TGTAGGTGACTGTGGCTTTGGCTAGTGTCTGTTGCCCAAGCCAAGGGGCTCTTGGGGTTGCCTCTACTCTTC CCATTCTTCTTTACCCAGAACTCATTGTGAGCTGGGTAAAAATTGCCCATCTCCTGCTTTTTAAATATTTATTTG AGCAGAGTCTCATGTGTGGCCCAGGCGGGCCTCCACCTCTCTATGTAGCCAAGACTGGCCTTGAACTCCCA ATCTCCTGCCTCCATTGCCACAGTGCTGGTATGACAGGTGTGAGCCCACACCCTGCTTAGAGTAACCTTGCT CTGAGAACCAACATGGCACCCGAGCCTCCAGCCATTCAGGAAACTTCCAGCTGCCTTCATGTAAAACTGCTT TCTCCCCCAACACTGGAAGAGGCCAAGTGTTGGGGGTTCTTCTTGCTTTCCTGAGAGGAAGCCAAGGCATAG AGCAGAAGAGAGGGAGGGACTCTCCCTTCCCAGCTTCCTGCTCATTGTCAGCTTATAGGCAGCCCTTGCAG CTTCTCCCATCTACCCAAAGGGTGAAATAATACCTACCTCACAGGACTGCAGTGAGGCTTGGTGAGATTTTTG TGTTTTTTGTTTTTTTGGCCTGGCTTGGAAAGGCACTGGGAAACAAGGCTAATAACCAGCGAGAATGTTCCAC ATCTATCCTGTCCTCATCTCTGGTTTGCATCCCAATAATATGCATATGCCTCATTCTTCTTCCTTTAGCAACCTT AGGCATCATGACTCAGATGCTTAAAGCATCTTTGTCCCCGGTTCTTTTTTTTTTTTTTTTTTTTTTTTGATGGAG GTACCTGGGACTATGGGAGTACTTTTTTATATTGTTGTTGCCCCAATGCCTGTGATAAATACTAGCGTTTAATG GATAGGGATTAAGAGCACAAATCTCAGTCC TCTTAACAAAGAATGTCTGGCCTAGTGCTAGCGGCATGCCTGTGCAGGCATTACCACGGATTGTGTTAGAAT GTATATTTGCAAAGCCATTTTCTCTAGCCAGACCCTCTGACAGGCAAGTCTTCAAATAGCGATCTCAGGGTTG CTGAGGTTGGTCCCGGTGCCAGTGGGCTACAGCACCTCTCATACGGTTGACTTTGGGGAAACCTGGACCCA TGCAGTTGTGTTGACCTTGATGTCAGTGAGACCAAAGACAAAGCACAAGTACCTTACTCTTGACTTCCAAATA AACTTCTGCCCTTGAGGGCTCAGAAAA Supplement 2
