Method to assign the coding regions of ESTs

Céline Becquet
Summer Program 2002
Structural Neuropathology Lab, Molecular Neuropathology Group, RIKEN Brain Science Institute
Host: Dr. Nobuyuki Nukina
Tutor: Dr. Fumitaka Oyama

Abstract

The present study investigates deregulated genes in Huntington disease. This disease is an autosomal dominant neurodegenerative disorder caused by polyglutamine expansion in the disease protein, huntingtin. A GeneChip experiment was previously performed to compare mRNAs extracted from cerebrum cells of wild-type (WT) mice and of HD mouse models expressing the pathological form of the huntingtin protein. We identified several ESTs that may be involved in the pathogenesis of Huntington disease. To find more information about these ESTs, we developed a bioinformatical method which allows the 5'-ends of the ESTs to be found and the hypothetical coding sequences (CDSs) and exons of the mRNAs corresponding to the ESTs to be predicted. Using this method, we were able to show homology between the mouse ESTs TC473206 and TC454157 and the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492) and a Human cDNA (AK092285). We built several hypotheses about these two mouse mRNAs, which remain to be confirmed. We also showed that the mouse protein phosphatase (EST TC478977) is an incomplete mRNA sequence. The 9th exon of its mRNA is the 5'-end of the complete 9th exon, whose 3'-end is the mouse EST TC492434. Confirmation of our hypotheses is now in progress.

Contents

Abstract
Contents
Introduction
Bioinformatical Methodology
1 Blast search using TIGR or NCBI databases
1.1 TIGR Database
1.1.1 Making a search
1.1.2 Mouse Gene Index Report
1.2 NCBI Database
1.2.1 Non-redundant database
1.2.2 EST-Mouse database
2 Chromosome location and genomic sequence selection
2.1 Chromosome location
2.2 Genomic sequence selection
2.2.1 Selection process
2.2.2 Use of the genomic sequence
3 Exons prediction using genomic sequence
3.1 Process of prediction
3.2 Prediction analysis
4 Search in RIKEN 5'-ends sequences database
4.1 Process of the search
4.2 Which sequence for which result
5 Hypothetical mRNA, primers
5.1 Mouse mRNA
5.2 Hypothetical mRNA
6 Confirmation by RT-PCR
Results & Hypotheses
1 ESTs F and C
1.1 Description
1.1.1 EST F: TC454157
1.1.2 EST C: TC473206
1.2 Extension in 5' direction
1.3 Homology
1.3.1 Homology to a mRNA similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492)
1.3.2 Homology to Human cDNA (AK092285)
1.3.3 Views on the two genomes (Figure 2)
1.4 5'-ends and Confirmation of the predicted exons and conserved regions
1.5 Hypotheses F and C on the same mRNA
1.5.1 Hypothesis FC1: 1 exon of 5 Kbp
1.5.2 Hypothesis FC2: Promoter + 4 or 5 exons: >4 Kbp
1.5.3 Hypothesis FC3: known 5'-end + 5 or 6 exons: >4.7 Kbp
1.5.4 Hypotheses F4 and C4 on different mRNAs: F4 = Promoter + 4 exons: >3.5 Kbp; C4 = Known 5'-end + 5 exons: >3.5 Kbp
1.6 Primers designed
1.6.1 Hypotheses confirmation
1.6.2 Hypothetical Exons confirmation
2 EST B: TC492434
2.1 Homologous to the human protein phosphatase 1, regulatory subunit 16B (PPP1R16B) (XM_028840)
2.2 Hypothesis B: TC478977 + laps + B EST = 6412 bp
2.3 B primer
Conclusion
Acknowledgments
Supplements
1- F EST
2- Extended F EST
3- Prediction 1
4- Prediction 2

Introduction

Through the Summer Program 2002, the RIKEN Brain Science Institute gave me the opportunity to work in the Structural Neuropathology laboratory directed by Dr. Nukina. In this laboratory, researchers are attempting to identify the genes that are deregulated in Huntington disease. This disease is an autosomal dominant neurodegenerative disorder caused by polyglutamine expansion in the disease protein, huntingtin. Some of the genes of interest are novel and of unknown function, so much of the researchers' effort goes into finding information about them.

My supervisor, Dr. Oyama, performed a GeneChip experiment to compare the mRNAs extracted from cerebrum cells of wild-type (WT) mice and of HD mouse models. The HD model mice express the pathological form of the huntingtin protein. On a GeneChip, many of the nucleotide targets are EST sequences. An EST is a part of an mRNA whose complete sequence, function and constitution are unknown. As a result, these target sequences hybridize with mRNA probes about which we have almost no information.

The results of the GeneChip analysis show that some ESTs are strongly down-regulated in the HD mice. The down-regulation of these genes was confirmed by northern blotting of the transcriptome of cerebrum cells from WT and HD mice. The northern blots also allowed the size of the mRNAs corresponding to the ESTs of interest to be estimated. Those ESTs whose down-regulation was confirmed by in situ hybridization in different HD mouse models may be involved in the pathogenesis of Huntington disease.

The first step in the investigation of these ESTs is the amplification of the corresponding mRNAs by reverse transcriptase PCR (RT-PCR). To do so, we need the 5'-end sequence of the mRNAs. However, this information is unknown for the ESTs. It is in this context that I was asked to develop a bioinformatical method to find the 5'-ends of the ESTs implicated in Huntington disease. This method uses only tools that are freely available on the Internet. It allows the hypothetical coding sequences (CDSs) and exons of the mRNAs corresponding to these ESTs to be predicted.

In the following report, I will explain my methodology. I will then present the results and hypotheses I made about 3 ESTs which are particularly interesting because of their strong and confirmed down-regulation in different HD mouse models.

Bioinformatical Methodology

Using the GeneChip, the mouse mRNAs corresponding to the ESTs of interest could be identified. It is necessary to find the 5'-end of these mRNAs and to determine their constitution in terms of exons and introns. In addition, some idea of the function of these genes would be invaluable. The methodology described below sets out the successive stages in the search for information on the ESTs.

1 Blast search using TIGR or NCBI databases

To begin any work we need the EST sequence. The GeneChip manufacturer provides the TC number associated with the nucleotide target for each spot of the GeneChip.

1.1 TIGR Database

To find the information about a TC number, we search the database of The Institute for Genomic Research (TIGR), available on the web site http://www.tigr.org/.

1.1.1 Making a search

To search the TIGR database with a TC number, we go to the Gene Indices page (http://www.tigr.org/tdb/tgi/).

BLAST algorithm

On the Gene Indices page, the link BLAST search displays a Query page (http://tigrblast.tigr.org/tgi/). On that Query page, we can choose to work on the mouse database only or on both the human and mouse databases. The BLAST algorithm finds nucleotide sequences in the TIGR databases that have some similarity with the input sequence. For each similar sequence, a link provides the Mouse Gene Index Report (MGI Report, cf. part 1.1.2 below). This algorithm will also be useful later to confirm the predicted exons (cf. Exons prediction, part 3 below).

Search index by identifier

Some links allow the genome of a specific organism to be selected. The Mouse link opens the Mouse Gene Index page (http://www.tigr.org/tdb/tgi/mgi/). This page provides several tools to search the TIGR mouse databases using the sequence of the ESTs. Selecting the link Search Index by Identifier (TC, ET, EST, GB) displays the MGI Report page (http://www.tigr.org/tdb/tgi/mgi/searching/reports.html), where it is possible to enter the TC number of the GeneChip target and retrieve the EST information corresponding to this identifier.

1.1.2 Mouse Gene Index Report

The MGI Report gives the mRNA sequence corresponding to the TC number. The size of the sequence and the predicted Open Reading Frames are provided. There are also alignments of all the ESTs that match this mRNA sequence. For each of these ESTs, a link leads to its sequence and ID numbers.

For some ESTs, the MGI Report displays the link Expression summary. This page provides information on the expression, in different cell types, of the mRNA matched by the EST.

For some TC numbers, the MGI Report also provides the Tentative Ortholog Group. This link leads to a page where similar ESTs from different organisms are aligned. Thus, it is possible to find genes homologous to the query mRNA in other organisms.

For most of the TC numbers, the position of the EST on the mouse genome is provided in the MGI Report. A link leads to the Genomic Context of the EST's sequence. Here, the alignment of some homologous ESTs or genes in the genomes of other organisms can be found.

1.2 NCBI Database

The web site of the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/) provides all the tools necessary to search its databases. We can find information about any biological entity in the different available databases. To do the search, we need to select the database (PubMed, Protein, Nucleotide, Structure, Genome, ...). It is possible to search with an ID number, a keyword, or even an author name.

The BLAST link (http://www.ncbi.nlm.nih.gov/blast/) displays tools for aligning a protein or nucleotide sequence with the sequences of the available databases (RefSeq models, GenBank, EMBL, ...). In the method, we often use the EST-Mouse and the Non-redundant databases. To search these two databases, it is enough to select the link Standard nucleotide-nucleotide BLAST [blastn] (http://www.ncbi.nlm.nih.gov/blast/blast.cgi), which displays the Query page where the query sequence can be entered. We then select the database we want to work with in the corresponding field of the Query page.

1.2.1 Non-redundant database

A search in the non-redundant database finds all of the sequences similar to the query sequence. With the default options, up to 100 sequences are displayed, irrespective of the score of the alignment between the 2 sequences. It is important to be critical about the results provided at this point. When 2 "similar" sequences longer than 200 nucleotides have only one hit and the identity covers fewer than 25 nucleotides, the term similar means almost nothing. All we can say is that a small region is similar; for example, a profile may be conserved between the two sequences.
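The rule of thumb above can be written down as a small filter. The following is only a sketch: the `Hit` record, the function name and the way a BLAST result is represented are hypothetical choices made for illustration, and the two thresholds are simply the figures quoted in the text (200 nucleotides and 25 identities), not values prescribed by BLAST.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One aligned segment between the query and a database sequence (hypothetical record)."""
    aligned_length: int  # length of the aligned region, in nucleotides
    identities: int      # number of identical positions in that region


# Thresholds quoted in the text; they are a rule of thumb, not BLAST parameters.
MIN_SEQUENCE_LENGTH = 200
MIN_IDENTITIES = 25


def is_weak_similarity(query_length: int, subject_length: int, hits: List[Hit]) -> bool:
    """Return True when 'similar' only reflects a single short local match."""
    both_long = (query_length > MIN_SEQUENCE_LENGTH
                 and subject_length > MIN_SEQUENCE_LENGTH)
    single_short_hit = len(hits) == 1 and hits[0].identities < MIN_IDENTITIES
    return both_long and single_short_hit


# Example with invented alignment figures for two long sequences.
print(is_weak_similarity(568, 2766, [Hit(aligned_length=22, identities=21)]))
print(is_weak_similarity(568, 2766, [Hit(aligned_length=300, identities=280)]))
```

A result flagged by such a filter is not necessarily useless (a short conserved region can still be informative), but it should not be read as evidence of homology between the whole sequences.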

Moreover, the identifiers change depending on the type of sequence found. Caution is especially necessary with XM sequences (mRNA sequences produced by the NCBI's Genome Annotation Project). These represent the known or potential transcripts of a gene. Such a sequence is not as reliable as a sequence annotated by experiment.

To find a human homologous sequence, or to find the mRNA sequence corresponding to an EST sequence, we have to study the first, best-scored sequences that BLAST proposes. For each of the similar sequences found, a link provides the Information Sheet of the sequence. There we can see whether the sequence has been predicted or is an experimentally determined mRNA, and whether the provided sequence covers only the coding regions or the complete mRNA.

1.2.2 EST-Mouse database

If the search in the non-redundant database does not provide the complete mRNA corresponding to the EST sequence, we have to find mouse ESTs that could extend the sequence we have.

We can blast the EST sequence directly against the EST-Mouse database. In that case, however, we usually find only ESTs shorter than the sequence we already have. This is because the sequence provided by the Mouse Gene Index Report is a merge of all the ESTs that match the same part of an mRNA. Other ESTs may nevertheless match other parts of the mRNA. Blasting the EST sequence against the EST-Mouse database may therefore occasionally turn up a sequence that shares a short similar region with our EST sequence and that can extend it in one direction or the other.

Generally, we use this database by blasting a part of the genomic sequence near the position of our EST of interest (cf. Genomic sequence selection, part 2.2 below). If we find ESTs similar to these genomic regions, we can assume that these regions belong to the mRNA we are looking for.

We can also blast a predicted mRNA against this database (cf. Exons prediction, part 3 below). If we find similar EST sequences, we can have confidence in the predicted exons that fit these sequences.

2 Chromosome location and genomic sequence selection

The searches above can generate a large number of similar sequences. It is informative to see their positions relative to the original EST sequence along the mouse chromosome. It is also informative to find the positions of homologous genes or of conserved regions in another organism. We are mostly interested in the human homologue because the human genome is well annotated.

2.1 Chromosome location

To do so, we blast the different sequences against the Human Genome and Mouse Genome databases that the NCBI web site provides.

If the sequence we blast is shorter than 100 nucleotides, it is better to change the default options (Expect at 10, and filter at none). The BLAST result page displays the alignments and the positions along the contigs in which the similar regions are located. For each alignment, a link leads to a Genome View displaying the hits of the query sequence along the genome of the organism. By selecting the chromosome of interest on the Genome View, it is possible to see the alignment of the similar regions on this chromosome. On this Chromosome View, the positions given are relative to the whole chromosome, and therefore differ from the positions on the contig given by the BLAST result page.

2.2 Genomic sequence selection

2.2.1 Selection process

On the Chromosome View, two fields allow the positions of the view to be changed. We can choose between a zoom on the hits and a global view around the hits. To save the selected contig sequence, we open the Download/View Sequence/Evidence page. This page reports the positions on the chromosome and on the contig. The links Display and Save to Disk allow the genomic sequence to be saved. The link View Evidence displays all the RefSeq models, GenBank mRNAs, annotated known or potential transcripts, and ESTs that align to the area of interest (see http://www.ncbi.nlm.nih.gov/entrez/genome/evvdoc.html#overlap).

2.2.2 Use of the genomic sequence

Zoomed sequence

By blasting a homologous mRNA of another organism (e.g. human) against the Mouse Genome, we can find conserved regions on the mouse chromosome. We can then select the zoomed sequences of the similar regions of the mouse chromosome and blast them against the EST-Mouse database (cf. part 1.2 above) or against the TIGR Mouse database (cf. BLAST search, part 1.1.1 above). If we find an EST similar to a genomic region, we can confidently predict that this region is a conserved coding region and belongs to an exon of the mouse mRNA we are looking for.

Large genomic sequence

Some exon prediction algorithms (cf. Exons prediction, part 3 below) can analyze a large genomic sequence around the hit of the original EST's sequence. Moreover, this sequence could contain all or part of the gene we are looking for. By blasting this mouse genomic sequence against the Human Genome, we can find the conserved regions. If the human homologue is known and well annotated, and if these regions coincide with exons of the homologous gene, we can consider these regions to be exons or coding regions of the mouse gene.
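The difference between chromosome positions and contig positions noted in parts 2.1 and 2.2.1 can be handled as a constant offset. The sketch below assumes that the contig lies on the chromosome in the forward orientation with no gaps inside the region of interest; that assumption, and the function names, are illustrative and not part of the NCBI tools. The figures are those reported in the Results for Mouse Contig 1 (chromosome 9 positions 45382000 to 45411000, contig NW_000352 positions 4588388 to 4617388).

```python
def contig_offset(chrom_pos: int, contig_pos: int) -> int:
    """Offset between chromosome and contig coordinates for one known position pair."""
    return chrom_pos - contig_pos


def chrom_to_contig(chrom_pos: int, offset: int) -> int:
    """Convert a chromosome position to the corresponding contig position."""
    return chrom_pos - offset


def contig_to_chrom(contig_pos: int, offset: int) -> int:
    """Convert a contig position to the corresponding chromosome position."""
    return contig_pos + offset


# One pair of equivalent positions fixes the offset (assuming forward orientation).
offset = contig_offset(chrom_pos=45382000, contig_pos=4588388)

# The other end of the region should then map consistently.
print(offset)
print(chrom_to_contig(45411000, offset))
print(contig_to_chrom(4617388, offset))
```

If the two ends of a region give different offsets, the contig is probably reversed or interrupted relative to the chromosome, and the positions should be read directly from the Download/View Sequence/Evidence page instead.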

3 Exons prediction using genomic sequence

If we did not find any homologue or any mouse mRNA sequence using the techniques described above, exon prediction may provide information about the constitution of the mRNA. Furthermore, in cases where the complete CDS has been found, it is still of interest to define the non-coding regions of the mRNA.

3.1 Process of prediction

The algorithms we used are Genscan (http://genes.mit.edu/genscan.html) and Grail (http://compbio.ornl.gov/grailexp/). Their options allow the kind of organism we work on to be selected. Grail predictions may be verified against the nucleotide and EST databases, and Grail can sometimes predict the promoter of a predicted gene.

The process involves analyzing, with these algorithms, mouse genomic sequences of different sizes covering the region around the EST's position. The algorithms often give different results, but some exons are very well predicted (with a good score) by both methods and with different sizes of genomic sequence.

3.2 Prediction analysis

To have a clear view of their positions on the mouse chromosome, it is useful to blast the different predicted mRNA sequences against the Mouse Genome (cf. Chromosome location, part 2.1 above). Comparing the positions of the predicted exons with the hits of the homologous human gene (if known) along the mouse chromosome allows the prediction to be confirmed.

We can also blast these sequences against the Human Genome (cf. Chromosome location, part 2.1 above). If the homologue is well annotated, hits of the predicted mRNA that coincide with exons of the human gene confirm that the corresponding regions are exons of the mouse mRNA we are looking for.

It is also informative to blast the predicted mRNAs against the EST-Mouse database (cf. part 1.2 above) and the TIGR Mouse database (cf. BLAST search, part 1.1.1 above). In this manner, ESTs may be found that confirm that the predicted exons are part of the mRNA we are looking for.

For each EST that appears to correspond to our mRNA, we must confirm that it derives from the same chromosome as our mRNA. Because of the default options of the algorithms, some short EST sequences appear in the results even if their scores are poor. If the MGI Report does not give the location (cf. MGI Report, part 1.1.2 above), we have to blast the EST's sequence against the Mouse Genome. If the EST's sequence is too short, we have to change the default options (cf. Chromosome location, part 2.1 above).

Then, by comparing the Chromosome Views of the ESTs, the predicted mRNAs, and the regions conserved between mouse and human, we can check which exons are confirmed.
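The final comparison of Chromosome Views amounts to checking which predicted exons overlap an EST hit or a conserved human region on the same chromosome. The following sketch shows one possible way to do this bookkeeping; the function names, the inclusive (start, end) interval representation and all coordinates are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the chromosome, both inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def confirm_exons(predicted_exons: List[Interval],
                  est_hits: List[Interval],
                  human_hits: List[Interval]) -> List[Tuple[Interval, List[str]]]:
    """List, for every predicted exon, which kinds of evidence overlap it."""
    report = []
    for exon in predicted_exons:
        evidence = []
        if any(overlaps(exon, hit) for hit in est_hits):
            evidence.append("EST")
        if any(overlaps(exon, hit) for hit in human_hits):
            evidence.append("human")
        report.append((exon, evidence))
    return report


# Invented coordinates, all assumed to lie on the same chromosome.
predicted = [(100, 200), (500, 650), (900, 1000)]
est_hits = [(150, 400)]
human_hits = [(520, 600), (180, 210)]

for exon, evidence in confirm_exons(predicted, est_hits, human_hits):
    print(exon, evidence if evidence else "unconfirmed")
```

An exon supported by both kinds of evidence is the most reliable; an exon with no overlapping hit is not disproved, only unconfirmed.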

4 Search in RIKEN 5'-ends sequences database

The most valuable information we are looking for is the 5'-end of the mRNA. If this sequence is known, it is possible to design a primer and amplify the mRNA of interest by RT-PCR.

4.1 Process of the search

The Gene Science Laboratory of the Genome Exploration Research Group of RIKEN works on the mouse full-length cDNA encyclopedia project. This project involves collecting data on most of the mouse full-length cDNAs, their primary structures and expression sites. It builds databases of mouse 5'- and 3'-ends and of full-length cDNA sequences. These databases are available on the web site http://genome.gsc.riken.go.jp/.

To search these databases, we select the link Search RIKEN Mouse cDNA Encyclopedia on the Home page, or we select the link Our Activities. The Our Activities page displays tools to work on this Encyclopedia. The link Homology search on Our database displays the page of the RIKEN Mouse Encyclopedia Index, where the link Homology search leads to a Query page. On this page, we can enter a nucleotide sequence and blast it against the RIKEN databases. A field allows only one or two of the databases to be selected.

The result page gives the ID numbers and links of the cDNA sequences that align with the input sequence. For each cDNA sequence, a link leads to an Information Sheet where we can find the nucleotide sequence and other information about it.

4.2 Which sequence for which result

When we have the mRNA or predicted mRNA corresponding to our EST, we can blast it against the RIKEN 5'-ends sequences database. If we find a 5'-end sequence corresponding to the 5'-end of our mRNA, we have enough information to define a primer.

If we do not find a corresponding 5'-end, the prediction together with the hits of the human homologue suggests the position of the 5'-end on the mouse genome. A BLAST of a part of the genomic sequence around this position (cf. Genomic sequence selection, part 2.2 above) against the RIKEN 5'-end cDNA sequences database may allow the 5'-end sequence to be identified.

Sometimes, the Information Sheet displays a 3'-end sequence associated with the 5'-end sequence we found. In that case, we have to blast these two sequences against the Mouse Genome. With this information, we can check whether our EST corresponds to this 3'-end sequence. Hence, we know whether we have predicted the correct gene and not the following gene on the mouse chromosome.

5 Hypothetical mRNA, primers

From the data we managed to collect for our mouse EST, we can construct hypothetical mRNAs for this EST. It is then possible to select specific sequences that could serve as primers to confirm the hypotheses.

5.1 Mouse mRNA

Sometimes it may be possible to find the mouse mRNA sequence that corresponds to our EST. In that case, the primer can be the 5'-end of this mRNA sequence. If the 5'-end has been confirmed by a BLAST against the RIKEN 5'-end cDNA sequences database, we can use the RIKEN sequence to design the primer.

5.2 Hypothetical mRNA

If we do not have an experimentally determined mouse mRNA sequence, but only the predicted mRNA, several different predictions for the constitution of the mRNA may be equally compatible with the data. These hypotheses are built by considering which exons correspond to a hit of the human homologue on the mouse chromosome, which exons correspond to a mouse EST, and whether the RIKEN 5'-end has been found.

If we have confirmed the 5'-end of the predicted mRNA, we can use the RIKEN 5'-end sequence to design a primer. If we do not have confirmation of the 5'-end of the predicted mRNA, we can use the first predicted exon to design a primer. If the subsequent amplification does not work, the 5'-end of the human homologous mRNA (if known) may also be tried.

If we want to confirm our hypotheses concerning the constitution of the mRNA, we can use each of the predicted exon sequences, or the EST sequences that confirm the predicted exons, to design the primers.

6 Confirmation by RT-PCR

The next step consists of confirming the hypothesis. To do so, we perform RT-PCR using the different primers designed above. The northern blot of the RT-PCR product shows whether or not the primer belongs to our mRNA. It also gives the size of the RT-PCR product. If the size of the RT-PCR product is similar to the estimated size (cf. Introduction above), it is highly probable that the primer we used is the 5'-end of the mRNA we want to study.

The RT-PCR products are all sequenced, and an analysis of the sequences (BLAST against the Mouse Genome, cf. Chromosome location, part 2.1 above) will reveal the real exons. By comparing the Chromosome Views, we will confirm or reject the hypothetical exons defined in the method.
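Taking a primer from the 5'-end of a confirmed or predicted sequence, as described in part 5, can be sketched in a few lines. This is an illustration under explicit assumptions that do not come from the report: the 20-nucleotide primer length is an arbitrary default, the melting temperature uses the rough Wallace rule (2 °C per A or T, 4 °C per G or C), and the input sequence is a made-up placeholder. Real primer design would also check specificity and secondary structure.

```python
def forward_primer(five_prime_sequence: str, length: int = 20) -> str:
    """Return the first `length` nucleotides of the 5'-end sequence, upper-cased."""
    if length > len(five_prime_sequence):
        raise ValueError("sequence is shorter than the requested primer length")
    return five_prime_sequence[:length].upper()


def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees Celsius: 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Placeholder sequence standing in for a RIKEN 5'-end or a first predicted exon.
placeholder_five_prime = "atgc" * 10

primer = forward_primer(placeholder_five_prime)
print(primer, gc_content(primer), wallace_tm(primer))
```

The same helper can be applied to each predicted exon or confirming EST when primers are needed to test a hypothesis about the constitution of the mRNA.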

Results & Hypotheses

We now present the information and data we found, using the method described above, about 3 ESTs that are particularly down-regulated in HD mice.

1 ESTs F and C

These 2 ESTs seem to have a strong impact on the pathogenesis of Huntington disease. They are both strongly down-regulated in the HD 150-1 8weeks mouse model. This down-regulation was confirmed in the C01 16 weeks mouse model, as can be clearly seen on the northern blots (cf. Figure 1 below). ESTs F and C will be studied together because, as we will see below, they are linked to each other.

[Figure 1: Northern blots of ESTs C and F in wild-type and HD mice. Labels on the blots: 28S; 4.6 kb; 5 kb.]

1.1 Description

1.1.1 EST F: TC454157

The size of the F mRNA has been estimated by northern blot at 5 kb (cf. Figure 1 above). The sequence of the F EST we have is 2729 bp (cf. sequence in Supplement 1: F EST). This sequence is made up of a sequence of 2413 bp sequenced in the lab (and not yet submitted to GenBank) and the 3'-end of the sequence we found in the MGI Report for TC454157. Its position on mouse chromosome 9 is 45406900-45409650.

1.1.2 EST C: TC473206

The sequence C we have is 568 bp (this sequence is available in the TIGR Database, cf. Methodology, part 1.1.2). The size of the C mRNA estimated by northern blot is about 4.6 Kbp (cf. Figure 1 above). The position of this sequence on mouse chromosome 9 is 45409890-45410540.

1.2 Extension in 5' direction

We found that the 5' sequence of mouse cDNA BI101131 (918 bp) extends sequence F (identity between positions 367-687 of sequence BI101131 and the 5'-end, positions 1-313, of sequence F). We merged the F EST sequence of 2729 bp with the first 366 nucleotides of BI101131; the 3'-end of BI101131 was thus discarded. We obtained an extended F EST of 3095 bp (cf. sequence in Supplement 2: Extended F EST). Its positions on mouse chromosome 9 are 45405500-45405700 and 45406650-45409650. This extension will have to be confirmed.

1.3 Homology

We took the mouse genomic sequence Mouse Contig 1, from positions 45382000 to 45411000 on mouse chromosome 9 (= positions 4588388-4617388 on the mouse contig NW_000352, cf. Methodology, Genomic sequence selection, part 2.2), to look for parts of mouse chromosome 9 that are conserved in the human genome (cf. Methodology, Chromosome location, part 2.1). We found that this part of mouse chromosome 9 is similar to a part of human chromosome 11.
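The merge described in part 1.2 (the first 366 nucleotides of BI101131 placed in front of the 2729 bp F EST) can be checked with a short sketch. The actual sequences are in the Supplements and are not reproduced here, so the strings below are placeholders of the stated lengths; the function name is hypothetical.

```python
def extend_five_prime(extension: str, original: str, prefix_length: int) -> str:
    """Prepend the first `prefix_length` nucleotides of `extension` to `original`.

    The rest of `extension` (the part overlapping `original`) is discarded.
    """
    if prefix_length > len(extension):
        raise ValueError("prefix_length exceeds the length of the extension sequence")
    return extension[:prefix_length] + original


# Placeholder sequences with the lengths given in the text (not the real sequences).
bi101131 = "A" * 918   # mouse cDNA BI101131, 918 bp
f_est = "C" * 2729     # F EST, 2729 bp

extended_f = extend_five_prime(bi101131, f_est, prefix_length=366)
print(len(extended_f))
```

The length of the result, 366 + 2729 = 3095 bp, matches the size reported for the extended F EST.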
1.3.1 Homology to an mRNA similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492)

The view of human chromosome 11 below shows that most of the hits of the mouse genomic sequence Mouse Contig 1 fall within the gene of a human mRNA similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492). However, most of these hits do not coincide with an exon of this mRNA. This could be because the mRNA is a prediction (the predicted exons could be too short, and the similar regions could be extensions of these predicted exons). Alternatively, another gene could be interleaved with the Human Precursor gene. The predicted Human Precursor mRNA (XM_171492) is only 1182 bp long. Its position on human chromosome 11 is 120364500-120392000 (cf. view in part 1.3.3 below).

1.3.2 Homology to human cDNA (AK092285)

We found that EST C is similar to the 3'-end of the human cDNA AK092285, whose function is unknown. Its size is 2766 bp. On human chromosome 11 it lies immediately after the 3'-end of the human mRNA (XM_171492) described above. To see the positions of the conserved regions on mouse chromosome 9, we also took a human genomic sequence, Human Contig 1, from the similar region of human chromosome 11. We used positions 120357929-120394073 of human chromosome 11 (= positions 7543414-7579558 on the human contig NT_035088; cf. Methodology, Genomic sequence selection, part 2.2).

1.3.3 Views on the two genomes (Figure 2)

Figure 2 shows the conserved parts of the mouse genomic sequence and the hits of the human cDNA (AK092285) along human chromosome 11, as well as the conserved parts of human chromosome 11 on mouse chromosome 9. The annotated gene LOC 254798 corresponds to the mRNA similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492). The green labels mark the regions that are similar in both genomes and are used to map them between mouse chromosome 9 and human chromosome 11. Because the two homologous regions are not oriented in the same direction in the two genomes, the hit numbers are reversed. Nevertheless, hit 1 of Mouse Contig 1 on the human chromosome is exactly the sequence to which hit 1 of Human Contig 1 aligns (by BLAST) on mouse chromosome 9.
We also report the positions of the regions of similarity between the exons of the Human Sodium Channel Beta-2 Subunit Precursor mRNA (XM_171492) and mouse chromosome 9. The black labels show the positions of the exons constituting this human mRNA (XM_171492) and its hits in the mouse genome.
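The chromosome and contig coordinates quoted in parts 1.3 and 1.3.2 differ by a constant offset within each window. A minimal sketch of the conversion follows, assuming that each contig lies in the same orientation as its chromosome and has no gaps inside the window. The class `ContigMap` is a hypothetical helper written for this illustration, not part of any database tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContigMap:
    """Linear mapping between chromosome and contig coordinates."""
    contig: str
    chrom_start: int
    contig_start: int

    @property
    def offset(self) -> int:
        # Constant difference between the two coordinate systems.
        return self.chrom_start - self.contig_start

    def to_contig(self, chrom_pos: int) -> int:
        return chrom_pos - self.offset

    def to_chrom(self, contig_pos: int) -> int:
        return contig_pos + self.offset


# Window start positions as quoted in the report.
mouse = ContigMap("NW_000352", chrom_start=45382000, contig_start=4588388)
human = ContigMap("NT_035088", chrom_start=120357929, contig_start=7543414)

# The window end positions quoted in the report should map onto each other.
print(mouse.to_contig(45411000), human.to_contig(120394073))
```

Both window ends map back to the contig positions given in the text (4617388 and 7579558), so the quoted coordinate pairs are internally consistent.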

Figure 2: View on human chromosome 11 and mouse chromosome 9 of the human cDNA AK092285, of Mouse Contig 1 and of Human Contig 1. Figure labels: Human cDNA (AK092285); Mouse Contig 1; Human Contig 1; Hit 6 = hit of EST C; Hit 1 = 678 bp; Hit 2 = 650 bp; Hit 3 = hit of 5'-end BB84945; Hit 3bis; Hit 4; Hit 5; Hit 1 = 5' end of Mouse Contig 1; Hit 5 = exon of Human Precursor; Hit 6 = 5'-end of Human Contig 1; exons 1 to 4.

1.4 5'-ends and confirmation of the predicted exons and conserved regions

We made 2 predictions using different sizes of mouse contig (cf. Methodology, Genome sequence selection, part 2.1, and Exons prediction, part 3.1; Supplement 3: Prediction 1 and Supplement 4: Prediction 2). We then attempted to confirm the predicted exons (cf. Methodology, Prediction's analysis, part 3.2). We found ESTs that correspond to the hits of Prediction 2 and of Human Contig 1 along mouse chromosome 9. Hit 1 of Human Contig 1 is confirmed by the mouse ESTs TC513497 (678 bp) and TC508375 (623 bp). Hit 1 of Prediction 2 and hit 2 of Human Contig 1 coincide with the mouse 5'-end EST BB643846 (650 bp). However, the associated 3'-end (BB305394, 603 bp) in the RIKEN database is in the wrong orientation. Hit 2 of Prediction 2 and hit 3 of the alignment of Human Contig 1 along mouse chromosome 9 correspond to the 5'-end EST BB592627 (228 bp). Hit 3 of the alignment of Human Contig 1 along mouse chromosome 9 is also confirmed by the 5'-end EST TC504903 (520 bp). All these EST sequences can be found with the identifier search described in part 1.1 of the Methodology.

View on mouse chromosome 9

Figure 3 shows the positions and sizes of the different ESTs, together with the sizes of the hits of Prediction 2 and of Human Contig 1 on mouse chromosome 9. The sizes used below for the hypothetical exons and conserved regions of the hypothetical mRNAs are shown in bold. The gray labels are used in the following figures to give information about the hits that are not considered to be hypothetical exons. The green labels mark the regions that are similar in the mouse and human genomes. The hits of the Human Precursor mRNA (XM_171492) along mouse chromosome 9 are shown in black.

Figure 3: View on mouse chromosome 9 of Prediction 2, of Human Contig 1 and of the ESTs confirming the predicted exons and hits. Figure labels: Prediction 2; Human Contig 1; All Confirmations; Hit 1 = 629 bp; TC513497 (678 bp) and TC508375 (623 bp), 1301 bp; Hit 1 = 95 bp; Hit 2 = 284 bp; Hit 2 = 296 bp; Hit 3 = 430 bp; Hit 4, 217 bp; BB643846, 650 bp; TC504903 (520 bp) and BB592627 (228 bp), 748 bp; Hit 3, 173 bp; Hit 4, 228 bp; Hit 5 = 183 bp; Hit 6 = 5' end of Mouse Contig 1; exons 1 to 4.

1.5 Hypotheses: F and C on the same mRNA

The F and C ESTs can be considered to belong to the same mRNA, because the mRNA sizes estimated for these two ESTs are similar and the 2 sequences lie close to each other on mouse chromosome 9.

1.5.1 Hypothesis FC1: 1 exon of 5 kb

The genomic sequence between the 5'-end of the extended F EST and the 3'-end of the C EST is 5 kb long. F and C are proposed to recognize the same mRNA, and this mRNA is proposed to consist of a single exon of 5 kb. It could be the mouse homologue of the human cDNA (AK092285).

View on mouse chromosome 9

In Figure 4, the hypothetical FC1 mRNA of 5 kb is shown in red. The hits of the Human Precursor mRNA (XM_171492) on mouse chromosome 9 are shown in black. The regions of similarity between the C EST and the human cDNA (AK092285) are shown in purple.

Figure 4: View on mouse chromosome 9 of the hypothetical mRNA FC1. Figure labels: extended F and C ESTs; hit of the 5'-end of extended F, 139 bp; hit of the F EST, 2959 bp; hits of EST C, 650 bp; mRNA FC1, 5 kb; numbered hits 3 and 4.
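The 5 kb figure used for hypothesis FC1 can be compared with the chromosome 9 coordinates given in parts 1.1 and 1.2. The sketch below assumes that the 5'-end in question is that of the extended F EST, that coordinates increase from 5' to 3' along the transcript, and that a simple end-minus-start distance is an adequate measure of length. The function name `genomic_span` is illustrative.

```python
# Mouse chromosome 9 coordinates quoted in parts 1.1 and 1.2.
EXTENDED_F_BLOCKS = [(45405500, 45405700), (45406650, 45409650)]
C_EST = (45409890, 45410540)


def genomic_span(blocks):
    """Distance between the leftmost start and the rightmost end."""
    start = min(block[0] for block in blocks)
    end = max(block[1] for block in blocks)
    return end - start


# Region covered by a single-exon FC1 transcript.
fc1_span = genomic_span(EXTENDED_F_BLOCKS + [C_EST])

# Unsequenced stretch between the end of F and the start of C.
gap_f_to_c = C_EST[0] - EXTENDED_F_BLOCKS[-1][1]

print(fc1_span, gap_f_to_c)
```

Under these assumptions the span is 5040 bp, in line with the 5 kb quoted for FC1, and the F and C ESTs are separated by only 240 bp of genomic sequence.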

1.5.2 Hypothesis FC2: promoter + 4 or 5 exons: >4 kb

We also consider the possibility that the FC mRNA contains several exons. Prediction 1 found a predicted promoter in the 5' direction from the F and C ESTs (cf. Supplement 3: Prediction 1). This could be the promoter of the FC2 mRNA. The FC2 mRNA seems to be homologous to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492), because the predicted exons all coincide with regions of similarity between this Human Precursor mRNA and mouse chromosome 9 (cf. exon positions, part 1.4 above). If we add up the 4 exons of the hypothetical mRNA in Figure 5, we obtain 4159 bp. It should be noted, however, that predicted exons are usually shorter than the real exons, so we expect the real exons to be longer than predicted. We can also consider F and C as belonging to the same exon of 3890 bp. In that case we have an mRNA of 4 exons of about 4440 bp. This size is quite similar to the estimated size (cf. Introduction).

View on mouse chromosome 9

The hits of the Human Precursor mRNA (XM_171492) on mouse chromosome 9 are shown in black. The hypothetical FC2 mRNA is shown in red.

Figure 5: View on mouse chromosome 9 of the hypothetical mRNA FC2. Figure labels: extended F and C ESTs; Prediction 1; promoter; Hit 1 = exon 1 (5' end of the FC2 mRNA), 183 bp; Hit 2 = exon 2, 228 bp; hit of the 5'-end of extended F = exon 3, 139 bp; hit of the F EST = exon 4, 2959 bp; hits of EST C = exon 4 bis or exon 5 (3'-end of the FC2 mRNA), 650 bp.
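The two sizes quoted for hypothesis FC2 can be reproduced from the segment sizes labeled in Figure 5. The sketch below is plain arithmetic, and the dictionary keys are descriptive names chosen here. The 3890 bp value for a merged F and C exon appears to match the distance between chromosome 9 positions 45406650 (start of the F hit, part 1.2) and 45410540 (end of the C EST, part 1.1.2). That correspondence is an inference made for this example, not something stated in the text.

```python
# Segment sizes (bp) as labeled in Figure 5.
FC2_SEGMENTS_BP = {
    "exon 1": 183,
    "exon 2": 228,
    "5'-end of extended F": 139,
    "F EST hit": 2959,
    "C EST hits": 650,
}

# Chromosome 9 coordinates from parts 1.1.2 and 1.2 (assumed exon limits).
F_HIT_START = 45406650
C_EST_END = 45410540

# F and C counted as separate segments.
separate_total = sum(FC2_SEGMENTS_BP.values())

# F and C counted as one continuous exon, including the sequence between them.
merged_exon_bp = C_EST_END - F_HIT_START
merged_total = (
    separate_total
    - FC2_SEGMENTS_BP["F EST hit"]
    - FC2_SEGMENTS_BP["C EST hits"]
    + merged_exon_bp
)

print(separate_total, merged_exon_bp, merged_total)
```

The totals come out at 4159 bp and 4440 bp, the two values given for FC2.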

1.5.3 Hypothesis FC3: known 5'-end + 5 or 6 exons: >4.7 kb

We can consider that the 5'-end of TC504903 (520 bp) is the 5'-end of the FC3 mRNA and that this hypothetical mRNA is homologous to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492). We define an FC3 mRNA made up of 6 exons. The size of this hypothetical mRNA is 4907 bp (cf. Figure 6 below). We can also consider C and F as lying on the same exon of 3980 bp. We then have a hypothetical mRNA of 6 exons with a new size of 5278 bp.

View on mouse chromosome 9

The hits of the Human Precursor mRNA on mouse chromosome 9 are shown in black. The size used for hit 3 of Prediction 2 is the size of hit 5 of Human Contig 1 in this region, which is labeled in green (cf. Sizes of similar regions, part 1.1.1). The hypothetical FC3 mRNA is shown in red.

Figure 6: View on mouse chromosome 9 of the hypothetical mRNA FC3. Figure labels: extended F and C ESTs; Prediction 2; TC504903 (520 bp) and BB592627 (228 bp) = exon 1 (5'-end of the FC3 mRNA), 748 bp; Hit 2 = 284 bp; Hit 3 = exon 2, 183 bp; Hit 4 = exon 3, 228 bp; hit of the 5'-end of extended F = exon 4, 139 bp; hit of the F EST = exon 5, 2959 bp; hits of EST C = exon 5 bis or exon 6 (3'-end of the FC3 mRNA), 650 bp.

1.5.4 Hypotheses F4 and C4 on different mRNAs: F4 = promoter + 4 exons: >3.5 kb; C4 = known 5'-end + 5 exons: >3.5 kb

We can consider the possibility that gene F is similar to the Human Sodium Channel Beta-2 Subunit Precursor (XM_171492), while C is homologous to the human cDNA (AK092285). In that case, the human cDNA (AK092285) sequence we have is not the complete mRNA of that human gene. The exons of the C gene and of its human homologue (AK092285) would then be interleaved with the Sodium Channel Beta-2 Subunit Precursor gene in both organisms.

View on mouse chromosome 9

We consider that all the regions of similarity between human chromosome 11 and mouse chromosome 9 that do not belong to a Human Precursor exon are part of the C4 mRNA. In Figure 7, the hypothetical C4 mRNA of 3566 bp, consisting of 5 exons, is shown in red. The 4 hypothetical exons of the F4 mRNA of 3509 bp are shown in pink. The promoter of the F4 gene is known. Since we estimated the sizes of the hypothetical exons only from the sizes of the ESTs or of the hits in the regions of similarity between human and mouse, we consider that the real exons of the C4 and F4 mRNAs are longer.

Figure 7: View on mouse chromosome 9 of the hypothetical mRNAs F4 and C4. Figure labels for C4: TC513497 and TC508375 = exon 1 (5' end of the C mRNA), 1301 bp; BB643846 = exon 2, 650 bp; TC504903 and BB592627 = exon 3, 748 bp; Hit 4 = exon 4, 217 bp; exon 5 (3'-end of the C4 mRNA), 650 bp. Figure labels for F4: promoter of the F gene; Hit 5 = exon 1 (5'-end of the F mRNA), 183 bp; Hit 4 = exon 2, 228 bp; exon 3, 139 bp; exon 4 (3'-end of the F4 mRNA), 2959 bp. Other labels: extended F and C; Prediction 2; All Confirmations; Human Contig 1; Hit 1 = 629 bp; Hit 1 = 95 bp; Hit 2 = 284 bp; Hit 2 = 296 bp; Hit 3 = 430 bp; Hit 3, 173 bp; Hit 6 = 3' end of Human Contig 1.
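To compare the multi-exon hypotheses at a glance, the segment sizes labeled in Figures 5 to 7 can be summed and set against the northern blot estimates of part 1.1 (5 kb for F, 4.6 kb for C). In the sketch below, the pairing of each hypothesis with one of the two northern estimates is an assumption made for illustration, and `summarize` is a hypothetical helper.

```python
# Northern blot estimates (bp) from part 1.1.
NORTHERN_BP = {"F": 5000, "C": 4600}

# name: (northern estimate used for comparison, segment sizes in bp)
HYPOTHESES = {
    "FC2": ("F", [183, 228, 139, 2959, 650]),
    "FC3": ("F", [748, 183, 228, 139, 2959, 650]),
    "F4": ("F", [183, 228, 139, 2959]),
    "C4": ("C", [1301, 650, 748, 217, 650]),
}


def summarize(hypotheses, northern):
    """Return (name, segment count, total bp, shortfall bp) per hypothesis."""
    rows = []
    for name, (probe, segments) in hypotheses.items():
        total = sum(segments)
        rows.append((name, len(segments), total, northern[probe] - total))
    return rows


for name, count, total, shortfall in summarize(HYPOTHESES, NORTHERN_BP):
    print(f"{name}: {count} segments, {total} bp, "
          f"{shortfall} bp below the northern estimate")
```

The totals reproduce the sizes quoted in parts 1.5.2 to 1.5.4 (4159, 4907, 3509 and 3566 bp). The shortfalls are consistent with the remark that the real exons are expected to be longer than the ESTs and hits used to estimate them.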

1.6 Primers designed

1.6.1 Hypotheses confirmation

The first step is to determine whether or not F and C are part of the same mRNA. To test this, an RT-PCR experiment needs to be performed with the 3'-end of F as the 5' primer and the 5'-end of C as the 3' primer (for sequence F, cf. Supplement 1: EST F; for sequence C, cf. Methodology, part 1.1). If there is amplification, hypothesis F4/C4 is false. If not, hypotheses FC1, FC2 and FC3 are false.

Hypothesis FC1: 1 exon of 5 kb

Here, the extended F EST and the C EST are considered to belong to the same mRNA and to constitute 1 exon. As primers, we should therefore use the 5'-end of the EST BI101131 (cf. Supplement 2: Extended F EST) and the 3'-end of the C EST (for sequence C, cf. Methodology, part 1.1). If the RT-PCR product is larger than 4.5 kb, hypothesis FC1 is confirmed.

Hypothesis FC2: promoter + 4 or 5 exons: >4 kb

The 5'-end is hit 1 of Prediction 1, so the 5' primer could be designed from part of this predicted exon (cf. positions 1 to 156 of the sequence in Supplement 3: Prediction 1). The 3' primer is the 3'-end of the C EST (for sequence C, cf. Methodology, part 1.1). If the RT-PCR product is around 4.5 kb, hypothesis FC2 is confirmed.

Hypothesis FC3: known 5'-end + 5 or 6 exons: >4.7 kb

The 5'-end is the EST TC504903, so its sequence can be used to design the 5' primer. The 3' primer is again the 3'-end of the C EST (for the sequences, cf. Methodology, part 1.1). If the RT-PCR product is around 4.5 kb, hypothesis FC3 is confirmed.

Hypotheses F4 and C4 on different mRNAs: F4 = promoter + 4 exons: >3.5 kb; C4 = known 5'-end + 5 exons: >3.5 kb

The EST TC513497 is considered to be the 5'-end of the C mRNA. Its sequence can be used to design the 5' primer, with C as the 3' primer of the RT-PCR. If the RT-PCR product is around 4.5 kb, hypothesis C4 is confirmed. The 5'-end of the F4 mRNA is the first hit of Prediction 1, so this exon can be used as the 5' primer, and the 3' primer should be the 5'-end of the EST F (for the 5' primer, cf. positions 1 to 156 of the sequence in Supplement 3: Prediction 1; for the EST F sequence, cf. Supplement 1: EST F). If the RT-PCR product is about 2 kb, hypothesis F4 is confirmed.

1.6.2 Hypothetical exons confirmation

If none of the previous RT-PCRs reveals the 5'-end of the mRNA(s), the ESTs TC508375 and BB643846 can be tried as 5' primers. This should give an idea of the exon composition of the mRNA(s) and so make it possible to confirm (or not) the existence of most of the hypothetical exons. It is possible, however, that the 5'-end will still not have been found. In that case, the 5'-end has been neither predicted nor sequenced yet, or may lie further in the 5' direction along mouse chromosome 9. More predictions on a larger genomic sequence, or a walk along mouse chromosome 9, will then be required to find the 5'-end.
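The decision rules of part 1.6.1 can be written down explicitly. This is a sketch only: the report says "around 4.5 kb" and "about 2 kb" without giving a margin, so the 10 % tolerance below is an assumption introduced to make the rules testable, and all function names are illustrative.

```python
from typing import Callable, Dict, List

# "Around" needs a numeric meaning here; 10 % is an assumption made for
# this sketch and does not come from the report.
TOLERANCE = 0.10


def around(target_bp: int) -> Callable[[int], bool]:
    """Return a test accepting sizes within TOLERANCE of target_bp."""
    return lambda size_bp: abs(size_bp - target_bp) <= TOLERANCE * target_bp


def excluded_by_f_to_c(amplified: bool) -> List[str]:
    """Hypotheses ruled out by the RT-PCR from the 3'-end of F to the 5'-end of C."""
    return ["F4/C4"] if amplified else ["FC1", "FC2", "FC3"]


# Product-size criteria quoted for each hypothesis-specific primer pair.
SIZE_CRITERIA: Dict[str, Callable[[int], bool]] = {
    "FC1": lambda size_bp: size_bp > 4500,
    "FC2": around(4500),
    "FC3": around(4500),
    "C4": around(4500),
    "F4": around(2000),
}


def is_supported(hypothesis: str, product_bp: int) -> bool:
    """True if the product size meets the criterion quoted for the hypothesis."""
    return SIZE_CRITERIA[hypothesis](product_bp)


print(excluded_by_f_to_c(amplified=True))
print(is_supported("FC1", 5000), is_supported("F4", 2600))
```

Keeping the first experiment (F to C) separate from the size criteria mirrors the order proposed in the text: the first RT-PCR splits the hypotheses into two groups, and the hypothesis-specific primer pairs are then used within the surviving group.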

2 EST B: TC492434

The EST B is 1489 bp long. The size of the B mRNA estimated by northern blot is about 6 kb. This sequence is located on mouse chromosome 2 at position 159645900-159657380. This EST has been shown to be down-regulated in some HD mouse models.

2.1 Homology to the human protein phosphatase 1, regulatory subunit 16B (PPP1R16B) (XM_028840)

The predicted mouse mRNA XM_149233 (949 bp) is similar to the B EST. Figure 8 shows that it aligns (by BLAST) with the 3'-end of the human protein phosphatase 1, regulatory subunit 16B (PPP1R16B) mRNA (XM_028840, 6162 bp). The size of this human mRNA is similar to the expected size of the B mRNA, so the B gene we are looking for seems to be the homologue of this human gene.

Figure 8: View on the human gene XM_028840 (human chromosome 20) of the mouse mRNA XM_149233. Figure labels: mouse mRNA XM_149233; human mRNA XM_028840, exons 1 to 8 = nucleotides 1 to 1246 (1246 bp); exon 9 = nucleotides 1247 to 6113 (4866 bp).

2.2 Hypothesis B: TC478977 + gap + B EST = 6412 bp

The sequence TC478977 (2279 bp) is the mRNA of the mouse protein phosphatase 1 regulatory subunit 16B. This mRNA sequence is incomplete: it does not contain the region recognized by the EST B. However, the 5'-end of the mRNA sequence TC478977 is the real 5'-end of the mouse phosphatase 1 regulatory subunit 16B mRNA, because we found the matching 5'-end EST sequence BB642406 (657 bp). We note, however, that TC478977 (2279 bp) + B EST (1489 bp) = 3768 bp in total, which is shorter than the estimated size.

The view of human chromosome 20 (cf. Figure 8 above) shows the complete human XM_028840 gene. We reported the position of the 9th exon along the mRNA sequence, and its size. We did the same (cf. Figure 9 below) for the 9th exon of the mouse phosphatase mRNA. The first 8 exons contain almost the same number of nucleotides in the two homologues. The mouse genomic sequence between the 5'-end of the 9th mouse exon and the 3'-end of the B EST is also of a similar size to the 9th exon of the human mRNA (>4.8 kb, cf. positions in Figure 9 below). The hypothetical mRNA of 6412 bp was therefore defined as the first 8 exons of the mouse mRNA TC478977 together with a 9th exon of 4985 bp (in red in Figure 9 below).
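The size argument for hypothesis B is simple arithmetic on the values quoted above and labeled in Figures 8 and 9. A short sketch, in which only the variable names are new:

```python
# Sizes quoted in parts 2, 2.1 and 2.2 (bp).
TC478977_BP = 2279
B_EST_BP = 1489
NORTHERN_B_BP = 6000           # "about 6 kb" by northern blot
HUMAN_XM_028840_BP = 6162
MOUSE_EXONS_1_TO_8_BP = 1427   # Figure 9
HYPOTHETICAL_EXON_9_BP = 4985  # Figure 9

# Known sequence: the incomplete mRNA plus the EST.
known_total = TC478977_BP + B_EST_BP
shortfall = NORTHERN_B_BP - known_total

# Hypothetical mRNA: first 8 exons plus one long 9th exon.
hypothetical_total = MOUSE_EXONS_1_TO_8_BP + HYPOTHETICAL_EXON_9_BP

print(known_total, shortfall, hypothetical_total)
print(hypothetical_total - HUMAN_XM_028840_BP)
```

The known sequences add up to 3768 bp, about 2.2 kb short of the northern estimate, whereas the hypothetical mRNA of 6412 bp is within a few hundred bp of both the northern estimate and the human mRNA.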

Figure 9: View on mouse chromosome 2 of the hypothetical B mRNA. Figure labels: TC478977, exons 1 to 8 of the B gene = nucleotides 1 to 1427 (1427 bp); exon 9 = nucleotides 1424 to 2255 (833 bp); EST B; hypothetical exon 9 of the B gene = 4985 bp.

2.3 B primer

To confirm this hypothesis it is enough to use the 3'-end of the mouse mRNA TC478977 (to obtain the sequence, cf. Methodology, part 1.1.2). The 5' primer should be taken from the 5' part of the sequence between nucleotides 1424 and 2255. The 3' primer should be the 5'-end of the B EST. If the PCR product is between 2.5 and 3 kb, our hypothesis is confirmed and we will have found the complete B mRNA we are looking for.

Acknowledgments

First, I would like to thank Dr. Nobuyuki Nukina for his invitation to work in his laboratory of Structural Neuropathology. He gave me the wonderful opportunity to come to work in Japan, and especially in the prestigious Brain Science Institute of RIKEN. Throughout the internship, he was always available to discuss my results and hypotheses.

I also thank Dr. Fumitaka Oyama for all the explanations he provided about my data. Each time I had a problem with my results, he was available to help me solve it. He also gave me the guidance I needed to organize my work during the 2 months of my internship.

Thanks also to the secretary of the Structural Neuropathology group, Miss Harumi Taniguchi, who was always ready to provide immediate help with the many technical problems my colleague and I had during the training period. I thank in particular this colleague, Katrin Lindenberg, for our discussions about our results and for our many expeditions of discovery and shopping in Tokyo.

I also thank the whole team of the Structural Neuropathology Lab for its kindness. I particularly thank David Chapmon for his help in finding medical care, for translating during the consultation and for correcting my English pronunciation.

I thank all the summer students and the many foreign researchers at RIKEN for helping me have a good time in Japan by showing me the entertaining parts of Tokyo and by advising me about the Japanese way of life.

I also thank Jean-Michel Fayard, Guillaume Beslon, Hedi Soula and all my teachers for the help and advice they gave me before my departure and during the internship.

F EST >EST F: 2413 bp (sequenced in the lab, not submitted to GenBank yet)+ 316 bp of TC454157= 2729 bp TCTCTCCCCAGCCAGGGCTTCCTAGGGACAAGGGTTGGTTGACTGGGGGAGGAAGCCTACAGG AGATTGAAGACAGGGAAGGGAGGGGCTGGAGTGGTGTGGAAGGTTGGTTCCCGGATCCTGGGC ACGTGGGGTCTCCTTTAGATTTTCCCCTCTGTGAAGCCTTGTTTTCTCCTCAGTTTTCCTTCTGAT CTTTCACCAGGAAATCGGGGTGACCAGTGAGGGCTGCTTCCAAAGCTGGGGTTTGGAGATGGGT AGAGGGTGACCGCTTCAGAAGCTGGGAATGCACAAGAAGTCTAGAATGGTGTCTTCTGGGGGGG GGGGCAGTTGTGAGAGGCAAGCTGGGCTCTGAAGAATATCAGGCTTCTGGAAGTTCCTTTAGAG AGGACTTCTCTTTCCCTTACCCTAGAACACCTGCCCACACTGTCCTGGCTCCCCGACCAGCCTCC TCCTGCTGCCTGCCTAGTCTGTCTTTGCTCTCTGGGCTGCAGCTGCTGAGGAGGCTTGTGGGGA GGGGGCAGCCTCCACTCTCCTGGAGCACTGGGGTGCTATTTGCAGCTATACTGGCTTTGCTCTTT GGGTTTCAGAGGCAGGAGAACAGTGCCCCTGGTCTCCTAGCCTTTGGAATGTCTACCCCAGCCC TACAAGACTGACAGCCCTTGTCCTTGGCATGGCAGGACCATGCCACCCTGGCACTTCCGGAGCT CAGTTTTTCACTCTTCTTCCCTTCCCTTGAAACAGCTGGCATTGCCACCTTCCCTGAGGGATGCTT TCCTAGGACTTGTCATCTCATACCTTTGCTCCTTCTGTGTCCATCCAGCATGCCTGGCCTTCCCCT GCTCCTGGCCCCCCAGCTCTGGGTCTGCCTTTGCCTCAGGGACCCTTGTTTCCAGATGAGAAGG CCCTTGGCTTTTCCAGCTTCTTTTTTGCCCAGCTGGGCTGACTCCTCGCCTAGCCTGAGGCTGAG GAGGAGCTGGGAGAAGGTACTCACACCTTCTCTTGACTTCTGGCAGAGCCGGCTTGCACACCCC CTGAGTGTGGGGCTAGATTGTGCCTTAGTTCCTCGAGTCCTGGTTCTGAGCCCCTTTTCTTTCGG CTCACACTCCCTGAATTAATTGCACAGCTTGGTGTGACTTTGGCGGGGCTCCCCAGCTCCTTACC CCAAAGCCATGGAAGAGACCATGAAGCCGGGGTTGGTGGCAACCTTGATGACACCTGAGGGCA CCCTTTCTTGTCCCTGACATGGAGATAGGATGGCATTTGATGTGGGACCTTCAGATGGGTTTGAC CGTGTACAAACCGTAGTGCTAGCTAGGGTTTCTGTGATGTATGAAATGGGATACCCAAAGTCCCT CTTCCTCATCAGATTTCTGATACCCTTAATGTCAGAAGATGGAGATTAGTCCTCTTTTCAGGGGGG TGTAAGGACTGCTACAGGCTCTGCCCAGGAGTAGCTGAAGGTTCCCCCCCCAAATGGAAGTTGG GGGAGACTAAGGCACAGTAGGATCTGTAGGTGACTGTGGCTTTGGCTAGTGTCTGTTGCCCAAG CCAAGGGGCTCTTGGGGTTGCCTCTACTCTTCCCATTCTTCTTTACCCAGAACTCATTGTGAGCT GGGTAAAAATTGCCCATCTCCTGCTTTTTAAATATTTATTTGAGCAGAGTCTCATGTGTGGCCCAG GCGGGCCTCCACCTCTCTATGTAGCCAAGACTGGCCTTGAACTCCCAATCTCCTGCCTCCATTGC CACAGTGCTGGTATGACAGGTGTGAGCCCACACCCTGCTTAGAGTAACCTTGCTCTGAGAACCAA 
CATGGCACCCGAGCCTCCAGCCATTCAGGAAACTTCCAGCTGCCTTCATGTAAAACTGCTTTCTC CCCCAACACTGGAAGAGGCCAAGTGTTGGGGGTTCTTCTTGCTTTCCTGAGAGGAAGCCAAGGC ATAGAGCAGAAGAGAGGGAGGGACTCTCCCTTCCCAGCTTCCTGCTCATTGTCAGCTTATAGGCA GCCCTTGCAGCTTCTCCCATCTACCCAAAGGGTGAAATAATACCTACCTCACAGGACTGCAGTGA GGCTTGGTGAGATTTTTGTGTTTTTTGTTTTTTTGGCCTGGCTTGGAAAGGCACTGGGAAACAAG GCTAATAACCAGCGAGAATGTTCCACATCTATCCTGTCCTCATCTCTGGTTTGCATCCCAATAATA TGCATATGCCTCATTCTTCTTCCTTTAGCAACCTTAGGCATCATGACTCAGATGCTTAAAGCATCTT TGTCCCCGGTTCTTTTTTTTTTTTTTTTTTTTTTTTGATGGAGGTACCTGGGACTATGGGAGTACTT TTTTATATTGTTGTTGCCCCAATGCCTGTGATAAATACTAGCGTTTAATGGATAGGGATTAAGAGC ACAAATCTCAGTCC TCTTAACAAAGAATGTCTGGCCTAGTGCTAGCGGCATGCCTGTGCAGGCATTACCACGGATTGTG TTAGAATGTATATTTGCAAAGCCATTTTCTCTAGCCAGACCCTCTGACAGGCAAGTCTTCAAATAG CGATCTCAGGGTTGCTGAGGTTGGTCCCGGTGCCAGTGGGCTACAGCACCTCTCATACGGTTGA CTTTGGGGAAACCTGGACCCATGCAGTTGTGTTGACCTTGATGTCAGTGAGACCAAAGACAAAGC ACAAGTACCTTACTCTTGACTTCCAAATAAACTTCTGCCCTTGAGGGCTCAGAAAA Supplement 1

Extended F EST >Extended F EST : 366 bp of 5 -end of BI101131 +2413 bp (sequenced in the lab, not submitted to GenBank yet)+ 316 bp of TC454157= 3095 bp ATTGGAAAAAGTGGACAACACGGTGACTCTCATCATCCTGGCTGTGGTGGGCGGGGTCAT TGGACTTCTTGTGTGCATCCTTCTGCTGAAGAAGCTCATCACCTTCATCCTGAAGAAGAC CCGAGAGAAGAAGAAGGAGTGTCTCGATGAGTTCCTCTGGGAATGACAACACAGAGAACG GGTTGCCTGGCTCCAAGGCAGAAGAGAAGCCACCCACAAAAGTGTGAGGCCCTGCTCGGGCCAAGCAGGG CAGGGAGCCTCGCTTTCTGATGGTGATCCTGATGCCAAGTCCTATCTGAG ATGTGTGCTGCTTGGCCCAAACTGTTCTTTCTGAGCAGGAAGGACCTGGCCCTGCCCAGC TGCCGT TCTCTCCCCAGCCAGGGCTTCCTAGGGACAAGGGTTGGTTGACTGGGGGAGGAAGCCTACAGGAGATTGAA GACAGGGAAGGGAGGGGCTGGAGTGGTGTGGAAGGTTGGTTCCCGGATCCTGGGCACGTGGGGTCTCCTT TAGATTTTCCCCTCTGTGAAGCCTTGTTTTCTCCTCAGTTTTCCTTCTGATCTTTCACCAGGAAATCGGGGTGA CCAGTGAGGGCTGCTTCCAAAGCTGGGGTTTGGAGATGGGTAGAGGGTGACCGCTTCAGAAGCTGGGAATG CACAAGAAGTCTAGAATGGTGTCTTCTGGGGGGGGGGGCAGTTGTGAGAGGCAAGCTGGGCTCTGAAGAAT ATCAGGCTTCTGGAAGTTCCTTTAGAGAGGACTTCTCTTTCCCTTACCCTAGAACACCTGCCCACACTGTCCT GGCTCCCCGACCAGCCTCCTCCTGCTGCCTGCCTAGTCTGTCTTTGCTCTCTGGGCTGCAGCTGCTGAGGA GGCTTGTGGGGAGGGGGCAGCCTCCACTCTCCTGGAGCACTGGGGTGCTATTTGCAGCTATACTGGCTTTG CTCTTTGGGTTTCAGAGGCAGGAGAACAGTGCCCCTGGTCTCCTAGCCTTTGGAATGTCTACCCCAGCCCTA CAAGACTGACAGCCCTTGTCCTTGGCATGGCAGGACCATGCCACCCTGGCACTTCCGGAGCTCAGTTTTTCA CTCTTCTTCCCTTCCCTTGAAACAGCTGGCATTGCCACCTTCCCTGAGGGATGCTTTCCTAGGACTTGTCATC TCATACCTTTGCTCCTTCTGTGTCCATCCAGCATGCCTGGCCTTCCCCTGCTCCTGGCCCCCCAGCTCTGGG TCTGCCTTTGCCTCAGGGACCCTTGTTTCCAGATGAGAAGGCCCTTGGCTTTTCCAGCTTCTTTTTTGCCCAG CTGGGCTGACTCCTCGCCTAGCCTGAGGCTGAGGAGGAGCTGGGAGAAGGTACTCACACCTTCTCTTGACT TCTGGCAGAGCCGGCTTGCACACCCCCTGAGTGTGGGGCTAGATTGTGCCTTAGTTCCTCGAGTCCTGGTT CTGAGCCCCTTTTCTTTCGGCTCACACTCCCTGAATTAATTGCACAGCTTGGTGTGACTTTGGCGGGGCTCC CCAGCTCCTTACCCCAAAGCCATGGAAGAGACCATGAAGCCGGGGTTGGTGGCAACCTTGATGACACCTGA GGGCACCCTTTCTTGTCCCTGACATGGAGATAGGATGGCATTTGATGTGGGACCTTCAGATGGGTTTGACCG TGTACAAACCGTAGTGCTAGCTAGGGTTTCTGTGATGTATGAAATGGGATACCCAAAGTCCCTCTTCCTCATC AGATTTCTGATACCCTTAATGTCAGAAGATGGAGATTAGTCCTCTTTTCAGGGGGGTGTAAGGACTGCTACAG 
GCTCTGCCCAGGAGTAGCTGAAGGTTCCCCCCCCAAATGGAAGTTGGGGGAGACTAAGGCACAGTAGGATC TGTAGGTGACTGTGGCTTTGGCTAGTGTCTGTTGCCCAAGCCAAGGGGCTCTTGGGGTTGCCTCTACTCTTC CCATTCTTCTTTACCCAGAACTCATTGTGAGCTGGGTAAAAATTGCCCATCTCCTGCTTTTTAAATATTTATTTG AGCAGAGTCTCATGTGTGGCCCAGGCGGGCCTCCACCTCTCTATGTAGCCAAGACTGGCCTTGAACTCCCA ATCTCCTGCCTCCATTGCCACAGTGCTGGTATGACAGGTGTGAGCCCACACCCTGCTTAGAGTAACCTTGCT CTGAGAACCAACATGGCACCCGAGCCTCCAGCCATTCAGGAAACTTCCAGCTGCCTTCATGTAAAACTGCTT TCTCCCCCAACACTGGAAGAGGCCAAGTGTTGGGGGTTCTTCTTGCTTTCCTGAGAGGAAGCCAAGGCATAG AGCAGAAGAGAGGGAGGGACTCTCCCTTCCCAGCTTCCTGCTCATTGTCAGCTTATAGGCAGCCCTTGCAG CTTCTCCCATCTACCCAAAGGGTGAAATAATACCTACCTCACAGGACTGCAGTGAGGCTTGGTGAGATTTTTG TGTTTTTTGTTTTTTTGGCCTGGCTTGGAAAGGCACTGGGAAACAAGGCTAATAACCAGCGAGAATGTTCCAC ATCTATCCTGTCCTCATCTCTGGTTTGCATCCCAATAATATGCATATGCCTCATTCTTCTTCCTTTAGCAACCTT AGGCATCATGACTCAGATGCTTAAAGCATCTTTGTCCCCGGTTCTTTTTTTTTTTTTTTTTTTTTTTTGATGGAG GTACCTGGGACTATGGGAGTACTTTTTTATATTGTTGTTGCCCCAATGCCTGTGATAAATACTAGCGTTTAATG GATAGGGATTAAGAGCACAAATCTCAGTCC TCTTAACAAAGAATGTCTGGCCTAGTGCTAGCGGCATGCCTGTGCAGGCATTACCACGGATTGTGTTAGAAT GTATATTTGCAAAGCCATTTTCTCTAGCCAGACCCTCTGACAGGCAAGTCTTCAAATAGCGATCTCAGGGTTG CTGAGGTTGGTCCCGGTGCCAGTGGGCTACAGCACCTCTCATACGGTTGACTTTGGGGAAACCTGGACCCA TGCAGTTGTGTTGACCTTGATGTCAGTGAGACCAAAGACAAAGCACAAGTACCTTACTCTTGACTTCCAAATA AACTTCTGCCCTTGAGGGCTCAGAAAA Supplement 2