Gene Annotation Project. Group 1. Tyler Tiede, Yanzhu Ji, Jenae Skelton

Outline
- Tools
- Overview of 150kb region
- Overview of annotation process
- Characterization of 5 putative gene regions
- Analysis of masked regions

Annotation Tools
- Sequence analyses
  - EMBOSS tools
  - Dot Plot, word frequency, RepeatMasker, Nucleotide Density, and CpG Island
- Gene predictions
  - FGeneSH, AUGUSTUS, GeneMark
  - In the end only FGeneSH and GeneMark were used, because AUGUSTUS did not add information
- Alignments
  - NCBI, TIGR, GRAMENE
  - blastn, blastx, blastp
- Genome viewer
  - MaizeGDB

Gramene BLAST result
- 150kb region from Chr8:138800001..138950000 in the maize reference genome

Intrinsic Sequence Analysis
- GC content: 49.77%
- 99,668bp (66.45%) of bases masked
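Both figures on this slide can be recomputed from the region's sequence. The following is a minimal sketch, assuming the sequence is held as a plain string and that masked bases appear either as N or in lowercase (the slide does not say which masking style was used). The function names are illustrative only.

```python
def gc_content(seq: str) -> float:
    """Percent G+C among unambiguous bases (A, C, G, T), case-insensitive."""
    s = seq.upper()
    acgt = sum(s.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (s.count("G") + s.count("C")) / acgt


def masked_fraction(seq: str) -> float:
    """Percent of bases that are masked (assumption: 'N'/'n' or lowercase)."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base in "Nn" or base.islower())
    return 100.0 * masked / len(seq)


# Arithmetic check of the slide's numbers: 99,668 masked bases out of the
# 150,000 bp spanned by Chr8:138800001..138950000.
slide_masked_pct = 100.0 * 99_668 / 150_000
print(f"{slide_masked_pct:.2f}%")  # 66.45%
```

The GC calculation ignores N so that masked stretches do not dilute the percentage; whether the 49.77% on the slide was computed that way is not stated.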

Gene 1
- 4 exons
- Reverse strand
- 844 bp coding sequence
Gene Model
Exon  Start         End           Length  Evidence for Start                                                Evidence for End
4     45029 (1022)  45349 (1342)  320     Gene1:EST1.5; 37375054                                            Gene1:cDNA1.5; Gene1:cDNA3.5; "Exon 5" Predicted End
3     45431 (1424)  45593 (1586)  160     "Exon 4" Prediction; Gene1:cDNA1.4; Gene1:cDNA3.4; Gene1:cDNA4.4  "Exon 4" Prediction; Gene1:cDNA1.4; Gene1:cDNA3.4; Gene1:cDNA4.4
2     45673 (1666)  45796 (1789)  123     Gene1:cDNA1.0; Gene1:cDNA2.0; Gene1:cDNA3.0; Gene1:cDNA4.0        Gene1:cDNA2.0; Gene1:cDNA3.0; Gene1:cDNA4.0
1     45862 (1855)  46100 (2093)  238     "Exon 2" Prediction; Gene1:cDNA2.2; Gene1:cDNA4.2                 Gene1:cDNA2.2; 211384078; and weak support by Gene1:EST1.2
Gene Predictions
              FGeneSH Prediction  GeneMark Prediction
Exon  Strand  Start   End         Start   End
1     -       46157   46173
2     -       45862   45979
3     -       45649   45656
4     -       45431   45593       45431   45550
5     -       45230   45349       45230   45349
[Figure: Coding sequence of gene model]
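Each exon boundary in the Gene 1 model is listed twice: as a position in the 150 kb region and, in parentheses, as a position in a shorter query sequence. The two numberings differ by a constant offset, which the sketch below derives from the listed pairs. The offset is inferred from the table, not stated on the slide, and the helper names are hypothetical.

```python
# (region coordinate, parenthesised coordinate) pairs from the Gene 1 model
GENE1_PAIRS = [
    (45029, 1022), (45349, 1342),  # exon 4
    (45431, 1424), (45593, 1586),  # exon 3
    (45673, 1666), (45796, 1789),  # exon 2
    (45862, 1855), (46100, 2093),  # exon 1
]

# Every pair should imply the same offset if the numberings are consistent.
offsets = {region - local for region, local in GENE1_PAIRS}
if len(offsets) != 1:
    raise ValueError(f"inconsistent offsets: {sorted(offsets)}")
OFFSET = offsets.pop()


def to_region(local_pos: int, offset: int = OFFSET) -> int:
    """Convert a query-sequence position to a 150 kb region position."""
    return local_pos + offset


def to_local(region_pos: int, offset: int = OFFSET) -> int:
    """Convert a 150 kb region position to a query-sequence position."""
    return region_pos - offset


print(OFFSET)  # 44007
```

With this offset, the BLAST hit ranges reported for Gene 1 (for example 1423-1585) can be placed directly against the region coordinates of the model exons.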

Gene 1 cont.
- Predicted exons 1 and 3 supported by EST and cDNA
- Exon 2 not predicted by either software
- Predicted exon 4 partially supported by EST and cDNA
- Overall, expression supported by ESTs in MaizeGDB and NCBI
cDNA/EST summary from NCBI blastn
Accession ID    Query          Range      Relation to Predicted Exons  % Match  E value
gb FL442439.1   Gene1:cDNA1.5  1031-1343  5                            99       6e-156
                Gene1:cDNA1.4  1423-1585  4                            98       1e-72
                Gene1:cDNA1.0  1667-1767  -                            99       2e-41
gb FL471335.1   Gene1:cDNA2.2  1855-2094  2                            95       1e-102
                Gene1:cDNA2.0  1667-1790  -                            98       1e-52
                Gene1:cDNA2.4  1433-1537  4                            98       1e-42
gb FK984278.1   Gene1:cDNA3.4  1423-1585  4                            98       1e-72
                Gene1:cDNA3.5  1194-1343  5                            97       2e-61
                Gene1:cDNA3.0  1667-1790  -                            98       1e-52
gb CO446956.1   Gene1:cDNA4.4  1423-1585  4                            96       1e-65
                Gene1:cDNA4.0  1667-1790  -                            98       7e-51
                Gene1:cDNA4.2  1855-1977  2                            97       1e-48
TA216465_4577   Gene1:EST1.5   1023-1361  5                            96       6.1e-122
                Gene1:EST1.0   1662-1794  -                            95       1.3e-112
                Gene1:EST1.2   1849-2032  2                            83       6.1e-122
                Gene1:EST1.1   2115-2441  1                            63       1.9e-114

Gene 1 cont.
[Figure: NCBI blastx with model sequence]
[Figure: NCBI blastp with FGeneSH predicted protein as query]
- The 4 exon gene model is better supported by cDNA and EST evidence than the FGeneSH prediction
- Expression supported by ESTs and some cDNA
- Highest blastx hit: 64% match (E value 1e-30) to a hypothetical protein
  - weaker hits also include hypothetical proteins
- blastp of the FGeneSH predicted amino acid sequence yielded worse results (E values > 2)
- tblastx of the model coding sequence returned no results
- Conclusion: the region codes for either
  - an ncRNA, or
  - a novel protein not yet characterized

Gene 2
- 3 exon model
- Forward strand
- 1248 bp coding sequence
- Possible homolog to candidate gene: 1-aminocyclopropane-1-carboxylate oxidase 1, 1268 bp
Gene Model
Exon  Exon Start   Exon Stop     Exon Length  Evidence for Start                  Evidence for Stop
1     61225 (35)   61426 (235)   201          Gene2:mRNAcds1.1                    Gene2:mRNAcds1.1; Gene Predictions; 195627159 from MaizeGDB
2     61545 (355)  61789 (599)   244          Gene2:mRNAcds1.2; Gene Predictions  Gene2:mRNAcds1.2; Gene Predictions; 195627159 from MaizeGDB
3     61911 (721)  62714 (1524)  803          Gene2:mRNAcds1.3                    Gene2:mRNAcds1.3; Gene Predictions; 195627159 from MaizeGDB
Gene Predictions
              FGeneSH Prediction  GeneMark Prediction
Exon  Strand  Start   End         Start   End
1     +       61318   61426       61318   61425
2     +       61545   61789       61545   61789
3     +       61911   62499       61911   62499
FGeneSH predicted coding sequence, 942 bp:
ATGGAGATTCCGGTGATCGATCTCGGCGGCCTCAACGGCGGCGGCGAGGAGAG
GTCGCGGACCTTGGCGGAGCTCCACGACGCCTGCAAGGACTGGGGCTTCTTCTG
GGTGGAGAACCACGGCGTGGACGCGCCGCTGATGGACGAGGTCAAGCGCTTCG
TCTACGGCCACTACGAGGAGCACCTGGAGGCCAAGTTCTACGCCTCCGCCCTCG
CCATGGACCTCGAGGCCGCCACCAGAGGTGACACTGATGAGAAGCCCTCCGAC
GAGGTGGACTGGGAGTCCACCTACTTCATCCAGCACCACCCCAAGACCAACGTC
GCCGACTTCCCAGAGATCACGCCGCCGACACGAGAGACGCTGGACGCGTACGT
CGCGCAGATGGTGTCCCTCGCGGAGCGTCTGGCCGAGTGCATGAGCCTCAACCT
GGGCCTCCCCGGGGCCCACGTCGCCGCCACCTTCGCGCCGCCGTTCGTGGGCAC
CAAGTTCGCCATGTACCCGTCCTGCCCGCGCCCGGAGCTGGTGTGGGGCCTGCG
CGCGCACACCGACGCCGGCGGCATCATCCTGCTCCTCCAGGACGACGTCGTGGG
CGGCCTCGAGTTCCTCAGGGCCGGCGCCCACTGGGTCCCCGTCGGCCCCACCAA
GGGGGGCAGGCTCTTCGTCAACATCGGGGACCAGATCGAGGTCCTCAGCGCCG
GCGCCTACCGGAGCGTCCTGCACCGCGTCGCGGCCGGGGACCAGGGCCGCCGC
CTGTCCGTGGCCACGTTCTACAACCCTGGCACCGACGCCGTGGTCGCGCCGGCG
CCCCGCAGGGATCAGGACGCCGGCGCCGCGGCGTACCCCGGTCCCTACAGGTTC
GGGGACTACCTCGACTACTACCAGGGCACCAAGTTCGGCGACAAGGACGCCAG
GTTCCAGGCCGTCAAGAAGCTGCTCGGCTAA
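A predicted coding sequence such as the 942 bp FGeneSH prediction for Gene 2 can be sanity-checked for a few structural properties: a length divisible by three, an ATG start, a terminal stop codon, and no in-frame internal stops. The sketch below is a generic check, not the procedure the group used, and the short toy sequence stands in for the full prediction.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq: str) -> dict:
    """Report basic structural properties of a candidate coding sequence."""
    # Remove whitespace left over from line wrapping, then normalise case.
    s = "".join(seq.split()).upper()
    # Complete codons only, read in frame from the first base.
    codons = [s[i:i + 3] for i in range(0, len(s) - 2, 3)]
    return {
        "length": len(s),
        "multiple_of_three": len(s) % 3 == 0,
        "starts_with_atg": s.startswith("ATG"),
        "ends_with_stop": s[-3:] in STOP_CODONS,
        # Stops anywhere before the final codon would truncate the protein.
        "internal_stops": sum(c in STOP_CODONS for c in codons[:-1]),
    }


toy = "ATGGAG ATTCCG TAA"  # toy example, not the Gene 2 sequence
print(check_cds(toy))
```

For a 942 bp sequence that ends in a stop codon, the arithmetic gives 314 codons, which corresponds to a 313-residue predicted protein.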

Gene 2 cont.
- High match (99-100% identity per exon, E values from 0 to 4e-100) to maize 1-aminocyclopropane-1-carboxylate oxidase 1
- Many (>>10) ESTs align to the region, which suggests that gene 2 is expressed
- Many blastx and blastp alignments to the candidate gene in other species; the top 8 are listed in the second table
- Gene 2 may be a homolog of the candidate gene
Gene Model Match to Candidate Gene
Accession ID: Gene ID: 100283053; 1-aminocyclopropane-1-carboxylate oxidase 1
Query             Range     Relation to Predicted Exons  % Match  E value
Gene2:mRNAcds1.3  720-1524  3                            99       0
Gene2:mRNAcds1.2  354-599   2                            100      5e-124
Gene2:mRNAcds1.1  35-237    1                            100      4e-100
[Figure: cDNA from MaizeGDB; 3 exons; coordinates 35-1524]
[Figure: blastx results]
Top Hits from blastp and blastx to 1-aminocyclopropane-1-carboxylate oxidase 1
Organism                         % Match  E value
Arabidopsis thaliana             50%      2e-105
Clove pink                       43%      5e-86
Indian rice                      45%      5e-85
Japanese rice                    45%      3e-85
Kiwifruit                        43%      7e-85
Arabidopsis thaliana (L.) Heynh  43%      4e-83
Apple                            42%      1e-82
Tomato                           42%      4e-82

Gene 3
- 8 or 9 exons
  - possible alternative splicing
- 4150 bp of coding sequence for model 1; 3589 bp for model 2
- Forward strand
- Candidate gene: lycopene epsilon cyclase 1 (lcye1)
              FGeneSH Prediction  GeneMark Prediction
Exon  Strand  Start   End         Start   End
1     +       82817   83131       82817   83131
2     +       83224   83315       83270   83315
3     +       83394   83460       83420   83460
4     +       84163   84168       84163   84168
5     +       84287   84458       84287   84458
6     +       84568   84703       84568   84703
7     +       84808   85021       84919   85021
8     +       85172   85315       85406   85513
9     +       85406   85513       85611   85682
10    +       85586   85682       85986   86048
11    +       85759   85887       87022   87223
12    +       87029   87508       87302   87396
13    +       87665   88190       87494   87615
14    +       88289   89626       87686   88190
15    +       -       -           88289   89626
- Exons 1-7 of the models match the MaizeGDB model
[Figure: CDS of the MaizeGDB model]
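Because the two predictors number their exons differently, agreement in the Gene 3 table is easier to see by comparing coordinates than by comparing rows. The sketch below treats each exon as a (start, end) pair and reports exons predicted identically by both programs, plus FGeneSH exons that share only their end coordinate with a GeneMark exon. The coordinates are copied from the slide; the comparison logic is an illustration, not the group's stated method.

```python
FGENESH = [
    (82817, 83131), (83224, 83315), (83394, 83460), (84163, 84168),
    (84287, 84458), (84568, 84703), (84808, 85021), (85172, 85315),
    (85406, 85513), (85586, 85682), (85759, 85887), (87029, 87508),
    (87665, 88190), (88289, 89626),
]
GENEMARK = [
    (82817, 83131), (83270, 83315), (83420, 83460), (84163, 84168),
    (84287, 84458), (84568, 84703), (84919, 85021), (85406, 85513),
    (85611, 85682), (85986, 86048), (87022, 87223), (87302, 87396),
    (87494, 87615), (87686, 88190), (88289, 89626),
]


def compare(a, b):
    """Return (identical exons, a-only exons, b-only exons, a-only exons sharing an end with b)."""
    a_set, b_set = set(a), set(b)
    identical = sorted(a_set & b_set)
    a_only = sorted(a_set - b_set)
    b_only = sorted(b_set - a_set)
    b_only_ends = {end for _, end in b_only}
    same_end = [exon for exon in a_only if exon[1] in b_only_ends]
    return identical, a_only, b_only, same_end


identical, fg_only, gm_only, same_end = compare(FGENESH, GENEMARK)
print(len(identical), "exons predicted identically")        # 6
print(len(same_end), "FGeneSH exons sharing only the end")  # 5
```

Exons that agree on one boundary but not the other are the natural places to look at cDNA and EST alignments when choosing between the two predictions.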

Gene 3 cont.
- cDNA and EST support for expression of exons
- potential alternative splicing
mRNA evidence
Accession ID / Query    Start  End   Associated Predicted Exon(s) (FGSH/GM)  % Match  E value
gb BT037027.1; GENE ID: 100216601 LOC100216601
  Gene3:cDNA1.14/15     5892   7376  14 and 15                               99       0
  Gene3:cDNA1.13/14     5191   5794  13 and 14                               100      0
gb BT067056.1
  Gene3:cDNA2.14/15     5892   7386  14 and 15                               93       0
  Gene3:cDNA2.13/14     5290   5794  13 and 14                               88       2e-164
  Gene3:cDNA2.0/0       4448   4817  none and none                           87       6e-110
  Gene3:cDNA2.0/13      5097   5218  none and 13                             94       1e-42
  Gene3:cDNA2.0/12      4905   5007  none and 12                             95       1e-35
gb BT063754.1; GENE ID: 100280448 lcye1
  Gene3:cDNA3.11/0      3360   4245  11 and none                             100      0
  Gene3:cDNA3.1/1       329    734   1 and 1                                 100      0
  Gene3:cDNA3.7/7       2409   2626  7 and 7                                 100      1e-107
  Gene3:cDNA3.5/5       1888   2061  5 and 5                                 100      3e-83
  Gene3:cDNA3.8/0       2773   2919  8 and none                              100      3e-68
  Gene3:cDNA3.6/6       2169   2306  6 and 6                                 100      3e-63
  Gene3:cDNA3.9/8       3007   3117  9 and 8                                 100      3e-48
  Gene3:cDNA3.10/9      3187   3289  10 and 9                                99       4e-42
gb EU924262.1 / lcye-W22 allele
  Gene3:cDNA4.1/1       281    734   1 and 1                                 99       0
  Gene3:cDNA4.7/7       2409   2626  7 and 7                                 100      1e-107
  Gene3:cDNA4.0/0       3828   4017  none and none                           100      4e-92
  Gene3:cDNA4.5/5       1888   2061  5 and 5                                 100      3e-83
  Gene3:cDNA4.8/0       2773   2919  8 and none                              100      3e-68
  Gene3:cDNA4.6/6       2169   2306  6 and 6                                 100      3e-63
  Gene3:cDNA4.0/10      3588   3722  none and 10                             100      1e-61
  Gene3:cDNA4.11/0      3360   3490  11 and none                             100      2e-59
  Gene3:cDNA4.9/8       3007   3117  9 and 8                                 100      3e-48
  Gene3:cDNA4.10/9      3187   3289  10 and 9                                99       4e-42
** B73 allele supports model
GENE ID: 100216601 LOC100216601 lycopene epsilon cyclase1 [Zea mays]
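The query labels in the Gene 3 evidence table encode which predicted exons each alignment block overlaps: in a label such as Gene3:cDNA2.0/13, the number before the slash is the FGeneSH exon and the number after it is the GeneMark exon, with 0 standing for "none" (compare the "Associated Predicted Exon(s)" column). A small parser makes the labels usable for grouping or filtering. The reading of the label scheme is inferred from the table, and the class and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class HitLabel(NamedTuple):
    gene: int
    cdna: int
    fgenesh_exon: Optional[int]   # None when the label has 0 (no predicted exon)
    genemark_exon: Optional[int]  # None when the label has 0 (no predicted exon)


_LABEL = re.compile(r"Gene(\d+):cDNA(\d+)\.(\d+)/(\d+)")


def parse_label(label: str) -> HitLabel:
    """Split a label like 'Gene3:cDNA2.0/13' into its numeric parts."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognised label: {label!r}")
    gene, cdna, fgenesh, genemark = (int(x) for x in match.groups())
    # 'x or None' maps exon number 0 to None.
    return HitLabel(gene, cdna, fgenesh or None, genemark or None)


labels = ["Gene3:cDNA2.0/13", "Gene3:cDNA3.9/8", "Gene3:cDNA4.0/0"]
unpredicted = [l for l in labels if parse_label(l).fgenesh_exon is None
               and parse_label(l).genemark_exon is None]
print(unpredicted)  # ['Gene3:cDNA4.0/0']
```

Blocks that neither program predicted, such as the 0/0 labels, are the ones that carry new information about transcript structure.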

Gene 3 cont.
[Figure: blastx using model 2 CDS as query]
- When expanded, the "NADB_Rossmann superfamily" hits (blue bars) in all three reading frames line up exactly with the domains of lcye1.
- Model 1 is similar to model 2 except that the NADB_Rossmann domain is truncated at the 3' end
blastx of MaizeGDB gene model
Organism              % Match  E value
Arabidopsis thaliana  67%      0
Tomato                72%      0
Tobacco               38%      4e-89
[Figure: blastp using MaizeGDB lcye1 protein sequence as query]
Conclusion:
- blastp of the MaizeGDB lcye1 protein sequence resulted in a perfect match to Zea mays lcye1 (E value = 0)
- The PKc_like superfamily domain at the 3' end of the model sequences suggests that exons 9 and 10 of model 1 (10 and 11 of model 2) can form their own gene model for a PKc_like superfamily protein.
- cDNA and EST evidence and a blastx match support that gene model, which suggests that GENE ID: 100216601 (LOC100216601) may have been mistakenly named

Gene 4
- 4 exons
- Forward strand
- 1820 bp coding sequence
- Expression and exon positions supported by cDNA and ESTs
              FGeneSH Prediction   GeneMark Prediction
Exon  Strand  Start  End   Size    Start  End   Size
1     +       -      -     -       1      130   130
2     +       537    738   201     509    738   229
3     +       1310   1570  260     1310   1570  260
4     +       1657   2547  890     1657   2547  890
5     +       3024   3493  469     3024   3493  469
[Figure: EST evidence]
FGeneSH Model
Exon  Start          End            Size
1     109800 (537)   110001 (738)   201
2     110573 (1310)  110833 (1570)  260
3     110920 (1657)  111810 (2547)  890
4     112287 (3024)  112756 (3493)  469
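The 1820 bp coding sequence quoted for Gene 4 is the sum of the four model exon sizes, and each size on the slide equals end minus start. The sketch below reproduces that arithmetic. The `inclusive` switch is an added illustration: if both boundary bases belong to the exon, every size grows by one, and the slide does not say which convention was intended.

```python
# FGeneSH model exons for Gene 4 as (start, end) in region coordinates
GENE4_EXONS = [
    (109800, 110001),
    (110573, 110833),
    (110920, 111810),
    (112287, 112756),
]


def exon_sizes(exons, inclusive=False):
    """Exon sizes as end - start, or end - start + 1 when inclusive=True."""
    extra = 1 if inclusive else 0
    return [end - start + extra for start, end in exons]


sizes = exon_sizes(GENE4_EXONS)
print(sizes, sum(sizes))  # [201, 260, 890, 469] 1820
print(sum(exon_sizes(GENE4_EXONS, inclusive=True)))  # 1824
```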

Gene 4 cont.
- blastx of the model coding sequence results in a hit to a PKc_like superfamily domain
  - 10+ blastx hits with E values ranging from 6e-39 to 5e-43, ~35% identity
- cDNA hits, while strong matches, do not provide any additional information
[Figure: cDNA hits]

Gene 5
- 2 exon model
- Reverse strand
- 604 bp coding sequence
      FGeneSH                       GeneMark
Exon  Start          Stop           Start          Stop
1     142412 (753)   142456 (797)   -              -
2     142968 (1309)  143528 (1869)  142968 (1309)  143528 (1869)
- cDNA support for exon 2
- However, upstream, around 141,659-143,000, the query matches cDNA of transposons as well as cDNA and ESTs of random gene fragments
- cDNA and blastp (of the FGeneSH predicted protein) match the HLH superfamily (cDNA: 81%, 3e-123)
  - according to NCBI, HLH is common in DNA-binding proteins such as transcription factors

Repetitive Region Validation
- Skip the regions with predicted genes
- Database search
  - DNA-level: Maize TE database
  - Protein-level: Swiss-Prot
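Skipping the regions with predicted genes leaves the stretches between them. The sketch below computes those stretches, assuming each gene span runs between the outermost coordinates given on the gene slides (model tables for Genes 1, 2, 4, and 5, the prediction table for Gene 3) and that the region is 150,000 bp long. How the group actually delimited its numbered regions is not stated, so this is one plausible reading.

```python
REGION_LENGTH = 150_000

# Outermost coordinates per gene, taken from the gene slides (assumption).
GENE_SPANS = {
    "Gene 1": (45029, 46100),
    "Gene 2": (61225, 62714),
    "Gene 3": (82817, 89626),
    "Gene 4": (109800, 112756),
    "Gene 5": (142412, 143528),
}


def intergenic_intervals(spans, length):
    """Return 1-based inclusive intervals not covered by any span."""
    gaps, cursor = [], 1
    for start, end in sorted(spans):
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= length:
        gaps.append((cursor, length))
    return gaps


def containing_gap(region, gaps):
    """1-based index of the gap that fully contains region, or None."""
    for index, (start, end) in enumerate(gaps, 1):
        if start <= region[0] and region[1] <= end:
            return index
    return None


gaps = intergenic_intervals(GENE_SPANS.values(), REGION_LENGTH)
print(containing_gap((62729, 82396), gaps))    # 3
print(containing_gap((113683, 138555), gaps))  # 5
```

Under these assumptions, the intervals examined on the Region 3 and Region 5 slides fall inside the third and fifth gene-free stretches.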

BLASTn against maize TE database

Region 3: 62729, 82396
BLASTx against Swiss-Prot
- 1-aminocyclopropane-1-carboxylate oxidase 1
- Transposon-related protein

Region 5: 113683, 138555
- BLASTx: 30S ribosomal protein S4, Chloroplast!
- BLASTn (Maize TE database)

Region 5 cont: 113683, 138555 BLASTx, nr

Summary
- 5 regions harboring genes predicted
  - 1 possible ncRNA coding region
  - 2 candidate gene hits
    - 1-aminocyclopropane-1-carboxylate oxidase 1 homolog
    - Lycopene epsilon cyclase 1, with a PKc_like superfamily domain slightly downstream
  - 1 PKc_like superfamily hit
  - 1 region likely resulting from helitron insertion
    - Potential expression of a transcription factor
- Further analyses of the repetitive regions support the RepeatMasker results