Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018

Similar documents
Agenda. Annotation of Drosophila. Muller element nomenclature. Annotation: Adding labels to a sequence. GEP Drosophila annotation projects 01/03/2018

Agenda. Web Databases for Drosophila. Gene annotation workflow. GEP Drosophila annotation projects 01/01/2018. Annotation adding labels to a sequence

Annotation of a Drosophila Gene

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. GEP goals: Evidence Based Annotation. Evidence for Gene Models 12/26/2018

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. Evidence Based Annotation. GEP goals: Evidence for Gene Models 08/22/2017

Small Exon Finder User Guide

Annotation Walkthrough Workshop BIO 173/273 Genomics and Bioinformatics Spring 2013 Developed by Justin R. DiAngelo at Hofstra University

Annotation of contig27 in the Muller F Element of D. elegans. Contig27 is a 60,000 bp region located in the Muller F element of the D. elegans.

Annotation of contig62 from Drosophila elegans Dot Chromosome

Annotation of Contig8 Sakura Oyama Dr. Elgin, Dr. Shaffer, Dr. Bednarski Bio 434W May 2, 2016

Transcription Start Sites Project Report

Outline 08/2018. Searching for Transcription Start Sites in Drosophila. TSS of F element genes show lower levels of H3K9me3 and HP1a

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Draft 3 Annotation of DGA06H06, Contig 1 Jeannette Wong Bio4342W 27 April 2009

Drosophila ficusphila F element

Lab Week 9 - A Sample Annotation Problem (adapted by Chris Shaffer from a worksheet by Varun Sundaram, WU-STL, Class of 2009)

MODULE 1: INTRODUCTION TO THE GENOME BROWSER: WHAT IS A GENE?

Genomic Annotation Lab Exercise By Jacob Jipp and Marian Kaehler Luther College, Department of Biology Genomics Education Partnership 2010

Annotation Practice Activity [Based on materials from the GEP Summer 2010 Workshop] Special thanks to Chris Shaffer for document review Parts A-G

MODULE 5: TRANSLATION

Annotation of Drosophila erecta Contig 14. Kimberly Chau Dr. Laura Hoopes. Pomona College 24 February 2009

MODULE TSS1: TRANSCRIPTION START SITES INTRODUCTION (BASIC)

ab initio and Evidence-Based Gene Finding

MODULE TSS2: SEQUENCE ALIGNMENTS (ADVANCED)

Annotating Fosmid 14p24 of D. Virilis chromosome 4

Annotating the D. virilis Fourth Chromosome: Fosmid 99M21

Annotating 7G24-63 Justin Richner May 4, Figure 1: Map of my sequence

Aaditya Khatri. Abstract

Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

Gene Identification in silico

Last Update: 12/31/2017. Recommended Background Tutorial: An Introduction to NCBI BLAST

Mapping strategies for sequence reads

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

Chimp Chunk 3-14 Annotation by Matthew Kwong, Ruth Howe, and Hao Yang

Sequence Based Function Annotation

Motif Discovery in Drosophila

Supplementary Information

CS313 Exercise 1 Cover Page Fall 2017

Identifying Genes and Pseudogenes in a Chimpanzee Sequence Adapted from Chimp BAC analysis: TWINSCAN and UCSC Browser by Dr. M.

Chimp Sequence Annotation: Region 2_3

ORTHOMINE - A dataset of Drosophila core promoters and its analysis. Sumit Middha Advisor: Dr. Peter Cherbas

BIO4342 Lab Exercise: Detecting and Interpreting Genetic Homology

BME 110 Midterm Examination

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

Assessing De-Novo Transcriptome Assemblies

Gene Finding Genome Annotation

ChIP-seq and RNA-seq. Farhat Habib

Annotating the Genome (H)

user s guide Question 1

ChIP-seq and RNA-seq

Answer: Sequence overlap is required to align the sequenced segments relative to each other.

Hands-On Four Investigating Inherited Diseases

MAKER: An easy to use genome annotation pipeline. Carson Holt Yandell Lab Department of Human Genetics University of Utah

Finding Genes, Building Search Strategies and Visiting a Gene Page

Analysis of RNA-seq Data

Basic Bioinformatics: Homology, Sequence Alignment,

Lecture 7 Motif Databases and Gene Finding

COMPUTER RESOURCES II:

Finding Genes, Building Search Strategies and Visiting a Gene Page

Gene Prediction. Mario Stanke. Institut für Mikrobiologie und Genetik Abteilung Bioinformatik. Gene Prediction p.

Gene Model Annotations for Drosophila melanogaster: Impact of High throughput Data

FUNCTIONAL BIOINFORMATICS

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Gene Prediction in Eukaryotes

Week 1 BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Genome annotation & EST

Transcriptome Assembly, Functional Annotation (and a few other related thoughts)

Biology. Biology. Slide 1 of 39. End Show. Copyright Pearson Prentice Hall

Biology. Biology. Slide 1 of 39. End Show. Copyright Pearson Prentice Hall

Gene-centered resources at NCBI

DNA is normally found in pairs, held together by hydrogen bonds between the bases

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz]

RNA-Sequencing analysis

Introduction to Bioinformatics CPSC 265. What is bioinformatics? Textbooks

Genome annotation. Erwin Datema (2011) Sandra Smit (2012, 2013)

Chimp BAC analysis: Adapted by Wilson Leung and Sarah C.R. Elgin from Chimp BAC analysis: TWINSCAN and UCSC Browser by Dr. Michael R.

Investigating Inherited Diseases

De novo assembly in RNA-seq analysis.

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES

Gene Annotation Project. Group 1. Tyler Tiede Yanzhu Ji Jenae Skelton

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

Finishing Drosophila Grimshawi Fosmid Clone DGA19A15. Matthew Kwong Bio4342 Professor Elgin February 23, 2010

Finishing of DELE Drosophila elegans has been sequenced using Roche 454 pyrosequencing and Illumina

Eukaryotic Gene Structure

GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment

FINDING GENES AND EXPLORING THE GENE PAGE AND RUNNING A BLAST (Exercise 1)

measuring gene expression December 5, 2017

Computational gene finding

Bacterial Genome Annotation

9/19/13. cdna libraries, EST clusters, gene prediction and functional annotation. Biosciences 741: Genomics Fall, 2013 Week 3

Fine scale structural variants distinguish the genomes of Drosophila melanogaster and D. pseudoobscura

RNA-SEQUENCING ANALYSIS

Bioinformatics Course AA 2017/2018 Tutorial 2

Exercise I, Sequence Analysis

Computational gene finding

BIOLOGY - CLUTCH CH.17 - GENE EXPRESSION.

ELE4120 Bioinformatics. Tutorial 5

RNA-Seq Analysis. August Strand Genomics, Inc All rights reserved.

Transcription:

Outline Overview of the GEP annotation projects Annotation of Drosophila Primer January 2018 GEP annotation workflow Practice applying the GEP annotation strategy Wilson Leung and Chris Shaffer AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCT TAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATTCCAATAGCCGTAAGA GTTCATTTAATGACAATGACGATGGCGGCAAAGTCGATGAAGGACTAGTCGGAACTGGA AATAGGAATGCGCCAAAAGCTAGTGCAGCTAAACATCAATTGAAACAAGTTTGTACATC GATGCGCGGAGGCGCTTTTCTCTCAGGATGGCTGGGGATGCCAGCACGTTAATCAGGAT ACCAATTGAGGAGGTGCCCCAGCTCACCTAGAGCCGGCCAATAAGGACCCATCGGGGGG GCCGCTTATGTGGAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATT TAAGAAACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGGGCATACG CCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGCGCCGATAGCACCAG CGTTTGGACGGGTCAGTCATTCCACATATGCACAACGTCTGGTGTTGCAGTCGGTGCCA TAGCGCCTGGCCGTTGGCGCCGCTGCTGGTCCCTAATGGGGACAGGCTGTTGCTGTTGG TGTTGGAGTCGGAGTTGCCTTAAACTCGACTGGAAATAACAATGCGCCGGCAACAGGAG CCCTGCCTGCCGTGGCTCGTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATC ATCGGCCGGAATGNTAANGAATTAATCAAATTTTGGCGGACATAATGNGCAGATTCAGA ACGTATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATCGCAAGCT CAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCGGGGCACAATGGGGA GCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCCAAGCAAATCACTGGATGGGAGGA ACCACAATCAGATTCAGAATATTAACAAAATGGTCGGCCCCGTTGTTATGGATAAAAAA TTTGTGTCTTCGTACGGAGATTATGTTGTTAATCAATTTTATTAAGATATTTAAATAAA TATGTGTACCTTTCACGAGAAATTTGCTTACCTTTTCGACACACACACTTATACAGACA GGTAATAATTACCTTTTGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTC Start codon Coding region Stop codon Splice donor Splice acceptor UTR GEP Drosophila annotation projects D. melanogaster D. simulans D. sechellia D. yakuba D. erecta D. ficusphila D. eugracilis D. biarmipes D. takahashii D. elegans D. rhopaloa D. kikkawai D. bipectinata D. ananassae D. pseudoobscura D. persimilis D. willistoni D. mojavensis D. virilis D. grimshawi Reference Published TSS projects for Fall 2017 / Spring 2018 Annotation projects for Fall 2017 / Spring 2018 New species sequenced by modencode Phylogenetic tree produced by Thom Kaufman as part of the modencode project Muller element nomenclature Gene structure nomenclature Schaeffer SW et al, 2008. Polytene Chromosomal Maps of 11 Drosophila Species: The Order of Genomic Scaffolds Inferred From Genetic and Physical Maps. Genetics. 2008 Jul;179(3):1601-55 Exon UTR CDS Primary mrna Protein Gene span Exons UTR s CDS s 1

GEP annotation goals Identify and annotate all genes in your project For each gene, identify and precisely map (accurate to the base pair) all Coding DNA Sequences (CDS) Do this for ALL isoforms Annotate the initial transcribed exon and transcription start site (TSS) Optional analyses not submitted to GEP Clustal analysis (proteins, promoter regions) Transposons and other repeats Synteny Non-coding genes Evidence for gene models (in general order of importance) 1. Conservation Sequence similarity to genes in D. melanogaster Sequence similarity to other Drosophila species (Multiz) 2. Expression data RNA-Seq, EST, cdna 3. Computational predictions Open reading frames; gene and splice site predictions 4. Tie-breakers of last resort See the Annotation Instruction Sheet Basic annotation workflow Annotation workflows available under the Introducing Genes section of the GEP web site Four main web sites used by the GEP annotation strategy 1. GEP UCSC Genome Browser (http://gander.wustl.edu) 2. FlyBase (http://flybase.org) Tools è Genomic/Map Tools è BLAST Jump to Gene è Genomic Location è GBrowse 3. Gene Record Finder (http://gep.wustl.edu) Projects è Annotation Resources 4. NCBI BLAST (https://blast.ncbi.nlm.nih.gov/blast.cgi) BLASTX è select the checkbox: Annotation workflow: Step 1 Two different versions of the UCSC Genome Browser Official UCSC Version http://genome.ucsc.edu Published data, lots of species, whole genomes; used for Chimp Chunks GEP Version http://gander.wustl.edu GEP projects, parts of genomes; used for annotation of Drosophila species 2

UCSC Genome Browser overview Genomic sequence Evidence tracks Control how evidence tracks are displayed on the Genome Browser Five different display modes: Hide: track is hidden Dense: all features appear on a single line Squish: overlapping features appear on separate lines Features are half the height compared to full mode Pack: overlapping features appear on separate lines Features are the same height as full mode Full: each feature is displayed on its own line Set Base Position track to Full to see the amino acid translations Some evidence tracks (e.g., RepeatMasker) only have a subset of these display modes GEP annotation strategy Use D. melanogaster as reference D. melanogaster is very well annotated Use sequence similarity to infer homology DEMO: Examine contig10 in the D. biarmipes Aug. 2013 (GEP/Dot) assembly with the GEP UCSC Genome Browser Minimize changes compared to the D. melanogaster gene model (parsimony) Coding sequences evolve slowly Exon structure changes very slowly FlyBase Database for the Drosophila research community Overview of NCBI BLAST Detect local regions of significant sequence similarity between two sequences Lots of ancillary data for each gene in D. melanogaster Curation of literature for each gene Reference for D. melanogaster annotations for all other databases Including NCBI, EBI, and DDBJ Fast release cycle (6-8 releases per year) Decide which BLAST program to use based on the type of query and subject sequences: Program Query Database (Subject) BLASTN Nucleotide Nucleotide BLASTP Protein Protein BLASTX Nucleotide Protein Protein TBLASTN Protein Nucleotide Protein TBLASTX Nucleotide Protein Nucleotide Protein 3

Where can I run BLAST? Accessing TBLASTX at NCBI NCBI BLAST web service https://blast.ncbi.nlm.nih.gov/blast.cgi EBI BLAST web service https://www.ebi.ac.uk/tools/sss/ FlyBase BLAST (Drosophila and other insects) http://flybase.org/blast/ Annotation workflow: Step 2 DEMO: Ortholog assignment for the N-SCAN prediction contig10.001.1 Gene Record Finder Observe the structure of D. melanogaster genes Retrieves CDS and exon sequences for each gene in D. melanogaster CDS and exon usage maps for each isoform List of unique CDS Designed for the exon-by-exon annotation strategy Nomenclature for Drosophila genes Drosophila gene names are case-sensitive Lowercase initial letter = recessive mutant phenotype Uppercase initial letter = dominant mutant phenotype Every D. melanogaster gene has an annotation symbol Begins with the prefix CG (Computed Gene) Some genes have a different gene symbol (e.g., ey) Suffix after the gene symbol denotes different isoforms mrna = -R; protein = -P ey-ra = Transcript for the A isoform of ey ey-pa = Protein product for the A isoform of ey 4

Be aware of different annotation releases D. melanogaster Release 6 genome assembly First change of the assembly since late 2006 Most modencode analysis used the Release 5 assembly Gene annotations change much more frequently Use FlyBase as the canonical reference GEP data freeze GEP materials are updated before the start of semester Potential discrepancies in results and screenshots See the archived BLAST results in the exercise package Let us know about major errors or discrepancies DEMO: Determine the gene structure of the D. melanogaster gene CG31997 Annotation workflow: Step 3 BLAST parameters for CDS mapping Select the Align two or more sequences checkbox Settings in the Algorithm parameters section Verify the Word size is set to 3 Turn off compositional adjustments Turn off the low complexity filter Strategies for CDS mapping Start by mapping the largest CDS The first and last CDS tend to be smaller than internal CDS in Drosophila Continue mapping CDS by size in descending order Defer mapping small or weakly conserved CDS Use placements of adjacent CDS to define the search region Use the splice donor and acceptor phases of adjacent CDS as additional constraints Strategies for finding small CDS Examine RNA-Seq coverage and TopHat junctions Small CDS is typically part of a larger transcribed exon Use Query subrange to restrict the search region Increase the Expect threshold and try again Keep increasing the Expect threshold until you get matches Also try decreasing the word size Use the Small Exon Finder Minimize changes in CDS size Available under Projects è Annotation Resources See the Annotation Instruction Sheet for details 5

EXERCISE: Map each CDS to the project sequence Use BLASTX to determine the approximate locations for the three CDS of CG31997 on contig10 Consult with each other DEMO: Map CDS 3_10798_1 of CG31997 against contig10 with BLASTX The Annotation of a Drosophila Gene document provides a step-by-step walkthrough Discussions and coffee break Annotation workflow: Step 4 Carolina Ponce: https://flic.kr/p/othbqv duncan c: https://flic.kr/p/nsfe14 Basic biological constraints (inviolate rules*) Coding regions start with a methionine Coding regions end with a stop codon Gene should be on only one strand of DNA Exons appear in order along the DNA (collinear) Intron sequences should be at least 40 bp Intron starts with a GT (or rarely GC) Intron ends with an AG * There are known exceptions to each rule modencode RNA-Seq data RNA-Seq evidence tracks: RNA-Seq coverage (read depth) TopHat splice junction predictions Assembled transcripts (Cufflinks, Oases) Positive results very helpful Negative results less informative Lack of transcription no gene GEP curriculum: RNA-Seq Primer Browser-Based Annotation and RNA-Seq Data 6

Overview of RNA-Seq (Illumina) 5 cap Poly-A tail Processed mrna AAAAAA RNA fragments (~250bp) Library with adapters 5 3 5 3 5 3 5 3 ~125bp Paired end sequencing 5 3 ~125bp RNA-Seq reads Forward Reverse DEMO: Use RNA-Seq coverage to support the placement of the start codon Wang Z et al. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10(1):57-63. Can use the TopHat splice junction predictions to identify splice sites 5 cap M Poly-A tail Processed mrna AAAAAA * RNA-Seq reads EXERCISE: Confirm the placement of the stop codon for CDS 3_10798_1 Contig TopHat junctions Intron Intron A genomic sequence has 6 different reading frames A codon could be derived from nucleotides in adjacent exons Frames 1 23 2 Spliced mrna CTG AGA GAT TTT CCG Phase 0 Phase 0 CTG AGA GT AG GAT TTT CCG Phase 1 Phase 2 CTG AGA G GT AG AT TTT CCG Frame: Base to begin translation relative to the start of the sequence Phase 2 Phase 1 CTG AGA GA GT AG T TTT CCG Donor Acceptor Intron 7

Splice donor and acceptor phases Phase depends on the reading frame Phase: Number of bases between the complete codon and the splice site Donor phase: Number of bases between the end of the last complete codon and the splice donor site (GT/GC) Acceptor phase: Number of bases between the splice acceptor site (AG) and the start of the first complete codon Phase depends on the reading frame of the CDS Phase of donor site: Phase 2 relative to frame +1 Phase 1 relative to frame +2 Phase 0 relative to frame +3 Splice donor Phase of the donor and acceptor sites must be compatible Extra nucleotides from donor and acceptor phases form an additional codon Donor phase + acceptor phase = 0 or 3 Incompatible donor and acceptor phases result in a frame shift CTG AGA G GT GT AG AT TTT CCG CTG AGA G GT AG AT TTT CCG CTG AGA GAT TTT CCG Translation: L R D F P CTG AGA GGT ATT TTC CG Translation: L R G I F Phase 0 donor is incompatible with phase 2 acceptor DEMO: Use RNA-Seq to annotate the intron between CDS 1_10798_0 and 2_10798_2 of the CG31997 ortholog EXERCISE: Determine the coordinates for CDS 2_10798_2 and 3_10798_1 of the CG31997 ortholog 8

Annotation workflow: Step 5 Verify the final gene model using the Gene Model Checker Gene model should satisfy biological constraints Explain errors or warnings in the GEP Annotation Report Compare model against the D. melanogaster ortholog Dot plot and protein alignment See Quick check of student annotations View your gene model as a custom track in the genome browser Generate files require for project submission Annotation workflow: Step 6 DEMO: Verify the proposed gene model for the ortholog of CG31997 Next step: practice annotation D. biarmipes Aug. 2013 (GEP/Dot) assembly Questions? Annotation of a Drosophila Gene onecut on contig35 ey on contig40 CG1909 on contig35 Difficulty Arl4 and CG33978 on contig10 9