Agenda. Web Databases for Drosophila. Gene annotation workflow. GEP Drosophila annotation projects 01/01/2018. Annotation adding labels to a sequence

Similar documents
Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018

Agenda. Annotation of Drosophila. Muller element nomenclature. Annotation: Adding labels to a sequence. GEP Drosophila annotation projects 01/03/2018

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Chimp Sequence Annotation: Region 2_3

Chimp Chunk 3-14 Annotation by Matthew Kwong, Ruth Howe, and Hao Yang

ORTHOMINE - A dataset of Drosophila core promoters and its analysis. Sumit Middha Advisor: Dr. Peter Cherbas

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz]

MODULE 1: INTRODUCTION TO THE GENOME BROWSER: WHAT IS A GENE?

ELE4120 Bioinformatics. Tutorial 5

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

Why learn sequence database searching? Searching Molecular Databases with BLAST

Types of Databases - By Scope

Annotating 7G24-63 Justin Richner May 4, Figure 1: Map of my sequence

Assessing De-Novo Transcriptome Assemblies

user s guide Question 3

Guided tour to Ensembl

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

Fine scale structural variants distinguish the genomes of Drosophila melanogaster and D. pseudoobscura

RNA-Sequencing analysis

Hands-On Four Investigating Inherited Diseases


BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

user s guide Question 3

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1

Protein Bioinformatics Part I: Access to information

Gene-centered resources at NCBI

FINDING GENES AND EXPLORING THE GENE PAGE AND RUNNING A BLAST (Exercise 1)

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Bioinformatics for Proteomics. Ann Loraine

Genome Annotation. What Does Annotation Describe??? Genome duplications Genes Mobile genetic elements Small repeats Genetic diversity

Genome and DNA Sequence Databases. BME 110: CompBio Tools Todd Lowe April 5, 2007

Introduction to the UCSC genome browser

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

NCBI Reference Sequences: current status, policy and new initiatives

Interpreting RNA-seq data (Browser Exercise II)

Product Applications for the Sequence Analysis Collection

Biotechnology Explorer

Basic Bioinformatics: Homology, Sequence Alignment,

Sequence Databases and database scanning

NCBI web resources I: databases and Entrez

Curating sequence and literature data for RefSeq and Gene Kim D. Pruitt 8 th International Biocuration Conference Training workshop April 23, 2015

BLAST. Basic Local Alignment Search Tool. Optimized for finding local alignments between two sequences.

Sequence Databases. Chapter 2. caister.com/bioinformaticsbooks. Paul Rangel. Sequence Databases

Gene Signal Estimates from Exon Arrays

Epigenetics Evolution and Replacement Histones: Evolutionary Changes at Drosophila H2AvD

(C-D)(E-F) (C-D) (A-B)(C-D)(E-F) D. simulans. D. sechellia. D.melanogaster. D. yakuba. D. erecta. D. ananassae D. pseudoobscura.

Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015

Genome Annotation. Stefan Prost 1. May 27th, States of America. Genome Annotation

Gene Structure & Gene Finding Part II

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Genie Gene Finding in Drosophila melanogaster

I nternet Resources for Bioinformatics Data and Tools

Training materials.

Host : Dr. Nobuyuki Nukina Tutor : Dr. Fumitaka Oyama

Open Access. Abstract

FACULTY OF BIOCHEMISTRY AND MOLECULAR MEDICINE

Evolutionary Genetics. LV Lecture with exercises 6KP

Applied Bioinformatics Exercise Learning to know a new protein and working with sequences

Introduction to RNA-Seq

RNA-Seq Workshop AChemS Sunil K Sukumaran Monell Chemical Senses Center Philadelphia

NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy

Ensembl: A New View of Genome Browsing

RNA-Seq Software, Tools, and Workflows

Bioinformatics, in general, deals with the following important biological data:

Zhengyan Kan 1, Warren Gish 2, Eric Rouchka 1, Jarret Glasscock 2, and David States 1

Integration of data management and analysis for genome research

Introduction to RNA sequencing

Introduction to genome biology

NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins

measuring gene expression December 5, 2017

Introduction to Bioinformatics

Improved annotation with de novo transcriptome assembly in four social amoeba species

Eukaryotic Gene Structure

Mapping strategies for sequence reads

Custom TaqMan Assays DESIGN AND ORDERING GUIDE. For SNP Genotyping and Gene Expression Assays. Publication Number Revision G

Themes: RNA and RNA Processing. Messenger RNA (mrna) What is a gene? RNA is very versatile! RNA-RNA interactions are very important!

Bioinformatic tools for metagenomic data analysis

Year III Pharm.D Dr. V. Chitra

RNA-seq Data Analysis

Chapter 5. explain how information is submitted to and processed by biological databases.

MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes

RNA-Seq with the Tuxedo Suite

Identification of Single Nucleotide Polymorphisms and associated Disease Genes using NCBI resources

DroSpeGe: rapid access database for new Drosophila species genomes

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Evolutionary dynamics and tissue specificity of human long noncoding RNAs in six mammals

Entrez Gene: gene-centered information at NCBI

ARTS: Accurate Recognition of Transcription Starts in human

Next-Generation Sequencing Gene Expression Analysis Using Agilent GeneSpring GX

Genome Sequence Assembly

Next Gen Sequencing. Expansion of sequencing technology. Contents

Molecular Biology Primer. CptS 580, Computational Genomics, Spring 09

Multiple choice questions (numbers in brackets indicate the number of correct answers)

The String Alignment Problem. Comparative Sequence Sizes. The String Alignment Problem. The String Alignment Problem.

3. human genomics clone genes associated with genetic disorders. 4. many projects generate ordered clones that cover genome

Microarray Gene Expression Analysis at CNIO

Glossary of Commonly used Annotation Terms

Computational Analysis and Experimental Validation of Gene Predictions in Toxoplasma gondii

Ensembl Tools. EBI is an Outstation of the European Molecular Biology Laboratory.

Transcription:

Agenda GEP annotation project overview Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Web databases for Drosophila annotation UCSC Genome Browser NCBI / BLAST FlyBase Gene Record Finder Wilson Leung 01/2018 AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCT TAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATTCCAATAGCCGTAAGA GTTCATTTAATGACAATGACGATGGCGGCAAAGTCGATGAAGGACTAGTCGGAACTGGA AATAGGAATGCGCCAAAAGCTAGTGCAGCTAAACATCAATTGAAACAAGTTTGTACATC GATGCGCGGAGGCGCTTTTCTCTCAGGATGGCTGGGGATGCCAGCACGTTAATCAGGAT ACCAATTGAGGAGGTGCCCCAGCTCACCTAGAGCCGGCCAATAAGGACCCATCGGGGGG GCCGCTTATGTGGAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATT TAAGAAACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGGGCATACG CCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGCGCCGATAGCACCAG CGTTTGGACGGGTCAGTCATTCCACATATGCACAACGTCTGGTGTTGCAGTCGGTGCCA TAGCGCCTGGCCGTTGGCGCCGCTGCTGGTCCCTAATGGGGACAGGCTGTTGCTGTTGG TGTTGGAGTCGGAGTTGCCTTAAACTCGACTGGAAATAACAATGCGCCGGCAACAGGAG CCCTGCCTGCCGTGGCTCGTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATC ATCGGCCGGAATGNTAANGAATTAATCAAATTTTGGCGGACATAATGNGCAGATTCAGA ACGTATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATCGCAAGCT CAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCGGGGCACAATGGGGA GCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCCAAGCAAATCACTGGATGGGAGGA ACCACAATCAGATTCAGAATATTAACAAAATGGTCGGCCCCGTTGTTATGGATAAAAAA TTTGTGTCTTCGTACGGAGATTATGTTGTTAATCAATTTTATTAAGATATTTAAATAAA TATGTGTACCTTTCACGAGAAATTTGCTTACCTTTTCGACACACACACTTATACAGACA GGTAATAATTACCTTTTGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTC Start codon Coding region Stop codon Intron donor Intron acceptor UTR Annotation adding labels to a sequence Genes: Novel or known genes, pseudogenes Regulatory Elements: Promoters, enhancers, silencers Non-coding RNA: trnas, mirnas, sirnas, snornas Repeats: Transposable elements, simple repeats Structural: Origins of replication Experimental Results: DNase I Hypersensitive sites ChIP-chip and ChIP-Seq datasets (e.g. modencode) GEP Drosophila annotation projects D. melanogaster D. simulans D. sechellia D. yakuba D. erecta D. ficusphila D. eugracilis D. biarmipes D. takahashii D. elegans D. rhopaloa D. kikkawai D. bipectinata D. ananassae D. pseudoobscura D. persimilis D. willistoni D. mojavensis D. virilis D. grimshawi Reference Published TSS projects for Fall 2017 / Spring 2018 Annotation projects for Fall 2017 / Spring 2018 New species sequenced by modencode Phylogenetic tree produced by Thom Kaufman as part of the modencode project Gene annotation workflow Visualize a genomic region with evidence tracks GEP UCSC Genome Browser Identify interesting features and putative orthologs NCBI BLAST Learn about the putative D. melanogaster ortholog NCBI / FlyBase Understand the gene and isoform structure Gene Record Finder 1

UCSC Genome Browser Provide graphical view of genomic regions Sequence conservation Gene and splice site predictions RNA-Seq and splice junction predictions BLAT BLAST-Like Alignment Tool Map protein or nucleotide sequences against an assembly Faster but less sensitive than BLAST UCSC Genome Browser overview Genomic sequence Evidence tracks Table Browser Access data used to create the graphical browser Control how evidence tracks are displayed on the Genome Browser Most evidence tracks have five display modes: Hide: track is hidden Dense: all features (including overlapping features) are displayed on a single line Squish: overlapping features are drawn on separate lines, features are half the height compared to full mode Pack: overlapping features are drawn on separate lines, features are the same height as full mode Full: Each feature is displayed on its own line Some annotation tracks (e.g., RepeatMasker) only have a subset of these display modes Two different versions of the UCSC Genome Browser Official UCSC Version http://genome.ucsc.edu Published data, lots of species, whole genomes; used for Chimp Chunks GEP Version http://gander.wustl.edu GEP projects, parts of genomes; used for Drosophila annotations Additional training resources Training section on the UCSC web site http://genome.ucsc.edu/training/index.html Videos User guides Mailing lists OpenHelix tutorials and training materials http://www.openhelix.com/ucsc Pre-recorded tutorials Quick reference cards UCSC GENOME BROWSER DEMO 2

Use BLAST to detect sequence similarity BLAST = Basic Local Alignment Search Tool Why is BLAST popular? Provide statistical significance for each match Good balance of sensitivity and speed Find local regions of similarity irrespective of where they are in the sequence Common types of BLAST programs Except for BLASTN, all alignments are based on comparisons of protein sequences Decide which BLAST program to use based on the type of query and subject sequences: Program Query Database (Subject) BLASTN Nucleotide Nucleotide BLASTP Protein Protein BLASTX Nucleotide Protein Protein TBLASTN Protein Nucleotide Protein TBLASTX Nucleotide Protein Nucleotide Protein Common BLAST programs use cases BLASTN: Search for similar nucleotide sequences Map contigs to genome, mrnas/ests to genome BLASTP: Search for proteins similar to predicted genes BLASTX: Map protein / exons against genomic sequence TBLASTN: Map protein against genomic assemblies TBLASTX: Identify genes in unannotated sequences See the Guide to BLAST home and search pages document for details: ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/howto_blastguide.pdf NCBI BLAST nucleotide databases GenBank Non-Redundant Nucleotide Database (nr/nt) Most comprehensive but some entries are low quality Exclude sequences from whole genome assemblies, ESTs RefSeq RNA Database mrna and non-coding RNA entries from the NCBI Reference Sequence Project Include real and computationally predicted sequences Expressed Sequence Tag (EST) Database ESTs are short single reads of cdna clones High error rate but useful for identifying transcribed loci NCBI BLAST protein databases GenBankNon-Redundant Protein Database (nr) Most comprehensive but some entries are low quality Include sequences from both RefSeq and UniProtKB RefSeq Protein Database Sequences from the NCBI Reference Sequence Project Higher quality than the nr database Include real and computationally predicted sequences UniProtKB / Swiss-Prot Protein Database Manually curated proteins from literature Real proteins with known functions Much smaller database than either RefSeq or nr Where can I run BLAST? NCBI BLAST web service https://blast.ncbi.nlm.nih.gov/blast.cgi EBI BLAST web service https://www.ebi.ac.uk/tools/sss/ FlyBase BLAST (Drosophila and other insects) http://flybase.org/blast/ 3

National Center for Biotechnology Information (NCBI) NCBI BLAST DEMO https://www.ncbi.nlm.nih.gov Key features of NCBI FlyBase - Database for the Drosophila research community Strengths Most comprehensive among publicly available databases PubMed for literature searches Comprehensive BLAST web service Weaknesses Web site is large and complex Quality of GenBank records may vary Use cases Perform BLAST searches against Refseq, nr/nt databases Use BLAST (bl2seq) to align two or more sequences http://flybase.org/ Key Features of FlyBase Lots of ancillary data for each gene in Drosophila Curation of literature for each gene Reference Drosophila annotations for all the other databases (including NCBI) Fast release cycle (6-8 releases per year) Use cases Species-specific BLAST searches Genome browsers (GBrowse/JBrowse), RNA-Seq search, and access datasets for 20 Drosophila species Web databases and tools Many genome databases available Be aware of different annotation releases Use FlyBase as the canonical reference Web databases are being updated constantly Update GEP materials before semester starts Discrepancies in exercise screenshots Minor changes in search results Let us know about errors or revisions 4

Gene Record Finder FLYBASE DEMO http://gander.wustl.edu/~wilson/dmelgenerecord/index.html Key features of the Gene Record Finder List of unique coding and non-coding exons for each gene in D. melanogaster CDS and exon usage maps for each isoform Optimized for exon-by-exon annotation strategy Slower update release cycle than FlyBase Database is updated every semester Use cases: Get amino acid sequences and nucleotide sequences of each exon for BLAST 2 Sequences (bl2seq) searches GENE RECORD FINDER DEMO Summary GEP annotation project seeks to generate high quality manually curated gene models for multiple Drosophila species Questions? Use BLAST to characterize a genomic sequence Use web databases to gather information on a gene UCSC Genome Browser NCBI FlyBase Gene Record Finder https://www.flickr.com/photos/jac_opo/240254763/sizes/l/ 5