Curating sequence and literature data for RefSeq and Gene. Kim D. Pruitt. 8th International Biocuration Conference, Training workshop, April 23, 2015

1 Curating sequence and literature data for RefSeq and Gene. Kim D. Pruitt. 8th International Biocuration Conference, Training workshop, April 23, 2015. National Library of Medicine, National Institutes of Health, DHHS, USA

2 RefSeq overview What is RefSeq? How does it compare to GenBank? What are the advantages? How is the dataset built? Curated data Sequence analysis Curation in-depth examples Data access

3 What is RefSeq? An NCBI project to provide reference sequence standards that incorporate current knowledge for genomes, transcripts, and proteins.
Counts taken in early March 2015 (columns: Vertebrates, Eukaryotes, Prokaryotes, Virus):
Genomes: ,000 4,538
Genes: 4 million, 9.2 million, 2 million, 200,000
Transcripts: 5.6 million, 11 million, 20,000, na
Proteins: 4.9 million, 10 million, 38 million, 214,287

4 RefSeq versus GenBank
Is archival (member of INSDC): GenBank: Yes. RefSeq: No.
Source of sequence: GenBank: Submitter. RefSeq: GenBank (INSDC).
Source of annotation: GenBank: Submitter. RefSeq: GenBank, Collaboration, Literature, Curation, Computation.
Genome is always annotated: GenBank: No. RefSeq: Yes for archaea, bacteria, eukaryotes.
Owner of sequence records and annotation: GenBank: Submitter. RefSeq: NCBI.
NCBI staff can update based on user requests: GenBank: Submitter must authorize. RefSeq: RefSeq may drop contamination; RefSeq may add transcript/protein/pseudogene based on data analysis and curation; RefSeq may update annotation.
Annotation may be curated by NCBI staff: GenBank: No. RefSeq: Yes.

5 15 years of building RefSeq Advantages: Consistency Non-redundant Use current names Expanded feature annotation Connected to Gene information Products & Access: Annotated genomes, transcripts, proteins Gene, BLAST, FTP, programming API Curation: Correct errors Add new records Add functional information Connect sequence to function Gene & protein names Functional sequence elements Curation focus Human Mouse Rat Zebrafish Cow Chicken

6 RefSeq's unique contribution for vertebrates: Correct transcript/protein sequence even if genome is incomplete/wrong. Clear information on data source & evidence. Connect DNA <> RNA <> Protein. Connect sequence regions to function, for both transcripts and proteins. NM_

7 RefSeq Genomes in a Nutshell Submitter GenBank/INSDC Genome Sequence Assembly (Annotate) Submit Nucleotide Assembly Protein BioSample SRA (reads) BioProject Sequence Meta-data Data Submissions BLAST FTP Web eutils Access RefSeq Genome Resources Gene Tracks BLAST FTP RefSeq Creation Annotation Pipeline RefSeq Curation Collaboration RefSeq Process Flows Reports Assembly HomoloGene

8 RefSeq genomes: Leveraging computation & curation. Model Organism Databases, Nomenclature Groups, International CCDS Collaboration, UniProtKB/Swiss-Prot, miRBase, Genome Reference Consortium (GRC). Quality Checks. RefSeqs. Curated RefSeqs. Iterative process. Genes. Curation: Literature Review, Sequence Analysis. Annotation Pipeline: Align: RefSeq, cDNAs, Proteins, RNA-Seq. Filter: Best hits. Interpret: Build models. Call orthologs: vs. human. Assign GeneID. Assign Accession. Public release. User Feedback! Iterative process. Model RefSeqs. Gene, FTP, Nucleotide, Protein

9 Annotation - a conservative approach Annotate every exon that is observed once? X 1. STAG3L5P-PVRIG2P-PILRB readthrough Consolidate information to represent supported genes and transcripts! 2. stromal antigen 3-like 5 pseudogene 3. poliovirus receptor related immunoglobulin domain pseudogene 4. paired immunoglobin-like type 2 receptor beta (regulation of inflammatory responses)

10 Annotation pipeline results in NCBI Gene Access genome annotation information including RNA-Seq tracks Rabbit - GeneID: Assembly: OryCun2.0 Configure Model RefSeqs Not annotated in Ensembl 76 Ensembl track RNA-Seq tracks Interpreted introns Curated Track names Exon coverage Log2 scale graphs

11 How to identify a RefSeq sequence record. Keyword: RefSeq. Accession format: two alpha + _ + 6-9 digits, or two alpha + _ + GenBank accession. RefSeq categories (transcripts & proteins): Known RefSeq: subject to curation, accession prefix N*_. Model RefSeq: evidence-based predictions, accession prefix X*_.
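The accession rules on this slide can be sketched as a small check. This is a minimal illustration, not an NCBI tool: the function name `classify_refseq` and the regular expression are assumptions built only from the slide's wording (two letters, an underscore, 6-9 digits, optional version). It does not handle the second form (two letters + _ + GenBank accession), and it applies the N*_ / X*_ split only in the transcript and protein sense used on the slide.

```python
import re

# Assumed pattern: two uppercase letters, "_", 6-9 digits, optional ".version".
_REFSEQ_RE = re.compile(r"^([A-Z]{2})_(\d{6,9})(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return 'known', 'model', or None for a RefSeq-style accession string."""
    match = _REFSEQ_RE.match(accession.strip())
    if match is None:
        return None  # not in the two-letter + underscore + digits form
    prefix = match.group(1)
    if prefix.startswith("N"):
        return "known"  # N*_ prefix: subject to curation
    if prefix.startswith("X"):
        return "model"  # X*_ prefix: evidence-based prediction
    return None  # other prefixes are outside the two categories on the slide
```

A string such as `NM_000000.1` (a placeholder in RefSeq format) is reported as known, and `XM_000000000.1` as model.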

12 RefSeq overview Curated data Genes Sequence Publications Imported data Sequence analysis Curation in-depth examples Data access

13 BULK PROCESSES CURATION Import Add data from collaborators Review data Gene information Gene-2-sequence associations Publications Data from collaborators Update DB Add, update, remove accessions to match GenBank Resolve Errors Remove wrong name synonyms, publications Fix sequence associations Update gene type Correct collaborator Gene: NCBI Gene associations QA Identify data conflicts for curator review Add data Create RefSeq records RefSeq Attributes & Summary Transcript variant description Alternate names, publications

14 How do we curate? Collaborations: Nomenclature, MODs, UniProt, Genome Reference Consortium, individual scientists. In-depth sequence analysis: Genome, transcript and protein sequence; Alignments; RNA-Seq; QA tests; Epigenomics; Clinical variants. Literature review. Vertebrate transcripts: Validation, Collaboration, Sequence Analysis, Guidelines, Literature. Curation: mRNA, ncRNA, protein, and pseudogene records. Genome Annotation. WWW, FTP, BLAST

15 Tracking data & curation consistency. Data management: Specifications for the product. Relational database to track data and curation decisions over time. Process flows. Data validation. Disaster recovery/backup. Public access. Curation management: Standard operating procedures. Curation decision trees: ncRNA <> pseudo <> protein-coding? 5' complete transcript <> partial? Sequence analysis tools and CGIs. Support collaborations

16 What do we curate? Genes: Type, location, length. Names, Summary. Publications. Gene-2-accession bins. Imported data. Sequence: Accuracy, length. Alternate splice products. Sequence features. Functional regions. Protein-coding, Pseudogene, ncRNAs, Unknown??? RefSeq: National Center for Biotechnology Information Gene:

17 Curating Literature Curation Review for Genes Move to correct gene Add functional citations Mark to include on RefSeq GeneRIF submissions from public Add RefSeq attribute and citation Most publications are added from: National Library of Medicine MeSH indexing service Sequence records Nomenclature groups, MODs, GO, OMIM, GWAS catalog, more

18 GeneRIFs: an annotated bibliography. RefSeq curators review GeneRIF submissions from individuals to correct spelling, check the gene association, and remove irrelevant submissions.

19 Curation supports data import processes. Sources: HGNC, RGD, MGD, QTL db, OMIM, Pseudogene.org, Xenbase, ZFIN, miRBase, CGNC. FTP/API. Generic Processing Dataflow: Compare to known data. Update if OK (Gene Backend Database). Report for curation if conflicts found

20 Curating data import errors
Manually add or update some data. HGNC may have: HGNC ID 1 = genome location x = ENSG ID 1. Processing can't identify the corresponding GeneID. A curator reviews the genomic location and either updates or creates a Gene record.
Coordinate with data sources to reconcile data association conflicts between sites. NCBI may have: Gene ID 1 = HGNC ID 1 = Accession 123. HGNC may have: HGNC ID 1 = Gene ID 1 = Accession 234. NCBI may have: Accession 234 = GeneID 2 = HGNC ID 2 (a paralog).
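The "compare to known data, report if conflicts found" step can be illustrated with a short sketch. It is not the in-house pipeline: the function name `find_association_conflicts`, the dictionary layout, and the identifier strings are all hypothetical, modeled on the placeholder IDs used on this slide.

```python
def find_association_conflicts(ncbi, hgnc):
    """Compare two {hgnc_id: (gene_id, accession)} maps and list disagreements.

    Each conflict is (hgnc_id, field, ncbi_value, hgnc_value). Only HGNC IDs
    present in both maps are compared; anything returned would go to a curator.
    """
    conflicts = []
    for hgnc_id in sorted(ncbi.keys() & hgnc.keys()):
        ncbi_gene, ncbi_acc = ncbi[hgnc_id]
        hgnc_gene, hgnc_acc = hgnc[hgnc_id]
        if ncbi_gene != hgnc_gene:
            conflicts.append((hgnc_id, "gene_id", ncbi_gene, hgnc_gene))
        if ncbi_acc != hgnc_acc:
            conflicts.append((hgnc_id, "accession", ncbi_acc, hgnc_acc))
    return conflicts


# Placeholder records mirroring the slide's second scenario.
ncbi_view = {"HGNC ID 1": ("Gene ID 1", "Accession 123")}
hgnc_view = {"HGNC ID 1": ("Gene ID 1", "Accession 234")}
conflicts = find_association_conflicts(ncbi_view, hgnc_view)
```

Here the two sites agree on the gene but not on the accession, so a single accession conflict is reported for review rather than being updated automatically.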

21 RefSeq overview Curated data Sequence analysis Tools Quality assurance checks Curation in-depth - examples Data access

22 Quick access to stored BLAST results. Gene back-end curation database. In-house: Set of BLAST searches per accession. Results are stored for 3 months. Quick access to results: UniVec, EST, NR, Genome. View hits in NCBI's genome browser. blastn, blastx, blastp

23 Sequence and alignment analysis using NCBI's Genome Workbench. An application for viewing and analyzing sequence data from NCBI databases, or for uploading your own data for analysis. Compiled for several operating systems. Analysis: BLAST and more. Supports many display options: graphical alignments, dot plot, phylogenetic trees, more

24 General layout. Data display area. Project Tree shows loaded data. Search for features, search the sequence, search for open reading frames. Monitor the progress of analysis tasks

25 Multi-pane cross alignment view Turkey_2.01 Chromosome 1 Turkey_5.0 Chromosome 1

26 Search

28 Load a set of protein accession.version numbers Select accessions to include in your analysis Select the analysis option from the Tool menu

29 Load a set of protein accession.version numbers Select accessions to include in your analysis Select analysis option from the Tool menu

30 Display the phylogenetic tree calculated from selected CELF proteins.

31 Genome Workbench: multiple protein alignment display. Curation use: Orthology review, Gene type review, Sequence conservation

32 RADAR: a Genome Workbench plug-in for RefSeq curation. RefSeq Analysis, Display, and Recommendation. New RefSeq QA Strain Library. Displays information on: genomic region, gene annotation; RNA-seq called introns; CpG islands, repeats, variation, more; QA results for newly built RefSeqs; aligned RefSeqs, cDNAs, ESTs. Coding sequence region (green). Strain data. Clone library. Stored in DB with quality concern (D). Multiple alignments to the genome (M). Consensus splice sites (a, d). Mismatches. Indels. Unaligned ends (not shown)

33 RADAR Functions RNAseq supported intron ORF finder Signal peptides Transmembrane regions Compare/diff transcripts Find similar transcripts Integrated QA tests View nucleotide View translation Links to web for details

34 PROCESS CURATION Import Add data from collaborators Review data Gene information Gene-2-sequence associations Publications Data from collaborators Update DB Add, update, remove accessions to match GenBank Resolve Errors Remove wrong name synonyms, publications Fix sequence associations Update gene type Correct collaborator Gene: NCBI Gene associations QA Identify data conflicts for curator review Add data Create RefSeq records RefSeq Attributes & Summary Transcript variant description Alternate names, publications and GeneRIF

35 Quality assurance tests Transcript tests protein tests genome tests alignment tests Sequence tested Results over time Results summary Tests are available in the NCBI C++ toolkit Details (not shown)

36 RefSeq overview Curated data Sequence analysis Curation in-depth examples Work flow Making decisions Working with collaborators RefSeq curated data is in Gene Annotating RefSeq records Data access

37 General process flow for manual transcript-based curation. Identify quality full-length cDNAs or ESTs. Identify splice variants and assess their protein-coding capacity. Extend 5' and 3' ends using all aligning transcript data (gt/ag splice sites). Determine the supported complete CDS. Protein-coding variant that encodes an alternate C-terminus. Non-coding variant that is subject to nonsense-mediated decay (NMD). Representative RefSeqs: NMs, NR

38 Transcript-based curation process. Example: Human DNAJC22 gene (Gene ID: 79962). RefSeqs are constructed using RADAR. NCBI RADAR: NC_ Chromosome 12 GRCh38.p2 (similar to UCSC hg38). Curated NMs are based on full-length transcripts. RNA-seq alignments. Chr 12. Known. Model. UTRs are extended. Aligned cDNAs. Model XMs are created computationally based on transcript and RNA-seq data and often lack full-length support.

39 Determining protein-coding potential of a variant. Example: Human CCNO gene (Gene ID: 10309). Three non-coding RefSeqs (NRs) were made to represent full-length transcript variants that either lack an open reading frame (ORF) that meets our quality criteria or have an ORF that renders the transcript a candidate for nonsense-mediated decay (NMD). NCBI RADAR: NC_ Chromosome 5 GRCh38.p2 (similar to UCSC hg38). Protein-coding variant (NM_). Non-coding variants (NR_). NMD candidate. ORFs are short, < 60 aa
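A first look at ORF length, as in the CCNO example, can be approximated in a few lines. This is a simplified sketch and not RADAR's ORF finder: the function name `longest_orf_aa` is hypothetical, only the three forward frames are scanned, only ATG starts and complete stop-terminated ORFs are counted, and nothing here evaluates NMD or the other quality criteria mentioned on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_aa(sequence):
    """Length (in amino acids) of the longest ATG-to-stop ORF, forward strand."""
    seq = sequence.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG in the current open stretch
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons from the ATG up to, not including, the stop.
                best = max(best, (i - start) // 3)
                start = None
    return best


def has_short_orf(sequence, max_aa=60):
    """True if the longest ORF is below the 60 aa figure quoted on the slide."""
    return longest_orf_aa(sequence) < max_aa
```

For instance, `longest_orf_aa("ATGAAATAG")` counts two codons (ATG, AAA) before the stop.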

40 Detailed documentation improves consistency
Protein-coding RNA loci: 1 long cDNA; or 2 lines of support: overlapping partial transcripts + more support (protein homology, ORF conservation, or publication). Consensus splice sites. ORF length >= 100 aa; if < 100 aa, require more support. Not apparently a pseudogene.
Non-coding RNA loci: 1 long cDNA if > 2 exons; 2 independent lines of support if 2 exons; 5 lines of support if 1 exon. ORF length < 100 aa. No quality protein hits (blastx). Consensus splice sites. Consider if syntenic region in human, mouse. No other data (publication) indicates it is protein-coding. 3' end does not correspond to genomic polyA.
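Two of the numeric thresholds above lend themselves to a tiny worked example. The sketch below encodes only the exon-count and ORF-length rules as read from this slide; the function names are hypothetical, treating "1 long cDNA" as one line of support is an interpretation, and the real guidelines weigh further evidence (splice sites, homology, publications) that is not modeled here.

```python
def noncoding_support_needed(exon_count):
    """Lines of support the slide lists for a non-coding locus, by exon count."""
    if exon_count < 1:
        raise ValueError("exon_count must be at least 1")
    if exon_count == 1:
        return 5  # single-exon loci need the most support
    if exon_count == 2:
        return 2  # two independent lines of support
    return 1  # more than 2 exons: one long cDNA


def coding_orf_needs_more_support(orf_length_aa):
    """True when a candidate protein-coding ORF is under the 100 aa threshold."""
    return orf_length_aa < 100
```

Under these rules a 54 aa ORF would be flagged as needing additional support before a locus is represented as protein-coding.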

41 Using epigenomic data to determine 5' completeness. Example: mouse Fgd4 gene (Gene ID: ). NCBI RADAR: NC_ Chromosome 1 GRCm38. UCSC Browser: H3K4me3 tracks from the UCSC Genome Browser

42 Representing genes based on published data Example: Human APELA gene (Gene ID: ) transcript data supports an independent gene with a short ORF (54 aa) that typically would not meet RefSeq criteria for a protein-coding locus. Literature review confirms the short ORF is functional. Assembly: GRCh38.p2, chromosome 4. NCBI RADAR: NC_ Chromosome 1 GRCh38.p2 54 aa ORF Functional data support the 54 aa ORF

43 Gene type decisions depend on transcript data, epigenomics and functional studies. Example: Human FALEC gene (Gene ID: ). Assembly: GRCh38.p2; chromosome 1. NCBI RADAR: NC_ Chromosome 1 GRCh38.p2 (hg38). The locus is supported by a single two-exon EST (AL ). Epigenomic marks support the 5' completeness of the transcript data. UCSC: NC_ Chromosome 1 GRCh37 (hg19). Published data support a functional role for this lncRNA

44 Working with nomenclature groups to coordinate changes Example: Non-coding gene LINC00948 was updated to a protein-coding gene MRLN (GeneID: ). Private comments in the in-house Gene database record the curation history Human Annotation Release 107 RefSeq proteins (red)

45 Functional annotation on the RefSeq record Example: Human GHRL gene (Gene ID: 51738) - ghrelin/obestatin prepropeptide GHRL gene AAAAAA Prepro-ghrelin Signal peptide Ghrelin C-Ghrelin pro-ghrelin Ghrelin C-Ghrelin Mature peptides Ghrelin-28 Obestatin

46 GHRL annotation display in NCBI's Gene resource. Mature peptides were annotated on protein products of 8 alternatively spliced transcripts (red arrows). The Graphics display shown in NCBI's Gene resource was reconfigured to show all transcripts and proteins, and to show the protein features.

47 MicroRNA annotation: collaboration with miRBase. miRBase ID: MI. Example: Human MIR124-1 (Gene ID: ). Gene Graphics view. NCBI imports data directly from miRBase (mirbase.org). NR_ RefSeq represents the miRNA stem-loop precursor. RefSeq annotates the mature microRNAs

48 RefSeq record feature annotation for miRNAs. RefSeq NR_ Human MIR Gene ID:

49 Feature annotation More examples of feature annotation will be provided in Session 1

50 RefSeq collaborates to improve genome annotation
GRCh37 (Chromosome 7, GRCh37/hg19, NC_): Several exons of the human COPG2 RefSeq were missing in the reference genome assembly. Curators constructed the RefSeq from transcripts and reported the assembly gap to the Genome Reference Consortium (GRC).
GRCh38 (Chromosome 7, GRCh38/hg38, NC_): The gap is fixed in the updated assembly. RefSeq and Sanger collaborate to produce matching annotation on the new assembly.
CCDS: The annotated CDS is tracked by the Consensus CDS (CCDS) collaboration once NCBI and Ensembl have both annotated the protein

51 Caution: using RefSeq data from non-NCBI resources. NCBI's Graphics Viewer: GRCh38/hg38. UCSC's Genome Browser, RefSeq Genes track: GRCh37/hg19. Missing locus; missing XM_ variant; missing pseudogene locus. Also missing for UCSC GRCh38/hg38

52 RefSeq overview Curated data Sequence analysis Curation in-depth examples Data access

53 Finding RefSeq data in NCBI s Gene resource NCBI s Gene resource is primarily based on RefSeq Gene integrates data from many sources: RefSeq & GeneRIF Official Nomenclature Gene Ontology Orthologs, Pathways, Phenotypes, Variation, Protein interactions, and more Gene provides a unique ID and includes RefSeq details: RefSeq genome annotation RefSeq details including transcript variant descriptions Report of exon coordinates

54 RefSeq data in Gene. Genomic regions, transcripts, proteins: Find genome annotation details. NCBI Reference Sequences (RefSeqs): Find information for individual accessions

55 Manual curation provides annotation for Gene Example: human GHRL (GeneID:51738) Nomenclature Summary Publications RefSeq transcript variant descriptions

56 Navigating from Gene to Sequence to download

57 Nucleotide & Protein queries
Build a query starting with: refseq[filter]
Add an organism: AND human[organism]
Add a name, a RefSeq attribute, or a specific feature type: AND ghrelin-27[protein name], or AND mat_peptide[feature key], or AND obestatin[protein name]
Protein database query example: refseq[filter] AND human[orgn] AND ghrelin-27[protein name] AND mat_peptide[feature key]
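The query-building steps on this slide can be scripted. The sketch below assembles the same kind of term string and wraps it in an E-utilities ESearch URL; the helper names `build_query` and `esearch_url` are hypothetical, and the `retmax` default of 20 is an arbitrary choice for illustration. No request is sent, so the code only shows how the pieces fit together.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(*terms):
    """Join extra terms onto refseq[filter] with AND, as on the slide."""
    return " AND ".join(["refseq[filter]", *terms])


def esearch_url(db, query, retmax=20):
    """Build (but do not send) an ESearch request URL for the given database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


query = build_query(
    "human[orgn]",
    "ghrelin-27[protein name]",
    "mat_peptide[feature key]",
)
url = esearch_url("protein", query)
```

The resulting `query` matches the protein database example above; opening `url` in a browser or fetching it with an HTTP client would return the matching record identifiers.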

58 RefSeq in BLAST

59 Bulk retrievals
RefSeq FTP site: ftp://ftp.ncbi.nlm.nih.gov/refseq/ Comprehensive bi-monthly release organized by major groups (e.g., vertebrate_mammals, etc.). Weekly updates of transcript/protein records for some organisms.
Genomes FTP site: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ Releases of genome assembly and annotation data. Updated to add new file formats, when the assembly updates, or when there is a major annotation update.
Gene FTP site: ftp://ftp.ncbi.nlm.nih.gov/gene/ Reports Gene to RefSeq accession associations, and more.
NCBI Programming Utilities (eutils) support scripted retrievals. Introduction: Help:
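For scripted retrievals with eutils, one common pattern is to request records in batches. The sketch below builds EFetch URLs for FASTA output; the function name `efetch_fasta_urls` and the batch size of 200 are assumptions made for illustration, and the accessions are placeholders in RefSeq format rather than real records. As with any scripted access, NCBI's usage guidelines for request rates should be checked before running this against the live service.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_urls(db, accessions, batch_size=200):
    """Return one EFetch URL per batch of accessions, requesting FASTA text."""
    urls = []
    for start in range(0, len(accessions), batch_size):
        batch = accessions[start:start + batch_size]
        params = {
            "db": db,
            "id": ",".join(batch),  # comma-separated accession.version list
            "rettype": "fasta",
            "retmode": "text",
        }
        urls.append(EFETCH + "?" + urlencode(params))
    return urls


# Placeholder accessions; substitute real accession.version strings.
urls = efetch_fasta_urls("nuccore", ["NM_000000.1", "NR_000000.1", "XM_000000.1"], batch_size=2)
```

With three accessions and a batch size of two, the call yields two URLs; for whole-release downloads the FTP sites listed above are the better fit.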

60 User feedback and RefSeq updates Feedback: RefSeq Home page Gene report pages RefSeq Updates: subscribe to the refseq-admin mail list NCBI News

61 Acknowledgements RefSeq Curators (Vertebrates & Other taxa) Stacy Ciufo Eric Cox Diana Haddad Catherine Farrell Tamara Goldfarb Tripti Gupta Vinita Joardar Vamsi Kodali Wenjun Li Kelly McGarvey Mike Murphy Nuala O'Leary Kathleen O'Neill Shashi Pujar Bhanu Rajput Sanjida Rangwala NCBI Leadership David Lipman James Ostell Lillian Riddick Barbara Robberts Brian Smith-White Anjana Raina Vatsan Dave Webb Matt Wright Databases & programming Terence Murphy Olga Ermolaeva Craig Wallin Alex Astashyn David Maganadze Mike DiCuccio Andrei Shkeda Donna Maglott Genome Workbench & RADAR Anatoliy Kuznetsov David Falk Andrei Shkeda


Investigation of Genomic Variation in the Rising Era of Individual Genome Sequence: A Primer on Some Available Datasets and Structures Investigation of Genomic Variation in the Rising Era of Individual Genome Sequence: A Primer on Some Available Datasets and Structures September 28, 2015 A 10,000 Foot View Genomics Data at NCBI Organizational

More information

Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018

Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018 Outline Overview of the GEP annotation projects Annotation of Drosophila Primer January 2018 GEP annotation workflow Practice applying the GEP annotation strategy Wilson Leung and Chris Shaffer AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCT

More information

Types of Databases - By Scope

Types of Databases - By Scope Biological Databases Bioinformatics Workshop 2009 Chi-Cheng Lin, Ph.D. Department of Computer Science Winona State University clin@winona.edu Biological Databases Data Domains - By Scope - By Level of

More information

Transcription Start Sites Project Report

Transcription Start Sites Project Report Transcription Start Sites Project Report Student name: Student email: Faculty advisor: College/university: Project details Project name: Project species: Date of submission: Number of genes in project:

More information
