Using the Potato Genome Sequence
Robin Buell, Michigan State University, Department of Plant Biology
August 15, 2010

Contact: buell@msu.edu

Whole Genome Shotgun Sequencing

New Technologies Revolutionize Sequencing
- Very high throughput
- Very inexpensive
- Usher in era of personal genomics & post-genomic biology
[Figure labels: 2002, 2010; genomes, genera]

So, you say you can sequence: now what?

Assemble Fragments

Sequencer output of random fragments:
AGCTCGCTAGCTA
CTCGCTAGCTAG
TAGCTAGC
AGCTAGGCTC
CTAGCTAGCTAGGCTC
AGCTAGC
AGCTCGCTA
GCTAGCTAGC

Assemble fragments into a consensus sequence (using computer):
AGCTCGCTAGCTAGCTAGCTAGCTAGGCTC

Fragments shown aligned under the consensus: GCTAGCTAGC, AGCTCGCTAGCTA, TAGCTAGC, TAGCTAGCTA, AGCTCGCTA, GCTAGCTAGCT, CTCGCTAGCTAG, AGCTAGC, CTAGCTAGCTAGGCTC, AGCTAGGCTC

Annotate: Gene 1, Gene 2, Gene 3
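To make the "assemble fragments" step concrete, here is a minimal sketch of a greedy overlap merge in Python, run on the eight fragments listed as sequencer output. This is a toy illustration only, not the assembler used for the potato genome; the function names and the minimum overlap of 3 bases are assumptions made for the example.

```python
# Toy greedy overlap assembly (illustrative only; real assemblers are far
# more sophisticated and handle sequencing errors, repeats and read pairs).

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def drop_contained(seqs):
    """Remove duplicates and sequences that occur inside another sequence."""
    unique = list(dict.fromkeys(seqs))
    return [s for s in unique if not any(s != t and s in t for t in unique)]


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the longest overlap."""
    contigs = drop_contained(reads)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_n:]
        rest = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])
    return contigs


# Fragments from the slide's "sequencer output" list.
READS = [
    "AGCTCGCTAGCTA", "CTCGCTAGCTAG", "TAGCTAGC", "AGCTAGGCTC",
    "CTAGCTAGCTAGGCTC", "AGCTAGC", "AGCTCGCTA", "GCTAGCTAGC",
]

contigs = greedy_assemble(READS)
for contig in contigs:
    print(len(contig), contig)
```

Every input fragment is guaranteed to occur somewhere in the returned contigs. On repeat-rich input such as this, a greedy merge may collapse repeated units and return something shorter than the true sequence, which is one reason repeats and divergent haplotypes complicate real assemblies.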

- Participants have their own grants and financing
- Data are freely available
- US funding through National Science Foundation

With so many potatoes with lots of variation, what should be sequenced?
[Image: Darth Tater]

RH89-039-16 (RH): a diploid heterozygous genotype
- Genetic map (SH x RH), >10,000 markers
- Least heterozygous parent
- Physical map
- Sanger sequencing; BAC-by-BAC strategy
- ~6,000 BACs for full coverage

RHPOTKEY BAC library (78,000 clones; 9-10 genome equivalents)
- Library clones fingerprinted with AFLP
- BAC fingerprints aligned into 6,400 contigs
- 1,600 BAC contigs anchored to RH AFLP map
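As a worked check of the library figures, the standard relation coverage = clones x mean insert size / genome size can be rearranged to give the mean insert size implied by 78,000 clones at 9-10 genome equivalents. The sketch below assumes a genome size of about 850 Mb (the estimate quoted later in this talk); the slide itself does not state the insert size, so the result is only an implied figure of roughly 100 kb.

```python
# Implied mean BAC insert size from clone count and library coverage.
# Assumption: genome size ~850 Mb, taken from the estimate later in the talk.

def implied_mean_insert_bp(n_clones, genome_equivalents, genome_size_bp):
    """Rearranged from: coverage = n_clones * insert_bp / genome_size_bp."""
    return genome_equivalents * genome_size_bp / n_clones


N_CLONES = 78_000
GENOME_SIZE_BP = 850_000_000  # assumed, see note above

for coverage in (9, 10):
    insert_bp = implied_mean_insert_bp(N_CLONES, coverage, GENOME_SIZE_BP)
    print(f"{coverage} genome equivalents -> ~{insert_bp / 1000:.0f} kb per clone")
```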

Approx. 2,000 BACs have been sequenced
- Chromosome 5: ~80%
- Chromosomes 1, 6 & 9: ~30%
- Relatively short tiling paths for some LGs
- Issues due to heterozygosity
- Slow and uneven progress
- WGS using NextGen Sequencing?

Initial Strategy
- Heterozygous clone (RH89-039-16)
- Contig assembly issues
- 2 divergent haplotypes

Revised Strategy (2008 onwards)
- Homozygous genotype (DM1-3 516R44)
- Reduced assembly issues
- 1 haplotype

Doubled monoploid line DM 1-3 516 R44 of adapted Solanum tuberosum Group Phureja (from Richard Veilleux, Virginia Tech, USA)
- Reduced complexity for whole genome shotgun sequencing due to homozygosity
- Taxonomic study (Spooner et al. 2007) suggests it is the same species as S. tuberosum
- Very slow growing, presumably due to increased genetic load caused by homozygosity exposing inferior alleles to the environment

Whole Genome Shotgun of two genotypes
- RH89-039-16 (RH): diploid heterozygote
- DM1-3 516R44 (DM): diploid homozygote
Illumina short read + Roche WGS
RNA seq: transcriptome resource
For DM: BAC end and fosmid end sequencing (Sanger) for long-range scaffolding

- Genome estimated to be ~850 million bases
- Assembled size ~730 Mb
- QC on assembly suggests it is of high quality
- Compare DM BAC sequences with assembly
- Also use paired end sequence
- Assembly v3 looks good
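The two size figures above can be turned into an approximate assembled fraction with a line of arithmetic. The sketch below uses only the rounded values from the slide (~850 Mb estimated, ~730 Mb assembled), so the result is indicative rather than exact.

```python
# Fraction of the estimated genome represented in the assembly,
# using the rounded figures quoted on the slide.

ESTIMATED_GENOME_MB = 850
ASSEMBLED_MB = 730

assembled_fraction = ASSEMBLED_MB / ESTIMATED_GENOME_MB
unassembled_mb = ESTIMATED_GENOME_MB - ASSEMBLED_MB

print(f"Assembled: {assembled_fraction:.1%} of the estimated genome")
print(f"Not in the assembly: ~{unassembled_mb} Mb")
```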

PGSC Mapping group
- Several partners mapping assembly to new map using different sequence-based marker types: SNP, SSR, DArT
- In silico anchoring using RH WGP, PoMaMo & SGN maps
- Target: >90% of assembly anchored to genetic map

What are we interested in annotating?
Genes: where, what, when
- Annotated ~40,000 genes
- Used deep transcriptome sequencing (45 libraries from RH and DM) to annotate genes and determine expression profiling patterns
- In the process of refining the annotation; some made available now

Still in the process of fixing some assembly and annotation issues

Using the potato genome sequence
Access: http://www.potatogenome.net/
Agree to the Data Access Agreement
- BLAST your query sequence against the assembly
- Download the mfasta file of scaffolds
- View genome through the Genome Browser

In Class Exercise
Reads > Contigs/Scaffolds (PGSC0003DMS) > Super Contigs/Super Scaffolds (PGSC0003DMB)
- Intro to PGSC: http://www.potatogenome.net
- Link to data: http://potatogenomics.plantbiology.msu.edu
- Test sequence: Rubisco, GenBank Accession # J03613.1
- Search Google for "ncbi entrez": http://www.ncbi.nlm.nih.gov/sites/gquery?itool=toolbar
- Download (or copy) as a FASTA-formatted sequence

Let's BLAST this gene against v3 of the DM assembly
- Go to the BLAST page: paste sequence, select blastn
- Get alignment hits (PGSC0003DMS000001195, Length = 311,235); look at the alignment (see gapped alignment):

Query start  Query end  Scaffold start  Scaffold end
1            180        265130          265309
175          315        265389          265529
314          546        265611          265843

*Note there is a paralog present in the DM genome (second best hit)

- Find this scaffold on the Genome Browser. NOTE: THE GENOME BROWSER IS SUPERSCAFFOLD (SUPERCONTIG) BASED.
- Paste PGSC0003DMS000001195 in the Landmark box, hit return
- Zoom out to 1 MB to get a perspective of this scaffold/contig relative to other scaffolds/contigs. NOTE: THE SCAFFOLDS/CONTIGS CAN BE PLACED IN EITHER ORIENTATION IN THE SUPERSCAFFOLD/SUPERCONTIG.
- Zoom in on PGSC0003DMS000001195; zoom in on 260-270kb region or 331-260 kb region
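The gapped alignment can be summarised by computing the length of each aligned block and the unaligned stretch of scaffold between consecutive blocks. The Python sketch below uses the three coordinate pairs from the BLAST hit on PGSC0003DMS000001195; the function names are made up for this example. If the query is a transcript sequence, gaps like these on the scaffold would be consistent with introns, but that interpretation is an assumption and not something the BLAST report states.

```python
# Block lengths and inter-block gaps for the gapped BLAST hit.
# Each tuple: (query_start, query_end, scaffold_start, scaffold_end), 1-based inclusive.

BLOCKS = [
    (1, 180, 265130, 265309),
    (175, 315, 265389, 265529),
    (314, 546, 265611, 265843),
]


def scaffold_block_lengths(blocks):
    """Number of scaffold bases covered by each aligned block."""
    return [s_end - s_start + 1 for _, _, s_start, s_end in blocks]


def scaffold_gaps(blocks):
    """Unaligned scaffold bases between consecutive blocks."""
    return [nxt[2] - cur[3] - 1 for cur, nxt in zip(blocks, blocks[1:])]


print("Block lengths on scaffold:", scaffold_block_lengths(BLOCKS))
print("Gaps between blocks:", scaffold_gaps(BLOCKS))
```

Note that neighbouring blocks overlap slightly on the query (for example 175-180), which is common when BLAST reports separate high-scoring pairs rather than a spliced alignment.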

Using the Potato Genome Browser
- Instructions panel: Bookmark, Hide Banner, High Resolution image, RESET
- Search: Landmark or Region; use scaffold name
- Scroll/Zoom: view selection box; move to the left either 50 or 100%; move to the right either 50 or 100%; zoom in/out 10%; flip sequence
- Overview: select region to view using the rubberband
- Tracks: select which tracks to view; Update Image; configure track order, color, etc.
- Display Settings: Show tracks, Show tooltips, Track Name

BLAST: a sequence search tool to identify sequences via sequence similarity

Step 1: Go to the PGSC BLAST site at http://potatogenomics.plantbiology.msu.edu/index.php?p=blast
Step 2: Select the type of search that you wish to use. Note that only BLASTN, TBLASTN, and TBLASTX are supported.
Step 3: Paste your favorite sequence into the search box in FASTA format.
Step 4: Select the database you wish to search. The potato genome sequence is Solanum phureja scaffolds v3. Also provided are databases of BAC and BAC end sequences from S. phureja and S. tuberosum, as well as transcript (PUTs) assemblies of potato from the ISU PlantGDB project (plantgdb.org).

Step 5: Submit your sequence for a BLAST search. An intermediate page will appear telling you that your search is in progress and that the results will be held for 15 minutes via a specific URL. In your BLAST results, the DM sequence is represented as scaffolds. A sample scaffold is listed below:

PGSC0003DMS000000150
- PGSC0003DM: denotes the PGSC version 3 assembly
- S: Scaffold
- 000000150: Unique identifier

Step 6: A link is available that allows you to download your scaffold sequence(s) of interest directly from the BLAST report. In the table of hits, simply click on the scaffold accession you wish to download and you will be presented with the PGSC data access agreement. After you accept the terms of the agreement, the scaffold file will be retrieved and packaged, and you should be prompted by your browser to save the file. Note that the DM 1-3 scaffold sequences are being made available under the terms of the PGSC data access agreement, so you must read and agree to these terms before downloading the full scaffold database.
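The accession layout described in Step 5 can be split programmatically, which is handy when post-processing a table of BLAST hits. The sketch below is an assumption-laden illustration: the regular expression is inferred from the example identifiers in this talk (PGSC0003DMS for scaffolds, PGSC0003DMB for superscaffolds), and the names `parse_pgsc_id` and `PgscId` are invented here, not part of any PGSC tool.

```python
# Split a PGSC accession into assembly prefix, sequence type and number.
# The pattern is inferred from examples in the talk and may not cover
# every identifier the PGSC uses.
import re
from typing import NamedTuple, Optional

_PGSC_ID = re.compile(r"^(PGSC\d{4}DM)([SB])(\d+)$")
_KINDS = {"S": "scaffold", "B": "superscaffold"}


class PgscId(NamedTuple):
    assembly: str  # e.g. "PGSC0003DM" (version 3 assembly of DM)
    kind: str      # "scaffold" or "superscaffold"
    number: int    # unique identifier without leading zeros


def parse_pgsc_id(text: str) -> Optional[PgscId]:
    """Return the parsed parts, or None if the text does not match."""
    match = _PGSC_ID.match(text.strip())
    if match is None:
        return None
    return PgscId(match.group(1), _KINDS[match.group(2)], int(match.group(3)))


print(parse_pgsc_id("PGSC0003DMS000000150"))
```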

Download of the potato genome sequence

While the BLAST site will assist in identifying your sequence within the PGSC DM genome assembly, you will need to download the sequence from the PGSC web site to access the scaffold and genome sequence.

Step 1: Go to http://potatogenomics.plantbiology.msu.edu/index.php?p=download
Step 2: To download the PGSC DM scaffolds, select Solanum_phureja.DM.scaffolds-v3.tar.bz2
Step 3: Read the data access agreement and, if you agree, click on "Yes, I agree to these terms".
Step 4: The sequence databases are packaged using the Tar (http://www.gnu.org/software/tar/) archiver, and then compressed using the bzip2 (http://www.bzip.org/) compression software. These programs are generally available on a Linux machine; on a Windows machine, a number of applications are available that should be capable of extracting a bzip2-compressed tar file, including WinZip, WinRAR, and WinAce.

Note: This file will be LARGE (185 Mb) and will take some time to download.
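If neither tar/bzip2 nor a Windows archiver is at hand, a bzip2-compressed tar file can also be unpacked with Python's standard `tarfile` module. This is a minimal sketch, not part of the PGSC instructions; the function name `extract_archive` and the destination directory are placeholders, and the archive name is the one given in Step 2.

```python
# Unpack a .tar.bz2 archive with the Python standard library.
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a bzip2-compressed tar archive and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:bz2") as tar:
        names = tar.getnames()
        # Only extract archives from a source you trust.
        tar.extractall(dest)
    return names


# Example call (assumes the archive from Step 2 is in the current directory):
# extract_archive("Solanum_phureja.DM.scaffolds-v3.tar.bz2", "pgsc_scaffolds")
```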

Step 5: The uncompressed archive contains:
- README: A description of the DM Scaffolds
- Data_access: Statement of data access agreement
- PGSC0003DMS.fa: Multi-fasta file of the scaffolds

Step 6: How to retrieve a specific sequence from the multi-fasta file. You can use any text editor that is capable of opening large files and doing a text search, for example 'vim' in Linux (http://blog.interlinked.org/tutorials/vim_tutorial.html) or 'vim' in Windows (http://www.vim.org/download.php#pc), 'textedit' on a Mac, or 'wordpad' on Windows. Better still, there are a number of utilities available for retrieving individual records from a fasta sequence database. The NCBI BLAST package has a utility called 'fastacmd' that serves this purpose; the equivalent utility in the WUBLAST package is called 'xdget'. Other tools are available with packages such as EMBOSS or exonerate that will also allow you to index and fetch sequences from a fasta database.
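As an alternative to a text editor or the utilities named in Step 6, a single record can be pulled out of a multi-fasta file with a few lines of plain Python. The sketch below streams the file line by line, so it does not need to load the whole scaffold file into memory. It is a simple stand-in and not a reimplementation of 'fastacmd' or 'xdget'; the function name and the toy records are invented for the example.

```python
# Fetch one record from a multi-fasta file by its identifier.
import io


def fetch_fasta_record(handle, wanted_id):
    """Return (header, sequence) for the first record whose ID matches, else None.

    The ID is taken to be the first whitespace-delimited word after '>'.
    """
    header = None
    chunks = []
    found = False
    for raw in handle:
        line = raw.rstrip()
        if line.startswith(">"):
            if found:
                break  # reached the next record: stop reading
            header = line[1:]
            words = header.split()
            found = bool(words) and words[0] == wanted_id
            chunks = []
        elif found:
            chunks.append(line.strip())
    return (header, "".join(chunks)) if found else None


# Toy data standing in for PGSC0003DMS.fa; real use would pass open("PGSC0003DMS.fa").
demo = io.StringIO(
    ">PGSC0003DMS000000001 toy record\nACGT\nACGT\n"
    ">PGSC0003DMS000000150 toy record\nGGCC\nTTAA\n"
)
record = fetch_fasta_record(demo, "PGSC0003DMS000000150")
print(record)
```

For repeated lookups against the full scaffold file, an indexed tool such as those mentioned in Step 6 will be much faster than rescanning the file each time.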