Using the Potato Genome Sequence
Robin Buell
Michigan State University, Department of Plant Biology
August 15, 2010
buell@msu.edu
Whole Genome Shotgun Sequencing
New Technologies Revolutionize Sequencing
- Very high throughput
- Very inexpensive
- Usher in era of personal genomics & post-genomic biology
[Figure labels: 2002, 2010; genomes, genera]
So, you say you can sequence. Now what?
Assemble Fragments
Sequencer output of random fragments:
AGCTCGCTAGCTA, CTCGCTAGCTAG, TAGCTAGC, AGCTAGGCTC, CTAGCTAGCTAGGCTC, AGCTAGC, AGCTCGCTA, GCTAGCTAGC
Assemble fragments into a consensus sequence (using a computer):
AGCTCGCTAGCTAGCTAGCTAGCTAGGCTC
Fragments aligned under the consensus: GCTAGCTAGC, AGCTCGCTAGCTA, TAGCTAGC, TAGCTAGCTA, AGCTCGCTA, GCTAGCTAGCT, CTCGCTAGCTAG, AGCTAGC, CTAGCTAGCTAGGCTC, AGCTAGGCTC
Annotate: Gene 1, Gene 2, Gene 3
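The idea of merging overlapping fragments into a consensus can be sketched in a few lines. This is only a toy greedy overlap merger, not the assembler used for the potato genome. The reads and the function names (`overlap`, `greedy_assemble`) are made up for illustration, and the sketch ignores sequencing errors, reverse complements and repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest suffix/prefix overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical toy reads sampled from the sequence ATGGCGTACGTTAGC
toy_reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
contigs = greedy_assemble(toy_reads)
print(contigs)
```

Highly repetitive reads, like the ones drawn on the slide, can be merged in more than one way. That ambiguity is one reason real assemblers rely on paired ends and long-range scaffolding.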
- Participants have their own grants and financing
- Data are freely available
- US funding through the National Science Foundation
With so many potatoes with lots of variation, what should be sequenced?
[Image: Darth Tater]
RH89-039-16 (RH): a diploid heterozygous genotype
- Genetic map (SH x RH), >10,000 markers
- Least heterozygous parent
- Physical map
- Sanger sequencing; BAC-by-BAC strategy
- ~6,000 BACs for full coverage
RHPOTKEY BAC library (78,000 clones; 9-10 genome equivalents)
- Library clones fingerprinted with AFLP
- BAC fingerprints aligned into 6,400 contigs
- 1,600 BAC contigs anchored to RH AFLP map
Approx. 2,000 BACs have been sequenced
- Chromosome 5: ~80%
- Chromosomes 1, 6 & 9: ~30%
- Relatively short tiling paths for some linkage groups (LGs)
- Issues due to heterozygosity
- Slow and uneven progress
WGS using NextGen sequencing?
Initial Strategy: heterozygous clone (RH89-039-16)
- Contig assembly issues
- 2 divergent haplotypes
Revised Strategy (2008 onwards): homozygous genotype (DM1-3 516R44)
- Reduced assembly issues
- 1 haplotype
Doubled monoploid line DM 1-3 516 R44 of adapted Solanum tuberosum Group Phureja (from Richard Veilleux, Virginia Tech, USA)
- Reduced complexity for whole genome shotgun sequencing due to homozygosity
- Taxonomic study (Spooner et al. 2007) suggests it is the same species as S. tuberosum
- Very slow growing, presumably due to increased genetic load: homozygosity exposes inferior alleles to the environment
Whole Genome Shotgun of two genotypes
- RH89-039-16 (RH): diploid heterozygote
- DM1-3 516R44 (DM): diploid homozygote
Illumina short read + Roche WGS
RNA-seq: transcriptome resource
For DM: BAC end and fosmid end sequencing (Sanger) for long-range scaffolding
Genome estimated to be ~850 million bases; assembled size ~730 Mb
QC on assembly suggests it is of high quality
- Compare DM BAC sequences with assembly
- Also use paired-end sequence
Assembly v3 looks good
PGSC Mapping group
- Several partners are mapping the assembly to a new map using different sequence-based marker types: SNP, SSR, DArT
- In silico anchoring using RH WGP, PoMaMo & SGN maps
Target: >90% of assembly anchored to genetic map
What are we interested in annotating?
Genes: where, what, when
- Annotated ~40,000 genes
- Used deep transcriptome sequencing (45 libraries from RH and DM) to annotate genes and determine expression profiles
- In the process of refining the annotation; some made available now
Still in the process of fixing some assembly and annotation issues
Using the potato genome sequence
Access: http://www.potatogenome.net/
Agree to the Data Access Agreement
- BLAST your query sequence against the assembly
- Download the mfasta file of scaffolds
- View genome through the Genome Browser
In Class Exercise
Assembly hierarchy: Reads > Contigs/Scaffolds (PGSC0003DMS) > Super Contigs/Super Scaffolds (PGSC0003DMB)
Intro to PGSC: http://www.potatogenome.net
Link to data: http://potatogenomics.plantbiology.msu.edu
Test sequence: Rubisco, GenBank Accession # J03613.1
- Google "ncbi entrez": http://www.ncbi.nlm.nih.gov/sites/gquery?itool=toolbar
- Download (or copy) as a FASTA-formatted sequence
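As an alternative to copying the test sequence from the web page, the record can be requested in FASTA format from a script. The sketch below uses the NCBI E-utilities `efetch` endpoint, which the slides do not mention. The helper names (`efetch_fasta_url`, `fetch_fasta`) are made up for this example, and the nucleotide database name `nuccore` is an assumption about where this accession lives.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint (not part of the original exercise).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession):
    """Build an efetch URL that asks for a nucleotide record as plain-text FASTA."""
    params = {
        "db": "nuccore",      # assumption: the accession is a nucleotide record
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession):
    """Download the FASTA text for an accession (needs network access)."""
    with urlopen(efetch_fasta_url(accession)) as response:
        return response.read().decode("utf-8")


url = efetch_fasta_url("J03613.1")
print(url)
# fasta_text = fetch_fasta("J03613.1")  # uncomment when online
```

The returned text can be pasted directly into the BLAST search box.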
Let's BLAST this gene against v3 of the DM assembly
- Go to the BLAST page: paste sequence, select blastn
- Get alignment hits (PGSC0003DMS000001195, Length = 311,235); look at the alignment (see gapped alignment):
    Query 1-180      Scaffold 265130-265309
    Query 175-315    Scaffold 265389-265529
    Query 314-546    Scaffold 265611-265843
  *Note there is a paralog present in the DM genome (second best hit)
- Find this scaffold on the Genome Browser. NOTE: THE GENOME BROWSER IS SUPERSCAFFOLD (SUPERCONTIG) BASED.
- Paste PGSC0003DMS000001195 in the Landmark box, hit return
- Zoom out to 1 Mb to get a perspective of this scaffold/contig relative to other scaffolds/contigs. NOTE: THE SCAFFOLDS/CONTIGS CAN BE PLACED IN EITHER ORIENTATION IN THE SUPERSCAFFOLD/SUPERCONTIG.
- Zoom in on PGSC0003DMS000001195; zoom in on 260-270kb region or 331-260 kb region
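The alignment coordinates on this slide can be read as three aligned blocks (query 1-180, 175-315 and 314-546 against scaffold positions 265130-265309, 265389-265529 and 265611-265843). That pairing is an interpretation of the flattened figure, not something the slide states outright. Under that reading, a short sketch can check that each block has the same length on both sequences and measure the unaligned scaffold stretches between blocks. If the query is a transcript, which the slide does not say, these stretches could be introns.

```python
# (query_start, query_end), (scaffold_start, scaffold_end); 1-based, inclusive.
# Pairing of the numbers is an interpretation of the slide's alignment figure.
BLOCKS = [
    ((1, 180), (265130, 265309)),
    ((175, 315), (265389, 265529)),
    ((314, 546), (265611, 265843)),
]


def block_lengths(blocks):
    """Return (query_length, scaffold_length) for every aligned block."""
    return [(q_end - q_start + 1, s_end - s_start + 1)
            for (q_start, q_end), (s_start, s_end) in blocks]


def scaffold_gaps(blocks):
    """Number of unaligned scaffold bases between consecutive blocks."""
    gaps = []
    for (_, (_, prev_end)), (_, (next_start, _)) in zip(blocks, blocks[1:]):
        gaps.append(next_start - prev_end - 1)
    return gaps


print(block_lengths(BLOCKS))
print(scaffold_gaps(BLOCKS))
```

Equal lengths within each block support the pairing, and all three blocks fall inside the 260-270 kb window suggested for zooming in the Genome Browser.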
Using the Potato Genome Browser
- Instructions Panel: Bookmark, Hide Banner, High Resolution image, RESET
- Search: Landmark or Region (use scaffold name)
- Scroll/Zoom: view selection box; move to the left either 50 or 100%; move to the right either 50 or 100%; zoom in/out 10%; flip sequence
- Overview: select region to view using the rubberband
- Tracks: select which tracks to view; Update Image; configure track order, color, etc.
- Display Settings: show tracks, show tooltips, track name
BLAST: a sequence search tool to identify sequences via sequence similarity
Step 1: Go to the PGSC BLAST site at http://potatogenomics.plantbiology.msu.edu/index.php?p=blast
Step 2: Select the type of search that you wish to use. Note that only BLASTN, TBLASTN, and TBLASTX are supported.
Step 3: Paste your favorite sequence into the search box in FASTA format.
Step 4: Select the database you wish to search. The potato genome sequence is "Solanum phureja scaffolds v3". Also provided are databases of BAC and BAC end sequences from S. phureja and S. tuberosum, as well as transcript assemblies (PUTs) of potato from the ISU PlantGDB project (plantgdb.org).
Step 5: Submit your sequence for a BLAST search. An intermediate page will appear telling you that your search is in progress and that the results will be held for 15 minutes via a specific URL. In your BLAST results, the DM sequence is represented as scaffolds. A sample scaffold is listed below:
PGSC0003DMS000000150
- PGSC0003DM: denotes the PGSC version 3 assembly
- S: Scaffold
- 000000150: Unique identifier
Step 6: A link is available that allows you to download your scaffold sequence(s) of interest directly from the BLAST report. In the table of hits, simply click on the scaffold accession you wish to download and you will be presented with the PGSC data access agreement. After you accept the terms of the agreement, the scaffold file will be retrieved and packaged, and you should be prompted by your browser to save the file. Note that the DM 1-3 scaffold sequences are being made available under the terms of the PGSC data access agreement, so you must read and agree to these terms before downloading the full scaffold database.
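The scaffold naming scheme described in Step 5 can be split programmatically, for example when post-processing a table of BLAST hits. The sketch below is illustrative only. The `PgscId` type and `parse_pgsc_id` function are hypothetical names. The nine-digit identifier width is inferred from the two examples in these slides, and the `B` code for superscaffolds is taken from the PGSC0003DMB prefix in the in-class exercise.

```python
import re
from typing import NamedTuple


class PgscId(NamedTuple):
    assembly: str  # e.g. "PGSC0003DM" (PGSC version 3 assembly of DM)
    kind: str      # "scaffold" or "superscaffold"
    number: int    # unique identifier without leading zeros


# Assumption: four version digits, then S or B, then a nine-digit identifier.
_PGSC_PATTERN = re.compile(r"^(PGSC\d{4}DM)([SB])(\d{9})$")
_KINDS = {"S": "scaffold", "B": "superscaffold"}


def parse_pgsc_id(identifier):
    """Split an identifier such as PGSC0003DMS000000150 into its parts."""
    match = _PGSC_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognised PGSC identifier: {identifier!r}")
    assembly, kind_code, digits = match.groups()
    return PgscId(assembly, _KINDS[kind_code], int(digits))


print(parse_pgsc_id("PGSC0003DMS000000150"))
```

Identifiers that do not match the assumed layout raise `ValueError` rather than being silently mis-parsed.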
Download of the potato genome sequence
While the BLAST site will assist in identifying your sequence within the PGSC DM genome assembly, you will need to download the sequence from the PGSC web site to access the scaffold and genome sequence.
Step 1: Go to http://potatogenomics.plantbiology.msu.edu/index.php?p=download
Step 2: To download the PGSC DM scaffolds, select Solanum_phureja.DM.scaffolds-v3.tar.bz2
Step 3: Read the data access agreement and, if you agree, click on "Yes, I agree to these terms".
Step 4: The sequence databases are packaged using the Tar (http://www.gnu.org/software/tar/) archiver, and then compressed using the bzip2 (http://www.bzip.org/) compression software. These programs are generally available on a Linux machine; on a Windows machine, a number of applications are available that should be capable of extracting a bzip2-compressed tar file, including WinZip, WinRAR, and WinAce.
Note: This file will be LARGE (185 Mb) and will take some time to download.
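If none of the listed extraction tools is at hand, the archive from Step 2 can also be unpacked with Python's standard `tarfile` module, which reads bzip2-compressed tar files directly. This is a minimal sketch. The function name `extract_tar_bz2` and the destination directory are chosen for illustration.

```python
import tarfile
from pathlib import Path


def extract_tar_bz2(archive_path, dest_dir):
    """Extract a bzip2-compressed tar archive and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        names = archive.getnames()
        # Use the safer "data" extraction filter where this Python version supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest, **kwargs)
    return names


# Example call; the archive name is the one given in Step 2:
# extract_tar_bz2("Solanum_phureja.DM.scaffolds-v3.tar.bz2", "pgsc_dm_v3")
```

The returned list of member names gives a quick check that the expected files were present in the archive.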
Step 5: In the uncompressed file will be:
- README: A description of the DM scaffolds
- Data_access: Statement of data access agreement
- PGSC0003DMS.fa: Multi-fasta file of the scaffolds
Step 6: How to retrieve specific sequences from the multi-fasta file. You can use any text editor that is capable of opening large files and doing a text search, for example 'vim' in Linux (http://blog.interlinked.org/tutorials/vim_tutorial.html) or 'vim' in Windows (http://www.vim.org/download.php#pc), 'textedit' on a Mac, or 'wordpad' on Windows. Better still, there are a number of utilities available for retrieving individual records from a fasta sequence database. The NCBI BLAST package has a utility called 'fastacmd' that serves this purpose; the equivalent utility in the WUBLAST package is called 'xdget'. Other tools are available with packages such as EMBOSS or exonerate that will also allow you to index and fetch sequences from a fasta database.
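Where none of the utilities named in Step 6 is installed, a single record can be pulled out of the multi-fasta file with a short script. The sketch below is not a replacement for an indexed tool such as fastacmd. It scans the file line by line, so memory use stays low even for a large file, but each lookup reads from the start. The function name `fetch_record` is made up, and the sketch assumes the scaffold identifier is the first whitespace-separated word on the header line.

```python
def fetch_record(fasta_path, wanted_id):
    """Return (header, sequence) for the first record whose ID equals wanted_id."""
    header = None
    sequence_lines = []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    break  # reached the next record: the wanted one is complete
                fields = line[1:].split()
                if fields and fields[0] == wanted_id:
                    header = line
            elif header is not None:
                sequence_lines.append(line.strip())
    if header is None:
        raise KeyError(f"{wanted_id} not found in {fasta_path}")
    return header, "".join(sequence_lines)


# Example call; file name as listed in Step 5:
# header, seq = fetch_record("PGSC0003DMS.fa", "PGSC0003DMS000001195")
# region = seq[260_000:270_000]  # the 260-270 kb window (0-based slice)
```

For repeated lookups, an indexed tool from one of the packages named in Step 6 will be much faster.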