The parrot genome: using 454 Flx+ sequencing to identify regulatory traits of vocal learning

The parrot genome: using 454 Flx+ sequencing to identify regulatory traits of vocal learning Erich D. Jarvis Howard Hughes Medical Institute Investigator Duke University Medical Center Department of Neurobiology China Roche 454 Meetings September 2011

Motivation: Deciphering the genetic basis of convergent complex traits. Challenges: De-novo genome sequencing and assembly of species with and without the traits of interest. Proper genome assembly and tools for interrogating the genomes.

VOCAL LEARNING (production learning)
5 groups of mammals: humans, cetaceans, bats, elephants, sea lions
3 groups of birds: parrots, hummingbirds, songbirds
Different from auditory learning (comprehension and usage learning).
Auditory learning: dogs can understand the sounds sit (English), siéntese (Spanish), osuwari (Japanese).
Vocal learning: dogs cannot learn to say these sounds, but vocal learners can.

Convergent behavior: vocal learning, substrate for speech
[Figure: avian family tree (Hackett et al 2008 tree); * = vocal learners; label: only humans.]
Depends on auditory feedback, vocal critical periods, cultural transmission, syntax.
Deaf-induced vocal disorders, aphasias, speech sound disorder, possibly autism.

African Grey Parrot - training to count (concept of one) Pepperberg/Alex

Song & speech systems in birds and humans Jarvis 2004 Ann NY Acad Sci; Jarvis et al 2005 Nature Rev. Neurosci.

Behaviorally regulated egr1 expression in parrot brain Feenders et al 2008 PLoS ONE

Convergent evolution of vocal learning pathways
Three alternative hypotheses:
- Multiple independent gains
- Multiple independent losses from common ancestor
- Everyone to varying degrees
[Figure labels: vocal learning pathways; vocal production pathway; auditory learning.]
Modified from: Jarvis et al Nature 2000

Vocal learning brain pathways in birds & humans Jarvis et al Nature 2000; Jarvis 2004 Ann NY Acad Sci

FoxP2 - language associated gene
Turned on at high levels before vocal imitation starts and turned down to low levels after vocal learning is complete.
[Figure: FoxP2 in finch brain; x-axis: days old (0, 30, 60, 90, 120); timeline labels: hatch, juvenile song, adult, tutor song, learning complete.]
Haesler, Wada, Nshdejan, Morrisey, Lints, Jarvis, Scharff. 2004 J. Neurosci.

RNAi knockdown of FoxP2 in songbirds Haesler et al 2007 PLoS Biology.

Dusp1 gene shows specialized regulation in song nuclei (Immediate early gene involved in neuroprotection) Egr1 Dusp1 Haruhito Horita (graduate student) Graduating 2011 Horita et al (submitted)

Dusp1 shows convergent specialized regulation in song nuclei
[Figure: silent vs singing animals in songbird, hummingbird and parrot.]
Horita et al (submitted)

Motivation: Deciphering the genetic basis of convergent complex traits. Challenges: De-novo genome sequencing and assembly of species with and without the traits of interest. Proper genome assembly and tools for interrogating the genomes.

Simulated projection: sequence & assembly of avian genomes
[Figure: contig assembly; x-axis: sequencing data (0.4 Gbp per 454 Titanium run, 0.4 to 8.0 Gbp); left y-axis: number of contigs (0 to 300,000); right y-axis: genome representation (0 to 100%); annotations: add PES, map to.]
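The x-axis of this projection is sequencing yield, which maps onto fold coverage through the usual relation coverage = total sequenced bases / genome size. The following is a minimal sketch of that arithmetic, not the simulation behind the slide. The 0.4 Gbp per Titanium run comes from the axis label; the 1.2 Gb genome size is borrowed from the assembly statistics later in the deck; the function names and the run counts in the usage note are illustrative assumptions.

```python
import math

# Assumptions for illustration only:
#   0.4 Gbp per 454 Titanium run (from the slide's axis label)
#   1.2 Gb genome size (from the later assembly statistics slides)
BASES_PER_RUN = 0.4e9
GENOME_SIZE = 1.2e9


def fold_coverage(n_runs, bases_per_run=BASES_PER_RUN, genome_size=GENOME_SIZE):
    """Average fold coverage = total sequenced bases / genome size."""
    return n_runs * bases_per_run / genome_size


def runs_for_coverage(target, bases_per_run=BASES_PER_RUN, genome_size=GENOME_SIZE):
    """Smallest whole number of runs that reaches the target fold coverage."""
    return math.ceil(target * genome_size / bases_per_run)


if __name__ == "__main__":
    for runs in (10, 25):
        print(f"{runs} runs -> {fold_coverage(runs):.2f}X")
    print("runs needed for 8X:", runs_for_coverage(8))
```

Under these assumed numbers, 10 runs give roughly 3.3X and 25 runs roughly 8.3X; real per-run yield varies, so this is only a planning estimate.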

No matter how much sequencing, could not get full coverage on some genes. Why?
Map budgie sequences from GS 454 runs to three homologous zebra finch genes.

Identity cutoff: 90% for 40 bp; 10 GS 454 runs
Gene    Gene +/-5Kb   Coding region length   Exon coverage   5Kb upstream coverage   5Kb downstream coverage
FoxP2   409,706       2,136 bp               97.05%          10.16%                  72.09%
ROBO1   384,230       4,243 bp               91.52%          15.20%                  32.83%
egr1    12,949        1,533 bp               81.28%          5.98%                   1.25%

Identity cutoff: 90% for 40 bp; 25 GS 454 runs (all libraries except 8Kb)
Gene    Gene +/-5Kb   Coding region length   Exon coverage
FoxP2   409,706       2,136 bp               99.00%
ROBO1   384,230       4,243 bp               91.00%
egr1    12,949        1,533 bp               89.60%
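The coverage percentages above come from mapping reads to reference exons under an identity cutoff of 90% over 40 bp. The sketch below shows one way such a percentage could be computed once alignments are available; it is an assumption about the bookkeeping, not the pipeline used for the slide. Hits are hypothetical `(start, end, identity)` tuples in half-open reference coordinates, and all coordinates in the usage example are invented.

```python
def passes_cutoff(hit, min_identity=0.90, min_length=40):
    """Keep a hit only if it spans >= min_length bp at >= min_identity."""
    start, end, identity = hit
    return (end - start) >= min_length and identity >= min_identity


def exon_coverage(exons, hits, min_identity=0.90, min_length=40):
    """Fraction of exon bases covered by at least one passing hit.

    exons: list of (start, end) half-open intervals, non-overlapping.
    hits:  list of (start, end, identity) half-open intervals.
    """
    kept = [h for h in hits if passes_cutoff(h, min_identity, min_length)]
    covered = 0
    total = 0
    for ex_start, ex_end in exons:
        total += ex_end - ex_start
        # Clip each overlapping hit to the exon, then sweep left to right
        # so that overlapping hits are not counted twice.
        clipped = sorted(
            (max(s, ex_start), min(e, ex_end))
            for s, e, _ in kept
            if s < ex_end and e > ex_start
        )
        pos = ex_start
        for s, e in clipped:
            s = max(s, pos)
            if e > s:
                covered += e - s
                pos = e
    return covered / total if total else 0.0


if __name__ == "__main__":
    # Invented toy data: two exons, four hits (two fail the cutoff).
    exons = [(100, 200), (300, 400)]
    hits = [(90, 150, 0.95), (140, 180, 0.92), (300, 330, 0.99), (350, 420, 0.85)]
    print(f"exon coverage: {exon_coverage(exons, hits):.2%}")
```

In the toy data, the 30 bp hit fails the length cutoff and the 85% hit fails the identity cutoff, so only the first exon is partly covered.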

Sequencing runs used for assemblies
454 reactions (14X coverage):
- Titanium shotgun library; 15 runs total (mode ~469bp)
- 4 x 3 kb Flx paired-end libraries; 5 runs total (~200 bp/end)
- 8 x 8 kb Flx paired-end libraries; 3 runs total (~200 bp/end)
- 4 x 20 kb Flx paired-end libraries; 5 runs total (~200 bp/end)
- Flx+ shotgun library; 4 runs total (mode ~760bp)
Illumina reactions (8X coverage):
- 200bp Illumina paired-end; 2 runs (~75bp/end)
- 200bp Tufts-Illumina paired-end; 2 runs (~75bp/end)

Read Length of Titanium runs Average read length ~350 bp and mode ~469 bp

Read Length of Flx+ runs Average read length 674 bp and mode ~768 bp Inferred error rate under 1.7%

Compared assemblies from 3 different types of sequences with 2 assemblers
Reads:
1. 454 short read only (200bp paired end; 400 bp shotgun)
2. 454 short + long read (200bp paired end; 400 + 800 bp shotgun)
3. 454 short + long read, + Illumina reads (75bp paired end)
Assemblers:
1. Celera Assembler (CABOG; Adam Phillippy at Univ MD)
2. Newbler Assembler (Roger Winer, James Knight et al at Roche 454; Wes Warren at Wash U)

Comparative assembly statistics
In a hybrid assembly, Illumina paired-end reads cause scaffold breakdown because of contaminating mate pairs.

Assembler                 Parrot-Celera    Parrot-Celera
Sequence method           454 short        454+Illum paired
Coverage                  8X               14X
Genome size               1.2Gb            1.2Gb
[Scaffolds]
TotalBasesInScaffolds     1,022,398,844    1,032,788,935
# of Scaffolds            9,586            10,813
AvgScaffoldSize           106,655          98,174
N50ScaffoldSize           9,471,817        1,689,431
LargestScaffoldSize       55,691,819       7,090,199
Total gaps in scaffolds   131,248          99,828
[Contigs]
# of Contigs              170,049          110,641
AvgContigSize             6,012            9,335
N50ContigSize             10,005           18,667
LargestContigSize         150,395          228,978

Comparative assembly statistics
Repair of breakdown; 454 long reads enhance assembly statistics; as good as Sanger method

Assembler               Parrot-Celera      Parrot-Celera      Parrot-Celera      Parrot-Newbler     Parrot-Newbler     Parrot-Newbler     Het Z. Finch-PCAP  Chicken-PCAP
Sequence method         454 short          454 long           454 long + illum   454 short          454 long           454 long + illum   Sanger             Sanger v2.1
Coverage                8X                 14X                14X                8X                 11X                13X                6X                 7.1X
Genome size             1.2Gb              1.2Gb              1.2Gb              1.2Gb              1.2Gb              1.2Gb              1.2Gb              1.05Gb
[Scaffolds]
TotalBasesInScaffolds   1,022,398,844      1,079,493,948      1,086,605,544      1,232,754,888      1,179,562,588      1,128,262,411      1,224,525,252      1,047,124,295
# of Scaffolds          9,586              20,685             25,212             37,024             21,081             10,926             37,698             23,776
AvgScaffoldSize         106,655            52,187             43,099             33,296             55,953             103,263            32,482             44,041
N50ScaffoldSize         9,471,817          12,449,215         11,201,952         4,019,469          7,285,721          6,386,522          10,409,499         11,125,310
LargestScaffoldSize     55,691,819         49,398,065         39,879,305         18,557,224         39,887,084         35,673,135         56,620,707         51,053,708
[Contigs]
# of Contigs            170,049            75,549             70,863             224,563            222,786            71,760             126,053            85,191
AvgContigSize           6,012              14,289             15,334             4,627              4,821              14,368             9,714              12,291
N50ContigSize           10,005             41,251             55,633             8,622              14,413             27,014             38,549             45,280
LargestContigSize       150,395            405,483            465,633            224,563            222,786            359,884            424,635            624,663

Total gaps in scaffolds (five values only; columns not recoverable from the transcription): 160,463; 54,864; 45,651; 60,834; 124,736
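N50ScaffoldSize and N50ContigSize are the headline numbers in these comparisons. For readers unfamiliar with the statistic, here is a minimal sketch using the common definition: the length L such that sequences of length L or longer contain at least half of the total assembled bases. The slides do not say how each assembler computes its own N50, so treat this as the textbook definition rather than either tool's exact implementation; the lengths in the usage example are invented.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sort lengths from longest to shortest and accumulate them; the N50 is
    the length at which the running sum first reaches half of the total.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Invented toy assembly: total = 54, half = 27, reached at 10 + 9 + 8.
    toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print("N50:", n50(toy_contigs))
```

Because N50 weights by bases rather than by sequence count, a few very long scaffolds can raise it sharply even while the number of scaffolds grows, which is the pattern seen between the 454 short and 454 long Celera columns.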

Mummer plot of synteny between zebra finch and budgie draft assemblies: a snapshot of Chr 4
Flx PE, 454 short reads: 100s of scaffolds
Flx PE, 454 short + long reads: one ~39.9 Mb scaffold
Zebra finch Chr 4 [25 Mb-65 Mb] = 40 Mb

Mummer plot of synteny between zebra finch and budgie draft assemblies: a snapshot of Chr 1
Flx PE, 454 short reads: 6 scaffolds
Flx PE, 454 short + long reads: one ~18 Mb scaffold
Zebra finch Chr 1, 18 Mb region

Assembly of equivalent 400 (Titanium) and 760 (Flx+) bp sequence

Assembly Metrics        Titanium Reads, FLX PE    FLX+, Titanium, FLX PE    % change with FLX+ runs
Sequence Depth          6                         6                         -
estimatedgenomesize     1405.7 MB                 1409.2 MB                 -
numalignedreads         30150439, 94.48%          26736754, 94.53%          -
numalignedbases         8018686780, 95.20%        8019891335, 94.82%        -
numberassembled         29089915                  25734082                  -
numberpartial           1057907                   999888                    -5.48
numbersingleton         839565                    721011                    -14.12
numberrepeat            628090                    562055                    -10.51
numberoutlier           297929                    267573                    -10.19
numberwithbothmapped    7177712                   7242926                   0.91

Scaffold Metrics
numberofscaffolds       54428                     53581                     -1.56
numberofbases           1225236944                1241702153                1.34
avgscaffoldsize         22511                     23174                     2.95
N50ScaffoldSize         1943393                   2463264                   26.75
largestscaffoldsize     13998251                  15593718                  11.40

LargeContigMetrics
numberofcontigs         418038                    302341                    -27.68
numberofbases           969330616                 993764293                 2.52
avgcontigsize           2318                      3286                      41.76
N50ContigSize           3252                      5214                      60.33
largestcontigsize       39159                     57462                     46.74
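The "% change with FLX+ runs" column appears to be the ordinary relative change, (after - before) / before x 100, with the Titanium-only assembly as the baseline. The short sketch below recomputes a few entries from the two data columns; the function name is illustrative and the formula is an inference from the numbers, not something stated on the slide.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # (metric, Titanium + FLX PE, FLX+ + Titanium + FLX PE), copied from the table
    rows = [
        ("numberofcontigs", 418038, 302341),
        ("N50ContigSize", 3252, 5214),
        ("N50ScaffoldSize", 1943393, 2463264),
    ]
    for name, before, after in rows:
        print(f"{name}: {percent_change(before, after):+.2f}%")
```

At equal sequence depth, the recomputed values agree with the slide: fewer contigs and larger contig and scaffold N50 once the longer reads are included.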

Assembly completeness of 3392 highly homologous exons
[Figure: completeness in contigs and scaffolds for three data sets: 454 Flx+ & Illumina; 454 Flx+; 454 Titanium.]
Used CABOG Celera assembler with different read lengths and technologies. Cont = contigs; Scaff = scaffolds

Assembly of genes of interest
Single vs multi-exon genes:
- Egr1: 2-exon gene, with highly GC-rich exon 1
- FoxP2: 16-exon gene, with one GC-rich exon
- Dusp1: gene with repetitive regulatory region
Other genes? Use zebra finch exons that are >87% identical between finch and chicken to find parrot exons in the assemblies and reads.

Single exon genes: dusp14
[Figure: dusp14 in six assemblies: Nb-454 short, Nb-454 long, Nb-hybrid, CA-454 short, CA-454 long, CA-hybrid.]
Nearly all high complexity single exon genes (40-60% GC) examined thus far have full coverage (97-100%) in all assemblies.
Nb = Newbler; CA = Celera; 454 short = Titanium; 454 long = Flx+; hybrid = 454 short+long+Illumina

Multi-exon genes: GluR2 assembly
[Figure: GluR2 in six assemblies: Nb-454 short, Nb-454 long, Nb-hybrid, CA-454 short, CA-454 long, CA-hybrid.]
BUT: Many high complexity multi-exon genes (40-60% GC) are split across multiple scaffolds with 454 short reads using Newbler, but are assembled on one scaffold using longer reads or Celera.

GC rich exons: FoxP2, language evolution
[Figure: FoxP2 in six assemblies: Nb-454 short, Nb-454 long, Nb-hybrid, CA-454 short, CA-454 long, CA-hybrid.]
GC rich exons (>70%) have poorer assembly. Some algorithms can still handle them.

GC rich exons: Dusp6, behaviorally regulated gene
[Figure: Dusp6 in six assemblies: Nb-454, Nb-454 long, Nb-hybrid, CA-454, CA-454 long, CA-hybrid.]
Exon 1 is missing from some assemblies of the dusp6 gene. What happened?

Dusp6 reads Sufficient exon 1 reads & overlaps for assembly

GC rich exons: Dusp6 assembly
[Figure: Dusp6 in six assemblies: Nb-454, Nb-454 long, Nb-hybrid, CA-454, CA-454 long, CA-hybrid.]
Conclusions:
- Newbler: GC exons (60-70%) were not brought into the scaffold with 454 reads (they remain in contigs) because they were part of alternative paths. The 454+Illumina hybrid resolved the assembly.
- Celera: GC exons (60-70%) in 454 short (400bp) reads were placed in the degenerate file and not assembled; with long reads (760bp) the sequence is no longer labeled degenerate and is thus assembled.

GC rich exons: Egr1, behaviorally regulated gene
[Figure: Egr1 in six assemblies: Nb-454 short, Nb-454 long, Nb-hybrid, CA-454 short, CA-454 long, CA-hybrid.]
Exon 1 is missing from all assemblies of the egr1 gene. What happened?

GC rich exons: Egr1 shotgun reads. No reads of exon 1 in the shotgun data. GC rich exon (80%)

GC rich exons: Egr1 paired-end reads. Very few reads of exon 1 in the paired-end data. GC rich exon (88%)

GC rich promoter and exon Egr1 gene assembly Part of promoter and exon 1 missing in all assemblies

Even the Sanger method misses GC rich regions: Egr1 assembly
[Figure panels: zebra finch genome; chicken genome; parrot genome.]
All species are missing the GC rich promoter region (75-90%).

~1,200 bp regulatory region of various microsatellite repeats In dusp1 regulatory region GGGATAACAGCACAGCCCTTAAACCCCCCTGGGGTAACAGGACAGCCCTTAAACCCCCCTGGGGTAACTGAGA ACAACCCTTAAACCCCCCTGGGGTAACAGCACAGCTCTTAAACCCCGAATTCTGAATCCACCCTGGCCCCATG GAGCATACACAGAGTGTGTGTGTGAATATGTGATTTTCTGTGTGAATATGTGATTTTGTGTGAATATGTGATT TTGTGTGCGAATATGTGATTCTGTGTGTGAATATGTGATTCTGTGTGTGAATATGTCATTTTCTGTGTGAATA TGTGATTTTGTGTGAATGTGTGATTTTCTGTGTGAATATGTGATAATATGTGATTTTGTGTGTGAATATGTGA TTCTATGTGAATATGTGATTGATTTTCTGTGTGAATATGTGATTTTGTGTGAATGTGTGATTTTTGTGTGAAT ATGTGATTTTCTGTGTGAATATGTGATTTTCTGTGTGAATATGTGATTTTTCAGAAAGTCGCAGGGTGGTTTG GCTCACACTCGCACTCACACTCTCACACACTCACACTCTCTCACTCTCACTCACACTCACACTCACACTCTCA CACTCTCTCACACTCTCTCACACTCTCACACTCTCTCACACACACACTCATACACTCCCACTCACACATACTC TCACACTCACACACTCTCACACTCTCACACTCTAACACACTCACACACTCACACACTCACACTCACACTCATA CTCACACACTCACACACTCACACTCACACTCTAACACACTCACACACTCACACTCACACTCACTTTTTCTCTT TTCTCACTTTTTCTCTCTCCCTCTCCCGCGCTCCGCGGCCGCCCCGCTCCCGATGACGTCGCACCGGCGGGGC GGGCCGCGCCCTCGCTGGCGCGCGGCCAGGCTGACGTCATCGGCCGCCCCGCCCCCCCACGTGACGCGGCCC ATTGAGAAAACGCCGTCCCGCCGCGCGGCCCCATATAAGGGCGGGAGCGGCGGGGCACCGGGACAGCCGGGCC ACCGCACCTCTGAGCTCTGCCCTGCCCTCCTTCCCTCCCCACAGCCATCCCCGCGCTGCCCGGCCATGGTGAA CCTGCGGGTGTGCGCGCTGGACTGCGAGGCGCTGCGGGCGCTGCTGCAGGAGCGCGGCGCGCAGTGCCTCGTC CTCGACTGCCGCTCCTTCTTCTCCTTCAA Horita et al (submitted)
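Since several slides attribute missing exons and promoters to GC content, it may help to show how GC percentage is measured along a sequence. The sketch below computes overall and sliding-window GC fraction. The two fragments are copied from the dusp1 regulatory sequence above (one from the GC-rich stretch, one from the microsatellite repeats); the window size, step, and function names are arbitrary illustrative choices, not parameters from the talk.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def sliding_gc(seq, window=20, step=10):
    """GC fraction in successive windows, as (start, gc) pairs."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Fragments copied from the dusp1 regulatory sequence on the slide.
GC_RICH_FRAGMENT = "CCGCGCTCCGCGGCCGCCCCGCTCCCGATGACGTCGCACCGGCGGGGC"
REPEAT_FRAGMENT = "GAATATGTGATTTTCTGTGTGAATATGTGATTTTG"

if __name__ == "__main__":
    print(f"GC-rich fragment: {gc_fraction(GC_RICH_FRAGMENT):.1%}")
    print(f"repeat fragment:  {gc_fraction(REPEAT_FRAGMENT):.1%}")
    for start, gc in sliding_gc(GC_RICH_FRAGMENT):
        print(f"window at {start}: {gc:.0%}")
```

The first fragment falls well above the 70% level that the slides associate with poor assembly, while the microsatellite fragment is AT-rich, so the two halves of this regulatory region pose different problems: coverage bias in one, repeat resolution in the other.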

Dusp1 convergent promoter changes in vocal learners Vocal learners Vocal non-learners Horita et al (submitted)

Repetitive microsatellite assembly in dusp1 promoter
[Figure: region upstream of the ATG in six assemblies: Nb-454, Nb-454 long, Nb-hybrid, CA-454, CA-454 long, CA-hybrid.]
Conclusions: Only the long reads (~760bp) allowed full and correct assembly of the repetitive microsatellite sequence in the parrot dusp1 promoter.

Genome 10000 (G10K) consortium: Assemblathon 2 competition - parrot
Three technologies:
- 454: short (200bp) & long (750 bp) read lengths, shotgun and paired end with 3, 8, 20 Kb insert sizes, 16X coverage (Roche and Duke)
- Illumina HiSeq: 100 bp paired-end/mate-pair reads with 0.2, 0.5, 0.8, 5, 10, 20 and 40 Kb insert sizes, TruSeq v3 GC chemistry, 120X coverage (BGI & Illumina)
- PacBio: reads of ~3000 bp average length but 15% error, 7 and 10 Kb insert sizes, 5X coverage (PacBio)

Genome 10000 (G10K) consortium: Assemblathon 2 competition - parrot
Three technologies: 454 long Flx+; Illumina HiSeq; PacBio long
25 assembly groups:
- Overlap-Layout-Consensus (e.g. Celera CABOG, PCAP, Newbler, etc.)
- Eulerian de Bruijn graphs (e.g. ALLPaths, SoapDenovo, Velvet, etc.)
- Hybrid inventions
Two validation methods:
- Optical maps (contig and scaffold accuracy)
- 40K pooled (10) fosmid and single molecule clones sequenced (bp accuracy)
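For readers new to the second assembler family named above, the following toy sketch shows the data structure a de Bruijn approach starts from: each k-mer in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is a teaching illustration only. It ignores sequencing errors, reverse complements, and graph simplification, and it does not represent how ALLPaths, SoapDenovo, or Velvet are implemented; the function name and example reads are invented.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer in a read contributes an edge from
    its prefix (first k-1 bases) to its suffix (last k-1 bases).
    Returns a dict mapping each prefix to the set of its successor nodes.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Two invented overlapping reads; their shared k-mer merges in the graph.
    graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
```

A node with more than one successor marks a branch, which is where repeats such as the dusp1 microsatellites create ambiguity; overlap-layout-consensus assemblers instead compare whole reads, which is one reason read length matters so much for them.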

Challenges for the future for Flx+
Limitations:
- Cost vs assembly bp accuracy vs assembly completeness
- Algorithms for hybrid assemblies
- Overcoming bias against GC rich regions
[Figure: theoretical predictions to generate high quality assembly; x-axis: read length (1 to 1500); y-axis: bp coverage (5X to 100X); cost annotations: $ low, $ high.]

Challenges for complete genome assembly Theoretical predictions to generate high quality assembly Close to theory on Dog genome long reads; Less than theory on Panda short reads Schatz et al 2010 Genome Research

Acknowledgements
Jarvis Lab: Jason Howard, James Ward (now at NIEHS), Ganesh Ganapathy, Haruhito Horita
Roche 454 sequencing: Duke Genome Center (Lisa Bukovnik, Ty Wang, Olivier Fedrigo); Roche support team (Xuemin Liu, Chinnappa Kodira)
Illumina sequencing: Tin Le (Illumina UK), Guojie Zhang (BGI), Yingrui Li (BGI)
PacBio sequencing: Eric Schadt, Edwin Hawe, Lawrence Lee
Assembly: Adam Phillippy (CABOG; Univ Maryland), Sergey Koren (CABOG; Univ Maryland), Wes Warren (Newbler; Wash Univ), James Knight (Newbler; Roche 454), Roger Winer (Newbler; Roche 454), Bo Li (SoapDenovo; BGI)
Optical maps: David Schwartz, Shiguo Zhou
Fosmids: Jay Shendure
Funding: NIH Director's Pioneer Award, Howard Hughes Medical Institute

Previous students and Post Docs now with own labs Dr. Lubica Kubikova Dr. Raphael Pinaud Dr. V. Ann Smith Dr. Liisa Tremere Dr. Kazuhiro Wada Dr. Jing Yu Rui Wang Dr. Osceola Whitney Jason Howard Haru Horita Jarvis lab Maurice Anderson Eric Zhou Michael Silva Gustavo Arriaga Dr. Petra Roulhac Gurkan Yardimchi Andreas Pfenning Dr. Erich Tony Jarvis Zimmermann Theresa Renuart Dr. Miriam Rivas Dr. Chun-Chun Chen Alisa Ray Erina Hara Not present: Nicole Nelson Alyssa Zhu