The parrot genome: using 454 Flx+ sequencing to identify regulatory traits of vocal learning
Erich D. Jarvis, Howard Hughes Medical Institute Investigator
Duke University Medical Center, Department of Neurobiology
China Roche 454 Meetings, September 2011
Motivation: Deciphering the genetic basis of convergent complex traits. Challenges: De-novo genome sequencing and assembly of species with and without the traits of interest. Proper genome assembly and tools for interrogating the genomes.
VOCAL LEARNING (production learning)
5 groups of mammals: humans, cetaceans, bats, elephants, sea lions
3 groups of birds: parrots, hummingbirds, songbirds
Different from auditory learning (comprehension and usage learning).
Auditory learning: Dogs can understand the sounds "sit" (English), "siéntese" (Spanish), "osuwari" (Japanese).
Vocal learning: Dogs cannot learn to say these sounds, but vocal learners can.
Convergent behavior: vocal learning, substrate for speech
[Figure: avian family tree (Hackett et al 2008 tree); * = vocal learners; only humans]
Depends on auditory feedback, vocal critical periods, cultural transmission, syntax.
Deaf-induced vocal disorders, aphasias, speech sound disorder, possibly autism.
African Grey Parrot - training to count (concept of one) Pepperberg/Alex
Song & speech systems in birds and humans Jarvis 2004 Ann NY Acad Sci; Jarvis et al 2005 Nature Rev. Neurosci.
Behaviorally regulated egr1 expression in parrot brain Feenders et al 2008 PLoS ONE
Convergent evolution of vocal learning pathways
Three alternative hypotheses:
- Multiple independent gains
- Multiple independent losses from a common ancestor
- Everyone, to varying degrees
[Figure labels: vocal learning pathways; vocal production pathway; auditory learning]
Modified from: Jarvis et al Nature 2000
Vocal learning brain pathways in birds & humans Jarvis et al Nature 2000; Jarvis 2004 Ann NY Acad Sci
FoxP2 - language-associated gene
Turned on at high levels before vocal imitation starts and turned down to low levels after vocal learning is complete.
[Figure: FoxP2 in finch brain versus age in days (0, 30, 60, 90, 120); timeline labels: hatch; juvenile song; adult; tutor song; learning complete]
Haesler, Wada, Nshdejan, Morrisey, Lints, Jarvis, Scharff. 2004 J. Neurosci.
RNAi knockdown of FoxP2 in songbirds Haesler et al 2007 PLoS Biology.
Dusp1 gene shows specialized regulation in song nuclei (immediate early gene involved in neuroprotection)
[Figure panels: Egr1; Dusp1]
Haruhito Horita (graduate student), graduating 2011
Horita et al (submitted)
Dusp1 shows convergent specialized regulation in song nuclei
[Figure panels: silent vs. singing; songbird, hummingbird, parrot]
Horita et al (submitted)
Motivation: Deciphering the genetic basis of convergent complex traits. Challenges: De-novo genome sequencing and assembly of species with and without the traits of interest. Proper genome assembly and tools for interrogating the genomes.
Simulated Projection: Sequence & Assembly of Avian Genomes
[Figure: number of contigs in the contig assembly (0 to 300,000) and genome representation (0 to 100%) versus sequencing data (0.4 to 8.0, in units of 0.4 Gbp per 454 Titanium run); annotation: add PES map; marked points * 3000 and * 1000]
No matter how much sequencing, could not get full coverage on some genes. Why?
Map budgie sequences from GS 454 runs to three homologous zebra finch genes.

Identity cutoff: 90% for 40 bp; 10 GS 454 runs
Gene     Gene +/-5Kb (bp)   Coding region length   Exon coverage   5Kb upstream coverage   5Kb downstream coverage
FoxP2    409,706            2,136 bp               97.05%          10.16%                  72.09%
ROBO1    384,230            4,243 bp               91.52%          15.20%                  32.83%
egr1     12,949             1,533 bp               81.28%          5.98%                   1.25%

Identity cutoff: 90% for 40 bp; 25 GS 454 runs (all libraries except 8Kb)
Gene     Gene +/-5Kb (bp)   Coding region length   Exon coverage
FoxP2    409,706            2,136 bp               99.00%
ROBO1    384,230            4,243 bp               91.00%
egr1     12,949             1,533 bp               89.60%
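The exon coverage percentages can be read as the fraction of exon bases overlapped by at least one mapped read. A minimal Python sketch of that calculation, assuming the read alignments have already been filtered by the identity cutoff and reduced to half-open (start, end) intervals; the function name and the coordinates in the usage note are hypothetical and not taken from the original analysis:

```python
def covered_fraction(region, alignments):
    """Fraction of a half-open region [start, end) covered by at least one alignment."""
    r_start, r_end = region
    if r_end <= r_start:
        raise ValueError("region must have positive length")
    # Clip each alignment to the region and drop alignments that fall outside it.
    clipped = sorted(
        (max(s, r_start), min(e, r_end))
        for s, e in alignments
        if e > r_start and s < r_end
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # A new, non-overlapping block starts: bank the previous block.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping or adjacent block: extend the current block.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (r_end - r_start)
```

For a hypothetical 100 bp exon at (0, 100) with alignments (0, 50), (40, 60) and (90, 120), the covered bases are 0-60 and 90-100, so the function returns 0.7.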
Sequencing runs used for assemblies
454 reactions (14X coverage)
- Titanium shotgun library; 15 runs total (mode ~469 bp)
- 4 x 3 kb FLX paired-end libraries; 5 runs total (~200 bp/end)
- 8 x 8 kb FLX paired-end libraries; 3 runs total (~200 bp/end)
- 4 x 20 kb FLX paired-end libraries; 5 runs total (~200 bp/end)
- FLX+ shotgun library; 4 runs total (mode ~760 bp)
Illumina reactions (8X coverage)
- 200 bp Illumina paired-end; 2 runs (~75 bp/end)
- 200 bp Tufts-Illumina paired-end; 2 runs (~75 bp/end)
Read Length of Titanium runs Average read length ~350 bp and mode ~469 bp
Read Length of Flx+ runs Average read length 674 bp and mode ~768 bp Inferred error rate under 1.7%
Compared assemblies from 3 different types of sequences with 2 assemblers
Reads:
1. 454 short reads only (200 bp paired-end; 400 bp shotgun)
2. 454 short + long reads (200 bp paired-end; 400 + 800 bp shotgun)
3. 454 short + long reads, + Illumina reads (75 bp paired-end)
Assemblers:
1. Celera Assembler (CABOG; Adam Phillippy at Univ MD)
2. Newbler Assembler (Roger Winer, James Knight et al at Roche 454; Wes Warren at Wash U)
Comparative assembly statistics
In a hybrid assembly, Illumina paired-end reads cause scaffold breakdown because of contaminating mate pairs.

Assembler                 Parrot-Celera    Parrot-Celera
Sequence method           454 short        454 + Illum paired
Coverage                  8X               14X
Genome size               1.2Gb            1.2Gb
[Scaffolds]
TotalBasesInScaffolds     1,022,398,844    1,032,788,935
# of Scaffolds            9,586            10,813
AvgScaffoldSize           106,655          98,174
N50ScaffoldSize           9,471,817        1,689,431
LargestScaffoldSize       55,691,819       7,090,199
Total gaps in scaffolds   131,248          99,828
[Contigs]
# of Contigs              170,049          110,641
AvgContigSize             6,012            9,335
N50ContigSize             10,005           18,667
LargestContigSize         150,395          228,978
Comparative assembly statistics
Repair of breakdown; 454 long reads enhance assembly statistics; as good as the Sanger method.

Assembler               Parrot-Celera      Parrot-Celera      Parrot-Celera      Parrot-Newbler     Parrot-Newbler     Parrot-Newbler     Het Z. Finch-PCAP  Chicken-PCAP
Sequence method         454 short          454 long           454 long + illum   454 short          454 long           454 long + illum   Sanger             Sanger v2.1
Coverage                8X                 14X                14X                8X                 11X                13X                6X                 7.1X
Genome size             1.2Gb              1.2Gb              1.2Gb              1.2Gb              1.2Gb              1.2Gb              1.2Gb              1.05Gb
[Scaffolds]
TotalBasesInScaffolds   1,022,398,844      1,079,493,948      1,086,605,544      1,232,754,888      1,179,562,588      1,128,262,411      1,224,525,252      1,047,124,295
# of Scaffolds          9,586              20,685             25,212             37,024             21,081             10,926             37,698             23,776
AvgScaffoldSize         106,655            52,187             43,099             33,296             55,953             103,263            32,482             44,041
N50ScaffoldSize         9,471,817          12,449,215         11,201,952         4,019,469          7,285,721          6,386,522          10,409,499         11,125,310
LargestScaffoldSize     55,691,819         49,398,065         39,879,305         18,557,224         39,887,084         35,673,135         56,620,707         51,053,708
[Contigs]
# of Contigs            170,049            75,549             70,863             224,563            222,786            71,760             126,053            85,191
AvgContigSize           6,012              14,289             15,334             4,627              4,821              14,368             9,714              12,291
N50ContigSize           10,005             41,251             55,633             8,622              14,413             27,014             38,549             45,280
LargestContigSize       150,395            405,483            465,633            224,563            222,786            359,884            424,635            624,663

Total gaps in scaffolds (five values, in the order given on the slide): 160,463; 54,864; 45,651; 60,834; 124,736
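The N50 values in these tables follow the usual definition: sort the contigs (or scaffolds) from longest to shortest and report the length at which the running total first reaches half of all assembled bases. A minimal Python sketch of that definition; the function name is an assumption for illustration, and the assemblers' own reporting code may differ in detail:

```python
def n50(lengths):
    """Length at which the running total, longest first, reaches half of all bases."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length
```

With the toy lengths [2, 2, 2, 3, 3, 4, 8, 8] the total is 32 bases, and the two 8-base sequences already hold 16 of them, so the N50 is 8.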
MUMmer plot of synteny between Zebra Finch and Budgie draft assemblies: a snapshot of Chr 4
FLX PE, 454 short reads: 100s of scaffolds
FLX PE, 454 short + long reads: one ~39.9 MB scaffold
Zebra Finch Chr 4 [25 MB-65 MB] = 40 MB
www.454.com
MUMmer plot of synteny between Zebra Finch and Budgie draft assemblies: a snapshot of Chr 1
FLX PE, 454 short reads: 6 scaffolds
FLX PE, 454 short + long reads: one ~18 MB scaffold
Zebra Finch Chr 1: 18 MB region
Assembly of equivalent 400 bp (Titanium) and 760 bp (Flx+) sequence

Assembly Metrics        Titanium Reads, FLX PE    FLX+, Titanium, FLX PE    % change with FLX+ runs
Sequence Depth          6                         6                         -
estimatedgenomesize     1405.7 MB                 1409.2 MB                 -
numalignedreads         30150439 (94.48%)         26736754 (94.53%)         -
numalignedbases         8018686780 (95.20%)       8019891335 (94.82%)       -
numberassembled         29089915                  25734082                  -
numberpartial           1057907                   999888                    -5.48
numbersingleton         839565                    721011                    -14.12
numberrepeat            628090                    562055                    -10.51
numberoutlier           297929                    267573                    -10.19
numberwithbothmapped    7177712                   7242926                   0.91

Scaffold Metrics
numberofscaffolds       54428                     53581                     -1.56
numberofbases           1225236944                1241702153                1.34
avgscaffoldsize         22511                     23174                     2.95
N50ScaffoldSize         1943393                   2463264                   26.75
largestscaffoldsize     13998251                  15593718                  11.40

LargeContigMetrics
numberofcontigs         418038                    302341                    -27.68
numberofbases           969330616                 993764293                 2.52
avgcontigsize           2318                      3286                      41.76
N50ContigSize           3252                      5214                      60.33
largestcontigsize       39159                     57462                     46.74
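The "% change" column and the sequence depth can be re-derived from the other columns. A short Python sketch using figures from this slide; the helper names are hypothetical, and depth is assumed here to mean aligned bases divided by estimated genome size:

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

def sequence_depth(aligned_bases, genome_size_bp):
    """Average depth, assumed to be aligned bases per base of estimated genome size."""
    return aligned_bases / genome_size_bp

# N50ContigSize: Titanium + FLX PE versus FLX+ + Titanium + FLX PE.
n50_contig_change = percent_change(3252, 5214)

# numberofcontigs for the same two assemblies.
contig_count_change = percent_change(418038, 302341)

# numalignedbases over estimatedgenomesize (1405.7 MB) for the Titanium assembly.
depth_titanium = sequence_depth(8_018_686_780, 1405.7e6)

print(round(n50_contig_change, 2), round(contig_count_change, 2), round(depth_titanium, 1))
```

The computed changes round to 60.33 and -27.68, matching the table, and the depth comes out near 5.7, consistent with the reported sequence depth of 6.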
Assembly completeness of 3392 highly homologous exons
[Figure: contig (Cont) and scaffold (Scaff) completeness for three read sets: 454 Flx+ & Illumina; 454 Flx+; 454 Titanium]
Used the CABOG Celera assembler with different read lengths and technologies. Cont = contigs; Scaff = scaffolds
Assembly of genes of interest
Single vs multi-exon genes
- Egr1: 2-exon gene, with a highly GC rich exon 1
- FoxP2: 16-exon gene, with one GC rich exon
- Dusp1: gene with a repetitive regulatory region
- Other genes? Use zebra finch exons that are >87% identical between finch and chicken to find parrot exons in the assemblies and reads
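The >87% identity filter can be illustrated on a pair of already-aligned exon sequences. This is a minimal Python sketch, assuming identity is counted as matching columns over all aligned columns (gaps count as mismatches); the slide does not state the exact definition or the alignment tool used, and the function names are hypothetical:

```python
def percent_identity(aln_a, aln_b):
    """Percent identity between two aligned sequences of equal length ('-' = gap)."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences have a gap.
    columns = [
        (a, b)
        for a, b in zip(aln_a.upper(), aln_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("no aligned columns")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)

def passes_cutoff(aln_a, aln_b, cutoff=87.0):
    """True when the pair is more than `cutoff` percent identical."""
    return percent_identity(aln_a, aln_b) > cutoff
```

For example, two 10-column alignments that differ at one position are 90% identical and pass the cutoff.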
Single exon genes: dusp14
[Figure columns: Nb-454 short, Nb-454 long, Nb-hybrid, CA-454 short, CA-454 long, CA-hybrid]
Nearly all high complexity single exon genes (40-60% GC) examined thus far have full coverage (97-100%) in all assemblies.
Nb = Newbler; CA = Celera; 454 short = Titanium; 454 long = Flx+; hybrid = 454 short + long + Illumina
Multi-exon genes: GluR2 assembly
[Figure columns: Nb-454 short, Nb-454 long, Nb-hybrid, CA-454 short, CA-454 long, CA-hybrid]
BUT: Many high complexity multi-exon genes (40-60% GC) are split across multiple scaffolds with 454 short reads using Newbler, but are assembled on one scaffold using longer reads or Celera.
GC rich exons: FoxP2 (language evolution)
[Figure columns: Nb-454 short, Nb-454 long, Nb-hybrid, CA-454 short, CA-454 long, CA-hybrid]
GC rich exons (>70%) have poorer assembly. Some algorithms can still handle them.
Nb = Newbler; CA = Celera; 454 short = Titanium; 454 long = Flx+; hybrid = 454 short + long + Illumina
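GC richness of an exon can be checked directly from its sequence. The Python sketch below computes percent GC and flags sliding windows above the >70% threshold mentioned on this slide; the window and step sizes are arbitrary assumptions for illustration, not parameters from the original analysis:

```python
def gc_percent(seq):
    """Percent of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A/C/G/T bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)

def gc_rich_windows(seq, window=100, step=50, threshold=70.0):
    """Return (start, end, gc_percent) for each window whose GC exceeds the threshold."""
    hits = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        pct = gc_percent(chunk)
        if pct > threshold:
            hits.append((start, start + len(chunk), pct))
    return hits
```

On a toy sequence of ten AT dinucleotides followed by ten GC dinucleotides, with window=10 and step=10, only the two windows in the GC half are reported.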
GC rich exons: Dusp6, a behaviorally regulated gene
[Figure columns: Nb-454, Nb-454 long, Nb-hybrid, CA-454, CA-454 long, CA-hybrid]
EXON 1 is missing from some assemblies of the dusp6 gene. What happened?
Nb = Newbler; CA = Celera; 454 short = Titanium; 454 long = Flx+; hybrid = 454 short + long + Illumina
Dusp6 reads Sufficient exon 1 reads & overlaps for assembly
GC rich exons: Dusp6 assembly
[Figure columns: Nb-454, Nb-454 long, Nb-hybrid, CA-454, CA-454 long, CA-hybrid]
Conclusions:
- Newbler: GC exons (60-70%) were not brought into scaffolds with 454 reads (they are present in contigs) because they were part of alternative paths. The 454 + Illumina hybrid resolved the assembly.
- Celera: GC exons (60-70%) in 454 short (400 bp) reads were placed in the degenerate file and not assembled; with long reads (760 bp), the sequence was no longer labeled degenerate and thus was assembled.
GC rich exons: Egr1, a behaviorally regulated gene
[Figure columns: Nb-454 short, Nb-454 long, Nb-hybrid, CA-454 short, CA-454 long, CA-hybrid]
EXON 1 is missing from all assemblies of the egr1 gene. What happened?
GC rich exons: Egr1 shotgun reads. No reads of exon 1 in the shotgun libraries. GC rich exon (80%)
GC rich exons: Egr1 paired-end reads. Very few reads of exon 1 in the paired-end libraries. GC rich exon (88%)
GC rich promoter and exon Egr1 gene assembly Part of promoter and exon 1 missing in all assemblies
Even the Sanger method misses GC rich regions: Egr1 assembly
[Figure panels: zebra finch genome; chicken genome; parrot genome]
All species are missing the GC rich promoter region (75-90%).
~1,200 bp regulatory region of various microsatellite repeats In dusp1 regulatory region GGGATAACAGCACAGCCCTTAAACCCCCCTGGGGTAACAGGACAGCCCTTAAACCCCCCTGGGGTAACTGAGA ACAACCCTTAAACCCCCCTGGGGTAACAGCACAGCTCTTAAACCCCGAATTCTGAATCCACCCTGGCCCCATG GAGCATACACAGAGTGTGTGTGTGAATATGTGATTTTCTGTGTGAATATGTGATTTTGTGTGAATATGTGATT TTGTGTGCGAATATGTGATTCTGTGTGTGAATATGTGATTCTGTGTGTGAATATGTCATTTTCTGTGTGAATA TGTGATTTTGTGTGAATGTGTGATTTTCTGTGTGAATATGTGATAATATGTGATTTTGTGTGTGAATATGTGA TTCTATGTGAATATGTGATTGATTTTCTGTGTGAATATGTGATTTTGTGTGAATGTGTGATTTTTGTGTGAAT ATGTGATTTTCTGTGTGAATATGTGATTTTCTGTGTGAATATGTGATTTTTCAGAAAGTCGCAGGGTGGTTTG GCTCACACTCGCACTCACACTCTCACACACTCACACTCTCTCACTCTCACTCACACTCACACTCACACTCTCA CACTCTCTCACACTCTCTCACACTCTCACACTCTCTCACACACACACTCATACACTCCCACTCACACATACTC TCACACTCACACACTCTCACACTCTCACACTCTAACACACTCACACACTCACACACTCACACTCACACTCATA CTCACACACTCACACACTCACACTCACACTCTAACACACTCACACACTCACACTCACACTCACTTTTTCTCTT TTCTCACTTTTTCTCTCTCCCTCTCCCGCGCTCCGCGGCCGCCCCGCTCCCGATGACGTCGCACCGGCGGGGC GGGCCGCGCCCTCGCTGGCGCGCGGCCAGGCTGACGTCATCGGCCGCCCCGCCCCCCCACGTGACGCGGCCC ATTGAGAAAACGCCGTCCCGCCGCGCGGCCCCATATAAGGGCGGGAGCGGCGGGGCACCGGGACAGCCGGGCC ACCGCACCTCTGAGCTCTGCCCTGCCCTCCTTCCCTCCCCACAGCCATCCCCGCGCTGCCCGGCCATGGTGAA CCTGCGGGTGTGCGCGCTGGACTGCGAGGCGCTGCGGGCGCTGCTGCAGGAGCGCGGCGCGCAGTGCCTCGTC CTCGACTGCCGCTCCTTCTTCTCCTTCAA Horita et al (submitted)
Dusp1 convergent promoter changes in vocal learners Vocal learners Vocal non-learners Horita et al (submitted)
Repetitive microsatellite assembly in dusp1 promoter
[Figure columns: Nb-454, Nb-454 long, Nb-hybrid, CA-454, CA-454 long, CA-hybrid; ATG marked]
Conclusions: Only the long reads (~760 bp) allowed full and correct assembly of the repetitive microsatellite sequence in the parrot dusp1 promoter.
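Microsatellite runs like those in the dusp1 promoter can be located with a simple pattern search for a short unit repeated in tandem. This is a minimal Python sketch, not the method used in the talk; the function name and default thresholds are assumptions:

```python
import re

def find_tandem_repeats(seq, unit_len=2, min_copies=4):
    """Find perfect tandem repeats of a fixed unit length.

    Returns a list of (start, unit, copies) tuples, with 0-based start positions.
    """
    if unit_len < 1 or min_copies < 2:
        raise ValueError("unit_len must be >= 1 and min_copies >= 2")
    # Capture one unit, then require it to repeat at least (min_copies - 1) more times.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // unit_len
        hits.append((match.start(), unit, copies))
    return hits
```

For the toy input "TTGTGTGTGTGAA" with the defaults, the search reports one TG repeat of 5 copies starting at position 1. Imperfect repeats, which are common in real promoters, would need a more tolerant approach.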
Genome 10000 (G10K) consortium: Assemblathon 2 competition - parrot
Three technologies:
- 454: short (200 bp) & long (750 bp) read lengths, shotgun and paired-end with 3, 8, 20 Kb insert sizes, 16X coverage (Roche and Duke)
- Illumina HiSeq: 100 bp paired-end/mate-pair reads with 0.2, 0.5, 0.8, 5, 10, 20 and 40 Kb insert sizes, TruSeq v3 GC chemistry, 120X coverage (BGI & Illumina)
- PacBio: ~3000 bp average read length, but 15% error; 7, 10 Kb insert sizes; 5X coverage (PacBio)
Genome 10000 (G10K) consortium: Assemblathon 2 competition - parrot
Three technologies: 454 long Flx+; Illumina HiSeq; PacBio long
25 assembly groups:
- Overlap-Layout-Consensus (e.g. Celera CABOG, PCAP, Newbler, etc.)
- Eulerian de Bruijn graphs (e.g. ALLPATHS, SOAPdenovo, Velvet, etc.)
- Hybrid inventions
Two validation methods:
- Optical maps (contig and scaffold accuracy)
- 40K pooled (10) fosmid and single molecule clones sequenced (bp accuracy)
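For readers unfamiliar with the de Bruijn graph family of assemblers, the core data structure can be sketched in a few lines of Python: every k-mer in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is a toy illustration only; real assemblers add error correction, graph simplification and paired-end scaffolding, and the function name here is hypothetical:

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph: (k-1)-mer prefix -> list of (k-1)-mer suffixes."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k along the read; reads shorter than k add nothing.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

With the single read "ACGT" and k=3, the k-mers ACG and CGT give the edges AC -> CG and CG -> GT. Overlap-Layout-Consensus assemblers instead compare whole reads to each other, which is one reason longer reads help them more directly.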
Challenges for the future for FLX+
Limitations:
- Cost vs assembly bp accuracy vs assembly completeness
- Algorithms for hybrid assemblies
- Overcoming GC rich anti-bias
[Figure: theoretical predictions to generate a high quality assembly; bp coverage (5X to 100X) versus read length (1 to 1500); cost labels: $ low, $ high]
Challenges for complete genome assembly Theoretical predictions to generate high quality assembly Close to theory on Dog genome long reads; Less than theory on Panda short reads Schatz et al 2010 Genome Research
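The slide does not spell out the model behind the theoretical predictions. One common idealized model for this kind of estimate is Lander-Waterman, which assumes reads land uniformly at random and ignores repeats, sequencing bias and errors. A minimal Python sketch under that assumption; the function names and example numbers are hypothetical:

```python
import math

def coverage(num_reads, read_len, genome_size):
    """Average coverage c = N * L / G."""
    return num_reads * read_len / genome_size

def expected_fraction_covered(c):
    """Expected fraction of the genome covered by at least one read: 1 - e^(-c)."""
    return 1.0 - math.exp(-c)

def expected_num_contigs(num_reads, read_len, genome_size, min_overlap):
    """Expected number of contigs: N * e^(-c * sigma), with sigma = 1 - T / L."""
    c = coverage(num_reads, read_len, genome_size)
    sigma = 1.0 - min_overlap / read_len
    return num_reads * math.exp(-c * sigma)
```

Under this model 5X coverage already covers more than 99% of an idealized genome, and at a fixed coverage and minimum overlap, longer reads yield fewer expected contigs. Real genomes fall short of the model because of repeats and GC bias, which is the gap the slide points to.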
Acknowledgements
Jarvis Lab: Jason Howard, James Ward (now at NIEHS), Ganesh Ganapathy, Haruhito Horita
Roche 454 sequencing:
- Duke Genome Center: Lisa Bukovnik, Ty Wang, Olivier Fedrigo
- Roche support team: Xuemin Liu, Chinnappa Kodira
Illumina sequencing: Tin Le (Illumina UK), Guojie Zhang (BGI), Yingrui Li (BGI)
PacBio sequencing: Eric Schadt, Edwin Hawe, Lawrence Lee
Assembly: Adam Phillippy (CABOG; Univ Maryland), Sergey Koren (CABOG; Univ Maryland), Wes Warren (Newbler; Wash Univ), James Knight (Newbler; Roche 454), Roger Winer (Newbler; Roche 454), Bo Li (SOAPdenovo; BGI)
Optical maps: David Schwartz, Shiguo Zhou
Fosmids: Jay Shendure
Funding: NIH Director's Pioneer Award; Howard Hughes Medical Institute
Previous students and Post Docs now with own labs Dr. Lubica Kubikova Dr. Raphael Pinaud Dr. V. Ann Smith Dr. Liisa Tremere Dr. Kazuhiro Wada Dr. Jing Yu Rui Wang Dr. Osceola Whitney Jason Howard Haru Horita Jarvis lab Maurice Anderson Eric Zhou Michael Silva Gustavo Arriaga Dr. Petra Roulhac Gurkan Yardimchi Andreas Pfenning Dr. Erich Tony Jarvis Zimmermann Theresa Renuart Dr. Miriam Rivas Dr. Chun-Chun Chen Alisa Ray Erina Hara Not present: Nicole Nelson Alyssa Zhu