Cod: first version. 6467 scaffolds, 35% gap bases.

1 Cod: first version 6467 scaffolds, 35% gap bases [Figure: genome browser view of contigs separated by gaps, with an ENSGMOT protein-coding transcript track]

2 Cod: first version LETTER doi: /nature10342
The genome sequence of Atlantic cod reveals a unique immune system
Bastiaan Star 1, Alexander J. Nederbragt 1, Sissel Jentoft 1, Unni Grimholt 1, Martin Malmstrøm 1, Tone F. Gregers 2, Trine B. Rounge 1, Jonas Paulsen 1,3, Monica H. Solbakken 1, Animesh Sharma 4, Ola F. Wetten 5,6, Anders Lanzén 7,8, Roger Winer 9, James Knight 9, Jan-Hinnerk Vogel 10, Bronwen Aken 10, Øivind Andersen 11, Karin Lagesen 1, Ave Tooming-Klunderud 1, Rolf B. Edvardsen 12, Kirubakaran G. Tina 1,13, Mari Espelund 1, Chirag Nepal 4,8, Christopher Previti 8, Bård Ove Karlsen 14, Truls Moum 14, Morten Skage 1, Paul R. Berg 1, Tor Gjøen 15, Heiner Kuhl 16, Jim Thorsen 17, Ketil Malde 12, Richard Reinhardt 16, Lei Du 9, Steinar D. Johansen 14,18, Steve Searle 10, Sigbjørn Lien 13, Frank Nilsen 19, Inge Jonassen 4,8, Stig W. Omholt 1,13, Nils Chr. Stenseth 1 & Kjetill S. Jakobsen 1
Atlantic cod (Gadus morhua) is a large, cold-adapted teleost that sustains long-standing commercial fisheries and incipient aquaculture 1,2. Here we present the genome sequence of Atlantic cod, showing evidence for complex thermal adaptations in its haemoglobin gene cluster and an unusual immune architecture compared to other sequenced vertebrates. The genome assembly was obtained exclusively by 454 sequencing of shotgun and paired-end libraries, and automated annotation identified 22,154 genes. The major histocompatibility complex (MHC) II [...]
[...] independently assembled bacterial artificial chromosome (BAC) insert clones (Supplementary Note 14 and Supplementary Fig. 9), and with the expected insert size of paired BAC-end reads (Supplementary Note 15 and Supplementary Fig. 10). A standard annotation approach based on protein evidence was complemented by a whole-genome alignment of the Atlantic cod with the stickleback (Gasterosteus aculeatus), after repeat-masking 25.4% of the Newbler assembly (Supplementary Note 16 and Supplementary [...]

3 Cod: first version Take care when using the cod genome

4 The causes Short Tandem Repeats (>20% of gaps)

5 The causes Percentage of bases in 63 vertebrate genomes classified as short tandem repeats [Figure: scatter plot with axes dinucleotide AG (%) and dinucleotide AC (%); labelled species: Cat, Hedgehog, Rabbit, Ferret, Cave fish, Mouse, Rat, Tetraodon, Stickleback, Fugu, Zebrafish, Atlantic cod] Bastiaan Star, CEES

6 The causes Heterozygosity? [Diagram: Contig 1, Polymorphic contig 2, Polymorphic contig 3, Contig 4]

7 Cod: phase 2 goal 23 pseudochromosomes. Longer contigs. Below 5% gap bases. [Figure: genome browser view of contigs with an ENSGMOT protein-coding transcript track]

8 Many programs to choose from Celera Zhang et al. PLoS ONE 2011

9 Cod: phase 2 sequencing Phase 2 Illumina sequencing Paired end >200x Mate Pair 5kb >100x

10 Cod: phase 2 computation 2009: cod1.uio.no cod2.uio.no 2011: cod3.uio.no cod4.uio.no 24 processors, 125 GB RAM 64 processors, 512 GB RAM 2012: 20 TB disk for cod3, 20 TB disk for cod4

11 Cod: phase 2 status Fewer, longer scaffolds OR fewer gap bases [Figure: genome browser view of contigs with an ENSGMOT protein-coding transcript track]

12 Many programs to choose from Celera Zhang et al. PLoS ONE 2011

13 Enter PacBio

14 Pacific Biosciences Single-molecule Photo: Tore Oldeide Elgvin

15 Pacific Biosciences

16 Pacific Biosciences Library preparation: SMRTbell template. Standard Sequencing: large insert sizes; generates one pass on each molecule sequenced. Circular Consensus Sequencing: small insert sizes; generates multiple passes on each molecule sequenced.

17 Pacific Biosciences Raw reads and subreads. Standard Sequencing: large insert sizes; generates one pass on each molecule sequenced. Circular Consensus Sequencing: small insert sizes; generates multiple passes on each molecule sequenced. [Diagram label: Subreads]

18 Pacific Biosciences Raw read quality. Standard Sequencing: large insert sizes; generates one pass on each molecule sequenced; 85-87% accuracy; random errors (!). Circular Consensus Sequencing: small insert sizes; generates multiple passes on each molecule sequenced; 4 to 5 passes give accuracies in the high 90s (%); 5 passes yield an average of Q30 (1:1000 chance of error).
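The Q30 figure on this slide follows from the standard Phred definition, in which a quality score Q corresponds to an error probability p = 10^(-Q/10). The short sketch below is illustrative only (the function names are not from the talk); it converts between the two and shows roughly where the 85-87% single-pass accuracy lands on the same scale.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred quality score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def accuracy_to_phred(accuracy):
    """Phred score for a per-base accuracy given as a fraction (e.g. 0.85)."""
    return -10 * math.log10(1 - accuracy)


# Q30 is the "1:1000 chance of error" quoted on the slide.
print(phred_to_error_prob(30))

# Single-pass accuracy range from the slide, expressed as Phred scores.
for acc in (0.85, 0.87):
    print(acc, round(accuracy_to_phred(acc), 1))
```

On this scale, single-pass reads sit below Q10, which is why the slides go on to discuss multiple passes and error correction.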

19 PacBio: raw reads

20 PacBio and de novo assembly Libraries: Standard Sequencing, large insert sizes. Aim for looooong insert sizes

21 Longer reads? Long reads can span repeats and heterozygous regions [Diagrams: Repeat copy 1, Repeat copy 2; Contig 1, Polymorphic contig 2, Polymorphic contig 3, Contig 4]

22 PacBio: uses Long reads -> low quality. Standard Sequencing: large insert sizes; generates one pass on each molecule sequenced. Single pass: 85-87% accuracy. Useful for assembly?

23 PacBio: long read uses For de novo. For scaffolding. http://schatzlab.cshl.edu/presentations/ PAG.SMRTassembly.pdf

24 PacBio: long read uses For de novo -> error-correct with short reads http://schatzlab.cshl.edu/presentations/ PAG.SMRTassembly.pdf

25 Pacific Biosciences Raw read quality. Standard Sequencing: large insert sizes; generates one pass on each molecule sequenced; 85-87% accuracy; random errors (!). Error-correct using PacBioToCA and other programs

26 Pacific Biosciences Raw read quality. Standard Sequencing: large insert sizes; generates one pass on each molecule sequenced; 85-87% accuracy; random errors (!). [Figure: read length histogram (reads vs. read length), doi: /m9.figshare] Using this to error-correct this. Need x coverage! PacBioToCA, HGAP

27 PacBio: HGAP [Figure: read length histogram (reads vs. read length), doi: /m9.figshare] Using this to error-correct this. Hierarchical Genome Assembly Process (HGAP): PacBio long reads, pre-assembled reads, contigs, scaffolds, finished genome. http://

28 PacBio and base modifications

29 PacBio and base modifications

30 PacBio reads for Atlantic cod

31 Cod: PacBio reads 15x coverage. Average raw read length: 3.6 kbp. Longest: 29 kbp. 147 SMRT Cells. Standard Sequencing, large insert sizes
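The numbers on this slide can be related to each other through the usual definition, coverage = total sequenced bases / genome size. The sketch below is a rough illustration only: the genome size of 830 Mbp is an assumption (it is not stated in the slides), and the function and constant names are made up for the example, so the derived read counts are estimates and not reported results.

```python
def total_bases_for_coverage(coverage, genome_size_bp):
    """Total sequenced bases needed to reach a given average coverage."""
    return coverage * genome_size_bp


def reads_for_coverage(coverage, genome_size_bp, mean_read_length_bp):
    """Approximate number of reads needed, given a mean read length."""
    return total_bases_for_coverage(coverage, genome_size_bp) / mean_read_length_bp


GENOME_SIZE_BP = 830_000_000   # ASSUMPTION: approximate cod genome size, not from the slides
COVERAGE = 15                  # from the slide
MEAN_READ_LENGTH_BP = 3_600    # from the slide (3.6 kbp)
SMRT_CELLS = 147               # from the slide

reads = reads_for_coverage(COVERAGE, GENOME_SIZE_BP, MEAN_READ_LENGTH_BP)
print(f"~{reads / 1e6:.1f} million reads in total")
print(f"~{reads / SMRT_CELLS:,.0f} reads per SMRT Cell")
```

Changing `GENOME_SIZE_BP` scales both estimates linearly, so the result is only as good as that assumed value.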

32 Cod: PacBio results Mapping to the published genome [Figure: three genome browser views (forward strand, coordinates in Kb) showing an 11.4 kbp, a 10.6 kbp and a 10.9 kbp subread as BLAT/BLAST hits against the published assembly; tracks: vertebrate cDNAs, Unigene EST clusters, contigs, Ensembl genes (ENSGMOT projected protein coding)]

33 Cod2: assembly programs Zhang et al. PLoS ONE 2011

34 Cod2: assemblies Newbler: BAC ends. Celera: Illumina. Celera: raw (!) PacBio. ALLPATHS-LG: Illumina

35 Contigs and scaffolds: reads, contigs, scaffolds
Reads:
GGTTACCACGCGTAGCGCAT
TTACCACGCGTAGCGCATTA
ACCACGCGTAGCGCATTACA
CACGCGTAGCGCATTACACA
CGCGTAGCGCATTACACAGA
CGTAGCGCATTACACAGA
TAGCGCATTACACAGA
Contig: GCGATTCAGGTTACCACGCGTAGCGCATTACACAGA
[Diagram: a scaffold made of contigs separated by gaps]
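The relation between scaffolds, contigs and gaps, and the "% gap bases" figure used throughout these slides, can be sketched in a few lines. This is a minimal illustration that assumes gaps are written as runs of N in the scaffold sequence (a common FASTA convention, not something stated on the slide); the function names and the toy scaffold are hypothetical.

```python
import re


def split_into_contigs(scaffold):
    """Split a scaffold sequence into contigs at runs of N (gap bases)."""
    return [piece for piece in re.split(r"N+", scaffold.upper()) if piece]


def percent_gap_bases(scaffold):
    """Percentage of positions in the scaffold that are gap bases (N)."""
    if not scaffold:
        raise ValueError("empty scaffold")
    return 100.0 * scaffold.upper().count("N") / len(scaffold)


# Hypothetical toy scaffold: two contigs joined by a 5-base gap.
toy_scaffold = "GGTTACCACG" + "N" * 5 + "GCATTACACAGA"
print(split_into_contigs(toy_scaffold))
print(round(percent_gap_bases(toy_scaffold), 1))
```

Summing the N counts and lengths over all scaffolds of an assembly gives the genome-wide gap percentage in the same way.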

36 N50: 50% of the genome is in contigs as large as the N50 value or larger [Figure: worked example for a 1000 bp genome] Courtesy of Michael Schatz, CSHL
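The N50 definition can be made concrete with a small function: sort the contig lengths from longest to shortest and add them up until half of the total is reached; the length at which that happens is the N50. The sketch below takes the total as the sum of the contig lengths (the slide's example uses the genome size instead), and the contig lengths in the example are hypothetical, chosen only so that they add up to 1000 bp.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths.

    At least half of the total length is contained in sequences
    of this length or longer.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")


# Hypothetical contig lengths summing to 1000 bp.
contig_lengths = [300, 200, 150, 100, 100, 80, 70]
print(n50(contig_lengths))  # 300 + 200 = 500, half of 1000, so N50 is 200
```

The same function applies to scaffold lengths, which gives the scaffold N50 values compared in the assembly slides.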

37 Cod2: assemblies [Figure: scaffold N50 (Kbp, 0 to 1600) vs. contig N50 (Kbp, 0 to 120). Points: Goal, Newbler (454), Celera (454 + Ilmn), ALLPATHS-LG (Ilmn), Celera (PacBio + 454), Merged]

38 Cod2: assemblies [Figure: scaffold N50 (Kbp, 0 to 1600) vs. contig N50 (Kbp, 0 to 120). Points: Goal, Cod1, Newbler (454), Celera (454 + Ilmn), ALLPATHS-LG (Ilmn), Celera (PacBio + 454), Merged]

39 Cod2: assemblies [Figure: scaffold N50 (Kbp, 0 to 1600) vs. contig N50 (Kbp, 0 to 120). Points: Goal, Cod1, Newbler (454), Celera (454 + Ilmn), ALLPATHS-LG (Ilmn), Celera (PacBio + 454), Merged]

40 Cod2: assemblies [Figure: % gap bases (0 to 35) vs. contig N50 (Kbp, 0 to 120). Points: Goal, Newbler (454), Celera (454 + Ilmn), ALLPATHS-LG (Ilmn), Celera (PacBio + 454), Merged]

41 Cod2: assemblies [Figure: % gap bases (0 to 35) vs. contig N50 (Kbp, 0 to 120). Points: Goal, Cod1, Newbler (454), Celera (454 + Ilmn), ALLPATHS-LG (Ilmn), Celera (PacBio + 454), Merged]

42 Cod2: assemblies [Figure: % gap bases (0 to 35) vs. contig N50 (Kbp, 0 to 120). Points: Goal, Cod1, Newbler (454), Celera (454 + Ilmn), ALLPATHS-LG (Ilmn), Celera (PacBio + 454), Merged]

43 Cod2: gene space 457 of 458 conserved eukaryotic genes. Genis Parra, Keith Bradnam and Ian Korf. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, Vol. 23, doi: /bioinformatics/btm071

44 Golden rule? No single assembler gives the optimal assembly

45 Cod2: merging assemblies Celera: long contigs, short scaffolds, short gaps. Newbler: short contigs, long scaffolds. Combined: long contigs, long scaffolds. [Diagrams: scaffold, contig and gap layout for each] Slide modified after Ole Kristian Tørresen

46 Cod2: adding PacBio Using PBJelly: closed gap, reduced gap. [Diagram: PacBio reads aligned across the gaps between contigs in a scaffold] Slide courtesy of Ole Kristian Tørresen

47 Cod2: polishing the assembly 454 and Illumina reads [Diagram: reads aligned to the contigs of a scaffold] Slide courtesy of Ole Kristian Tørresen

48 Cod2: assemblies [Figure: scaffold N50 (Kbp, 0 to 1600) vs. contig N50 (Kbp, 0 to 120). Points: Goal, Newbler (454), Celera (454 + Ilmn), ALLPATHS-LG (Ilmn), Celera (PacBio + 454), Merged] Image: Wikimedia, user Dsmurat

49 Cod2: assemblies [Figure: % gap bases (0 to 35) vs. contig N50 (Kbp, 0 to 120). Points: Goal, Newbler (454), Celera (454 + Ilmn), ALLPATHS-LG (Ilmn), Celera (PacBio + 454), Merged]

50 Cod2: conclusions A much better assembly. Multiple technologies needed. [Figure: genome browser view of contigs with an ENSGMOT protein-coding transcript track]

51 What's next

52 Aqua genome project Resequencing: 1000 salmon genomes, 1000 cod genomes. Images: Wikipedia, http://fishandboat.com

53 More (fish) genomes 60+ fish genome project. Martin Malmstrøm; several sources

54 Developments in High Throughput Sequencing [Figure: gigabases per run (log scale) vs. read length (log scale). Platforms: Sanger (ABI 3730xl), Roche/454 GS FLX, 454 GS Junior, Illumina GA II, SOLiD, Illumina MiSeq, Ion Torrent PGM, Ion Proton, Illumina HiSeq 2000/2500, HiSeq 2500 RR, HiSeq X, NextSeq 500, PacBio RS (marked with "?")] Lex Nederbragt (2014) http://dx.doi.org/ /m9.figshare

55 Developments in High Throughput Sequencing [Figure: gigabases per run (log scale) vs. read length (log scale). Platforms: Sanger (ABI 3730xl), Roche/454 GS FLX, 454 GS Junior, Illumina GA II, SOLiD, Illumina MiSeq, Ion Torrent PGM, Ion Proton, Illumina HiSeq 2000/2500, HiSeq 2500 RR, HiSeq X, NextSeq 500, PacBio RS] Lex Nederbragt (2014) http://dx.doi.org/ /m9.figshare

56 The Oxford Nanopore MinION: early experiences

57 Nanopore sequencing

58 Oxford Nanopore AGBT conference, February 2012. MinION: 512 nanopores, 150 Mb/hour, up to 6 hours, $900

59 Oxford Nanopore MinION nanoporetech.com

60 Oxford Nanopore MinION Via http://omicfrontiers.com from http://

61 Oxford Nanopore MinION From http:// (copyright uncertain)

62 MinION data

63 MinION data ~ reads 161 Mbp Mean 5.5 kbp Accuracy comparable to Pacific Biosciences

64 MinION data

65 Next-generation sequencing and bioinformatics

66 Sequencing cost http://

67 Doubling times [Figure: DNA sequencing (bp/$) and hard disk storage (MB/$) by year, log scale] Pre-NGS (bp/$): doubling time 19 months. NGS (bp/$): doubling time 5 months. Hard disk storage (MB/$): doubling time 14 months. http://genomebiology.com/2010/11/5/207
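To get a feel for what these doubling times mean, the growth over a period of t months with doubling time d is a factor of 2^(t/d). The sketch below applies this to the three doubling times on the slide; the five-year window and the names used are arbitrary choices for illustration, not values from the talk.

```python
def fold_increase(months, doubling_time_months):
    """Growth factor after `months`, given a constant doubling time."""
    return 2 ** (months / doubling_time_months)


# Doubling times as given on the slide.
DOUBLING_TIMES_MONTHS = {
    "Pre-NGS sequencing (bp/$)": 19,
    "NGS sequencing (bp/$)": 5,
    "Hard disk storage (MB/$)": 14,
}

PERIOD_MONTHS = 60  # hypothetical five-year window, chosen for illustration

for label, doubling_time in DOUBLING_TIMES_MONTHS.items():
    factor = fold_increase(PERIOD_MONTHS, doubling_time)
    print(f"{label}: x{factor:,.1f} in {PERIOD_MONTHS} months")
```

With a 5-month doubling time, bases per dollar grow far faster than megabytes per dollar, which is the storage and analysis squeeze the following slides allude to.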

68 Challenges Constant stream of new software http://seqanswers.com/wiki/software

69 Challenges Constant stream of new software -> hard to judge if programs are any good -> sometimes a challenge to install a program and get it working http://neidetcher.com/ubuntu_package_dependency.html

70 Dependency hell http://neidetcher.com/ubuntu_package_dependency.html

71 Dependency hell http://en.wikipedia.org/wiki/dependency_hell http://neidetcher.com/ubuntu_package_dependency.html

72 How to become a bioinformatician

73 Learn The command line

74 Learn A programming language

75 Learn How to use the internet

76 Learn From each other

77 Learn Attend a Software Carpentry workshop http://software-carpentry.org/

78 Shameless self-promotion flavors.me/flxlex

79 Read length matters 5.2 Mb circular genome, infinite error-free reads. Roberts et al (2013) doi: /gb