Genome Assembly. Microbial Single Cell Genomics Workshop 2010 Sergey Koren JCVI, CBCB at UMD

Size: px
Start display at page:

Download "Genome Assembly. Microbial Single Cell Genomics Workshop 2010 Sergey Koren JCVI, CBCB at UMD"

Transcription

1 Genome Assembly Microbial Single Cell Genomics Workshop 2010 Sergey Koren JCVI, CBCB at UMD

2 Introduction

3 Platform: 3730 Model length reads bases ABI K Applied Biosystems 3730xl DNA amplification in E. coli. Sanger dye-terminator PCR. Capillary gel electrophoresis. 90 min run makes 96 reads. 500 t0 800 bp/read. Paired end range 2Kbp - 40Kbp. Mate rate almost 100%. Reads include vector at 5 end. Quality drops gradually at 3 end. Traces available for inspection. ABI 3730 Trace from 3730

4 Platform: FLX Model length reads bases ABI K 454 FLX 400 1M 400M Roche 454 FLX Titanium Successor to GS 20 and FLX Standard Pyrosequencing: Emulsion PCR, Bead capture, Transfer to picoliter plate, sequencing by synthesis of complementary strand, Image processing Paired-ends: Reads are cut from single circles (in lab), Size options are 3Kbp or 8Kbp or 20Kbp, Linker-positive reads cut into pairs by software, 50% of reads are linker-positive. Run time is 9hr. Clear ranges of 80bp-500bp. Problems: Inaccurate counts of repeated bases, duplicate reads, lower output on paired-end. 454 Beads on a picotitre plate

5 Platform: Illumina Model length reads bases ABI K 454 FLX 400 1M 400M Illumina GA M 20G Illumina Solexa Genome Analyzer Bridge hybridization on flat surface, Cluster formation, Sequencing by synthesis, Image Processing. One 2x run in 10 days. Lengths have grown from 35 to 2x100. Reads have identical length initially, trimmed by software, Substitution more than indel, error accumulates at read end. Paired ends, sequencing both ends of a fragment, range of 35bp-700bp. Mate pairs, sequencing fragment cut from circle, range of 2Kbp-8Kbp. Illumina Flow cell with 8 lanes

6 Platform: SOLiD Model length reads bases ABI K 454 FLX 400 1M 400M Illumina GA M 20G ABI SOLiD 50 >1G 50G Applied Biosystems SOLiD System 4 succeeds 1 thru 3. Sequencing by reversible ligation, 5-mers query 2 consecutive bases per luminous interrogation, light signal is di-base encoded. Up to 16 days per run. Reads vary from perfect to unusable due to high error. Color space output, requires special software, allows detection and correction of some errors, vendor claims higher accuracy. ABI SOLiD Colors represent di-base encoding

7 Emerging Platforms Helicos Method: Single molecule sequencing, 35bp reads Application: whole-genome resequencing Capacity: 50 channels per 8-day run, 50X coverage of one E.coli genome per channel PacBio Single molecule with polymerase Light detection from small chambers Runs are fast (15m) but low-throughput Reads of 10Kbp possible Current reads are short, high-error Ion Torrent Electronic chip Charge detection, no photonics Machine is compact and cheap ($50K) Runs are fast (1hr) and cheap ($500) Millions of reads/run, hundreds of bases/read

8 Illumina Output s_3_1_0083_qseq.txt! HWUSI-EAS TAAACAACCCCGGGCTCACCGAGCTGAGTCTGTAGC ^IV`J^JaT\F\D^DFN ORH\GYIM\SSTWXF\T 0! HWUSI-EAS CAGCAGAAGCAAGGCGGCTGCAGAGAGCAGAGAGAC X]\LaaJ`a\MZWWGP_]IZZWZa\IPFIMFPa_TX 0! HWUSI-EAS AACCTCCCTTGCCAGCTTCCCCCGATAGTTAATGGG ab\fhew]`rt`vwikpr^]qrx[gfziqfr]_bbb 0! HWUSI-EAS AGGCTGTTAACTGAGTATGGCCGCTCCTATTTGTTT WSabaaVIHYH[]MKUZBBBBBBBBBBBBBBBBBBB 0! HWUSI-EAS CGTTTAGCTCTATCTTACACCACTTCCTCTCCTTTC O HYJXI_^GZ`bbbb^RN]IGRVI`K\YGZUH\[ 0! HWUSI-EAS AGCGAAGGGGCAGTGAAGGGACGAAAAGCCGAGCAG ZGHZ_^^OO_^BBBBBBBBBBBBBBBBBBBBBBBBB 0! HWUSI-EAS TCCTCTCCGAACGGTTGACAGCGCTCAGCATTGTAC a]b_b_bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb 0! HWUSI-EAS TGCCCATTGGAGGAAAAAAAATTACATAGATCTCCA ab_agydmzx^]`z^]hwyd[_w\\hqwumhwwax_ 0! HWUSI-EAS GCCCCGGC.TACGAATGCGGTCAGCGGAAAAAGCAG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0! HWUSI-EAS CGGCAATT.TTCTTATCAACACCACCCGAACGACTG abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb 0! HWUSI-EAS AAATCTG..AATGATTCTCCCTTCTGACGTACCACA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 Machine ID Lane 1 or 2 of pair Base calls Quality values run=090401! TAAACAACCCCGGGCTCACCGAGCTGAGTCTGTAGC! +HWUSI-EAS541:3:83:1787:876#0/1! ^IV`J^JaT\F\D^DFN run=090401! CAGCAGAAGCAAGGCGGCTGCAGAGAGCAGAGAGAC! +HWUSI-EAS541:3:83:1787:673:0#0/1! X]\LaaJ`a\MZWWGP_]IZZWZa\IPFIMFPa_TX! Alternate format, same data phred Quality Values, now used by Illumina QV=20 means 1/10 2 chance of error. QV=30 means 1/10 3 chance of error. QV=40 means 1/10 4 chance of error. Solexa Quality Values, formerly used by Illumina QV=2 is encoded as ascii( 2+64)= B. QV=20 is encoded as ascii(20+64)= T. QV=30 is encoded as ascii(30+64)= ^. QV=40 is encoded as ascii(40+64)= h. *64 is an arbitrary constant. Sanger files used 33.!

9 454 Output Example of using unix utilities from 454. Shown is sffinfo. See also sfffile.! $ sffinfo! Usage: sffinfo [options...] [- sfffile] [accno...]! Options:! -a or -accno Output just the accessions! -s or -seq Output just the sequences! -q or -qual Output just the quality scores! -f or -flow Output just the flowgrams! -t or -tab Output the seq/qual/flow as tab-delimited lines! -n or -notrim Output the untrimmed sequence or quality scores! Dump 1 read from SFF file for half-plate #2 Each flow yields 0 or more bases $ sffinfo FFPMSHM02.sff FFPMSHM02B382V! # of Flows: 400! # of Bases: 271! Clip Qual Left: 5! Clip Qual Right: 90 Flow Chars: TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG! Flowgram: ! Bases: tcagggggggggaaccactgtaacaaaggaaagtaggttggtctcgcggtgagtg! Quality Scores: ! The first 4 bases (lower case) are the key sequence for calibration. On this slide, the 4 colors show correspondence between flows and bases.

10 Color Space ABI SOLiD The reads have colors not bases. encoded as numbers 0 to 3 ABI claims colors improve accuracy. Software should align in color space. convert to bases after alignment ABI software catalog is skimpy. A followed by T is coded red. T followed by G is coded green. Note T is interrogated twice. The color code.

11 Sanger-era Paired Ends 1. Randomly shear copies of the genome 2. Build libraries of similar-length fragments 3. Amplify individual fragments 4. Sequence two ends of each fragment Insert size estimate given as µ ± σ per library. µ ± σ Oriented reads with distance estimate.

12 Sanger and 454 Paired Ends Sanger sequencing of insert in vector 454 sequencing across linker Mate (blue) is same as read (brown) Mate (blue) is reverse complement of read (brown) Sanger mate pair from both forward sequences, vector trimmed Sanger-style mate pair from opposite strands, linker trimmed

13 Sequencer Summary Sequencing Machines ABI 3730xl 454 FLX with Titanium: variable-length reads Illumina GA (Solexa): pairs of 100bp reads ABI SOLiD system 4: reads of 50 colors Helicos Heliscope: 35bp from single molecule Pacific Biosciences: variable-length reads from single molecule Ion Torrent: not yet available Sequencing Approaches Whole-genome shotgun Mate pair and paired end Bar coding Contract sequencing

14 de novo Assemblers Greedy Graph Algorithms for short reads SSAKE, SHARCGS, VCAKE Traditional Overlap/Layout/Consensus Algorithms Developed for Sanger reads Celera Assembler (maintained at JCVI, for reads of 100bp+) Newbler (proprietary from Roche, tuned for 454 reads) Others: TIGR, phrap, Arachne, PCAP, Phusion, Atlas, Jazz de Bruijn Graph Algorithms EULER: originally developed for Sanger reads Velvet: the first widely used assembler for short reads AllPaths: version 3 may scale up to human genome ABySS: distributed memory, distributed processing, first assembly of human from short (<100bp) reads SOAP de novo: best assembly of human from short reads, running on large-ram computer CLC: commercial assembler designed for mixed platform fragments, uses many processors and low memory

15 Algorithms: Basic Steps Read processing Filter erroneous reads and trim erroneous bases Graph construction Reads and their pair-wise overlaps, or K-mers and their K-1 overlaps Graph reduction Remove spurs (erroneous read ends) Collapse bubbles (minor polymorphisms) Tease apart collapsed repeats (use reads or mates) Form contigs but break at repeats Scaffold with mate pairs Consensus from multiple sequence alignment

16 Assembly Components Consensus AGGCATGACGGCTAGGCGCGTANNNNNNNNNCCGCGAATACGAG Scaffold = Contigs + Gaps (gap lengths derived from pairings) Contigs = Unitigs + Pairing Unitigs = Reads + Overlaps (up to apparent repeat boundaries) Overlaps Reads & Pairing

17 Algorithms: Techniques Allowing for sequencing error complicates everything Use sequence similarity as guide and paired-ends as constraint Techniques for extra speed Use indexes to avoid all-against-all pair-wise alignment K-mers (exact match substrings) are suitable for fast computes In De Bruijn Graph assembly (Velvet), identical K-mers collapse to graph nodes In traditional assembly (Celera Assembler) K-mers seed overlaps Arbitrary thresholds Minimum overlap length or percent identity Maximum repetitiveness tolerated in a read Sub-optimal algorithms Greedy algorithms (keep enlarging my contig without concern for global picture) Heuristic algorithms (use coverage ratios that seem to work well on real data) Save consensus for last Multiple pair-wise alignments is faster than one multiple sequence alignment Must choose a strategy for reads from genomic repeats Report 0 reads/repeat (leave gap or provide consensus with 0X coverage) Report 1 read/repeat (chosen randomly for even coverage) Report all reads/repeat (repeat reads are multiply placed)

18 Algorithms: Graphs aaccgg ccggtt Assembly algorithms construct graphs, which are nodes connected by edges. In traditional assemblers (left) nodes are reads, edges are overlaps. In de Bruijn assemblers (right) nodes are K-mers, edges are exact matches of K-1. aacc accg ccgg cggt ggtt Assembly algorithms reduce graph complexity. Boxes are reads or K-mers. Edges are overlaps or matches. Edge thickness can indicate amount of support in reads. Left: A spur is induced by bad sequence at a read end or low coverage and polymorphism. Middle: A bubble is induced by polymorphism or sequencing error. Right: A collapsed repeat might get teased apart or isolated or multiply placed.

19 Coverage: Impact on Assembly Sequence this genome by the random shotgun method. 1. Initially, all new reads make contigs. 2. Later, new reads fill gaps. Contigs enlarge and coalesce. 3. Eventually, new reads add little to nothing. Gaps remain due to chance or bias. 4. At very high coverage, error accumulates, assemblies fracture (not shown).

20 Lander-Waterman The Lander-Waterman equations use Poisson statistics to describe an unbiased and unpaired whole-genome shotgun experiment on a non-repetitive genome assuming {detected overlaps}={true overlaps}. Your results may vary! Average Contig Length Total Number of Contigs Read coverage: 0X 8X Read coverage: 0X 8X Parameters: read length = 800, minimum detected overlap = 40. Early reads nucleate contigs. Later reads are mostly redundant.

21 Read-Length Effects 1. Overlap Effect: For same number of sequenced bases, shorter reads require more coverage to achieve comparable N50. Assembly 1. 9 reads of length = 30bp. Total sequenced bases = 270. Assemble with min overlap = 20bp. Result = 7 contigs. Same inputs Different outputs Assembly 2. 3 read length = 90. Total sequenced bases = 270. Assemble with min overlap = 20bp. Result = 1 contig. 2. Repeat Effect: Shorter reads resolve fewer repeats. Repeat length = 600bp. Read length = 800bp (Sanger). Reads span the repeats. Repeat length = 600bp. Read length = 400bp (454). Reads bridge the repeats. Repeat length = 600bp. Read length = 75bp (Solexa). Repeats not resolved.??

22 Mate Pair Effects 1. Variety of insert sizes will span variety of repeats. Inserts that span the repeat will enable scaffolds. High coverage in mates will tile the repeat. Larger repeats require larger insert sizes. 2. Mates can resolve repeats even if not possible to tile with reads. Mates guide multiple placement of repeat contig.

23 Assembly Metrics Number of contigs, number of scaffolds Average number of contigs per scaffold Total bases in scaffolds Total span of scaffolds (including the gap lengths) Average gap length (for intra-scaffold gaps) Contig N50, Scaffold N50, Gap N50 Sum of bases in contigs>10kbp Sum of span in scaffolds>2kbp Min, max, and mean span of a scaffold Min, max, and mean bases in a contig Fraction of paired-end constraints satisfied in scaffolds Average read coverage in contigs

24 The N50 Metric Problem: How to represent a distribution of intervals by one number? Solution: N50 1. Measure the total span. 2. Sort the set by size. 3. Start walking from the largest interval. 4. Stop walking at the 50% mark. 5. Measure the interval on which you stopped. Start at largest. Intervals may represent contigs, scaffolds, gaps, alignments, etc. 0 10% 30% 50% 70% 90% This interval is your N50. Extra credit: What is N90? Why is N50 unstable?

25 Typical Assembly Output Scaffolds.fasta Each scaffold contains one or more contigs. Gap lengths may be represented by strings of N. Contigs.fasta These are the contigs from scaffolds, without the gaps. These may be filtered by a minimum size criterion. UnassembledReads.fasta Solo reads may be called singletons. Their mates may or not be assembled. UnassembledContigs.fasta These are contigs not placed in scaffolds. May be repeats, plasmids, contamination, short contigs.

26 Common Assembly Errors Chimera Assembly joins unrelated sequences, usually across a low-copy repeat Collapsed tandem repeat Copy number is reduced in the assembly Missed join Contigs not assembled despite sequence similarity Assembly Forensics* * Phillippy, Schatz, Pop (2007) Genome assembly forensics: finding the elusive mis-assembly. Genome Biology.

27 Consensus Problems How does 1 sequencing error propagate to the consensus? GGCGT-AGCA GGCGT-AGCA GGCGTTAGCA GGCG-TAGCA GGCG-TAGCA GGCGTTAGCA These alignments were valid. So were these. Each consensus base has at least 3 bases supporting it. Even sophisticated software can generate alignments that the human eye recognizes as wrong. The root cause is that multiple sequence alignment is solved by a series of pair-wise alignments. Celera Assembler reviews its final alignments to compress gap-rich regions. Celera Assembler fixes this particular problem, but it misses more complex ones.

28 Single Cell Assembly Challenges

29 Uneven Coverage A 120K subset of a scaffold Middle (180K-220K) has high coverage Is this high coverage a repeat (and an assembly mistake) Is it due to amplification bias (and a correct assembly) Flavobacteria bacterium MS024-2A assembly.! Scaffold generated by Celera Assembler, displayed with Hawkeye viewer.!

30 Uneven Coverage Aligning to reference High coverage region ( K) has good alignment No discrepancy with reference The high coverage is due to amplification bias Flavobacteria bacterium MS024-2A assembly.! Scaffold generated by Celera Assembler, displayed with Hawkeye viewer.! Aligned to reference using MUMmer.!

31 Coverage in scaffolds including gaps and surrogates. Bar at 0 is truncated; value is 152.9M. Count at each position reflected sum of spanning reads, regardless of alignment gaps. (Bonobo i7 Assembly, Bonobo Consortium)

32 Uneven Coverage (cont) Flavobacteria bacterium MS024-2A assembly.! Scaffold generated by Celera Assembler!

33 Uneven Coverage (cont) Flavobacteria bacterium MS024-2A assembly.! Scaffold generated by Celera Assembler!

34 Chimera 85% of chimeras correspond to a sequence inversion (A, C). 15% of chimeras resulted from the joining of two segments in direct orientation (B, D). The order of the two segments could be reversed. The first segment in the chimera (black arrows) can join to a segment that is downstream (A, B) or upstream (C, D).

35 Bioinformatics Challenges

36 Assembly Tools Bioinformatics tools are often not user-friendly They must be compiled from code and give cryptic error messages Require Unix knowledge/it support

37 Bioinformatics Availability A collection of bioinformatics tools Can be run on the Amazon EC2 Designed to skip the complexity of finding/ downloading/building tools

38 Preparing Inputs to Assembly

39 Hands On: Preparing Data

40 Assembling with Celera Assembler

41 Hands On: Assembling with CA

42 Celera Assembler Output Assembly.scf.fasta >scf138 CTTAATTTTGGAATGAGCAAATCAAACGGCTCTCTTTCTTCTCTCTTCCGACGATCATTTCTCTCCTTCA TCCTTTTTCTCTATTGAATTTATGGTTGTGAGACACTTGCCTATGCCGTTCAAAATTTTACATTAGAGTG GATCTAAACCAAATTGGACACTGTGTTATCTGGTTTTAGCTTGAAGTAATTTCTAACAAGAATGGTTGCT TCAAAAAGACAAATATAAAGTGAAATTGATTGCCTCATTCTCTTCACCCTCATTCAAACTAAAATTGAAG Assembly.asm CTTAAAGAAAGAAGTAAAGGAATCAAAAATGGCAGCCTGCACGTATGAGATGACGAATTCTAAAATTTCA {MDI AGCTCTGCGTTATATTTATTTTGAGTTATACGTTATCTTAATTAATTGGGTTGCTATAATTATAACTTAT ref:(20kb,1) TTCGCTCCACCTAGATTTTTAACGGCCAAAGTGAGTTTAGTTGGTGGTATTAAAATGACTTTTTTATTTA Assembly.posmap.frgscf mea: AAAATCGAAGGTTTAGTTTTTACTATGCAATTAGTGTACCAAAAGAAATTTTTATACCTATTTAAATCAA Assembly.qc FPQMBNN02SL55F f std: [Scaffolds] FPMLNGL01BTUZ f min:1539 TotalScaffolds=4127 FO77AEH01COVYY f max:31474 TotalContigsInScaffolds=8653 FPKRC2D02RHQ3C r buc:28 MeanContigsPerScaffold=2.10 FPDPMZT01DTV0D f his: MinContigsPerScaffold=1 FPBG22E02Q51Y r 8 MaxContigsPerScaffold=73 FPMX3IK01CS2XJ f 0 TotalBasesInScaffolds= FPMLNGL02QFBKX f 1 MeanBasesInScaffolds=48621 FO6AT3P02QPVK r 52 MinBasesInScaffolds=446 FO6AT3P02TCI7F r 69 MaxBasesInScaffolds= FO77AEH01A5XKF r 59 N25ScaffoldBases= FO77AEH01CKY2Z r 104 N50ScaffoldBases= FO77AEH01BD0T r 146 N75ScaffoldBases= FO76IZV02R25OI f 85 ScaffoldAt = FO76IZV02P6FZH f 66 ScaffoldAt =1027 FPDPEIT01BA86D r

43 Dealing with Coverage Bias Shaded node is a repeat Links together distinct regions of a genome Most assemblers start by disconnecting repetitive nodes

44 Coverage in CA Celera Assembler uses the A-Statistic P(read starting at base) = arrival rate = α = P(k starts in ρ bases) is Binomial We approximate via Poisson with giving us For a 2-copy repeat, the arrival rate is doubled, giving us A-Statistic = log P(unique)/P(2-copy) = When A-Statistic is negative, consider node a repeat Read coverage is not perfect Sequencing bias due to GC-content Borderline cases Incorrect coverage expectation Polyploid genomes Metagenomes

45 Fixing Repeat Detection: Toggling Celera Assembler pipeline: default reads overlap unitig scaffold scaffolds Celera Assembler pipeline: toggling reads overlap unitig scaffold Find large repeat unitigs placed once in scaffolds. Mark them unique. scaffold scaffolds

46 Oil Palm - Improvement in Assembly N50 Toggling (flagging repeats as trusted) Default Biggest Contigs Biggest Scaffolds Satisfied Paired-End Unsatisfied Paired-End

47 Hands on: How To Toggle dotoggle At the end of a successful assembly, search for placed surrogates and toggle them to be unique unitigs. Re-run the assembly starting from scaffolder Either 0 or 1, default is 0 Toggling Useful Options Output toggleunitiglength Minimum length for a surrogate to be toggled. Any number >= 1, default is 2000 togglenuminstances Number of instances for a surrogate to be toggled. If 0 is specified, all non-singleton unitigs are toggled to unique status. Any number >=0, default is 1. In future version of CA, setting it to 0 will toggle all non-singleton unitigs >= toggleunitiglength Pre-toggling results: asm/9-teminator Post-toggling results: asm/10-toggled/9-terminator Same contents as from a regular assembly

48 Incorporating Finishing Reads Goals of finishing Close sequencing gaps by generating new reads Get higher coverage in low-coverage sections Generate reads to resolve repeat instances Manual finishing is expensive Finishing reads generated from PCR product or clone Constrain possible assembly: between PCR product ends or clone ends Formulate as a constraint the scaffolder should satisfy

49 Finishing Reads in CA Given the set of bounded finishing reads F, and WGS reads W Bounded algorithm Incorporate unitigs into gaps Place reads in repeat instances

50 Finishing Reads Basic Assembly Bounded Assembly (a) (b) (c) Celera Assembler s two algorithms for assembling shotgun reads with finishing reads. The basic method treats both read types equally. The bounded algorithm attempts to assemble finishing reads consistently with their bounding constraints. For each algorithm, the figure shows its construction of a scaffold from contigs (rectangles) with 2X in shotgun reads (black lines). Each finishing read (colored line) has a corresponding pair of PCR primer sites (arrows of same color). External to the scaffold is a unitig (grey area) deemed repetitive due to high coverage. (a) A mate pair constraint (curve) localizes one read and the unitig to this gap. Nevertheless, the basic algorithm cannot tile this gap with reads. The bounded algorithm localizes two finishing reads by their primer sites. The bounded algorithm does tile the gap with reads, enabling a more accurate consensus sequence. (b) The basic cannot localize the unitig or any reads to this gap. It does not close the gap. The bounded algorithm localizes the unitig by finishing reads and their primer sites. It tiles the gap with finishing reads from the unitig. (c) Both algorithms assemble finishing reads from a gap that is not a genomic repeat.

51 Results * Koren, Miller, Walenz, Sutton (2010) An algorithm for automated closure during assembly. BMC Bioinformatics.

52 Specifying Finishing Reads Special record in FRG file Only supported in FRG v2 Example of a PLC: {PLC } act:a frg:x frg:y frg:z Finishing read ID Boundary #1 ID Boundary #2 ID Reads X, Y, and Z must be defined in an {FRG X, Y and Z can appear in an {LKG Y (and Z) can be either the left or right bound

53 Hands On: Finishing Reads in CA

54 Assembling with Newbler

55 Hands On: Assembling with Newbler

56 Hands On: Finishing Reads in Newbler

57 Newbler Output 454Scaffolds.fna >scaffold00001 length=48466 tttgtctaaagtaaatcttagcttaaggcagtgatatcgatggattagaaaggatttgga TTGGGGCATTGGTGTAAAAGAAAAACCCTCCGCTTGGGAGGGCTAAATGGCTTtATTTTG TTGAACCTATAaTTTCCCAAATTTCAATAGCTTTAAAACCGTTTGGCATCAAATCGCCCA TCTTGGTTTTATTCGCAATAGAAGCAGGtAATGGAAATTCTCCTGCTAGATACTTCATAA 454Contigs.ace CGTTAGTCATTGAGCCAAAAGCGTCCCAAGTCAGTATTGTTGATGAGTCCTCGTAATATG AS 389 GAGCAATTcTTTAAAACCCGTCCTCTCTCAGTGtATTTtCAATGTTACAATCTTAGAAAA TCACATAGACGCGAGTTTTTAGTTATGAGTACTATTCTATTGAATTTCTATATTCTGAAG 454NewblerMetrics.txt 454ReadStatus.txt CO contig00001 GGGTAACTGATTTtATTTTTTtAAatACTCTATTAAAATTTTCTTGGGAATTAAACCCTG pairedreadstatus U tttgtctaaagtaaatcttagcttaaggcagtgatatcgatggattagaa ATTTTTCTACAAGTGCCACAATTGTATATTTATCTAAAAATTTATTTTGGATTAAATCAC numberwithbothmapped = 2124; aggatttggattggggcattggtgtaaaagaaaaaccctccgcttgggag numberwithoneunmapped = 69; GGCTAAATGGCTTt*ATTTTGTTGAACCTATAa*TTTCCCAAATTTCAAT numbermultiplymapped = 1; AGCTTTAAAACCGTTTGGCATCAAATCGCCCATCTTGGTTTTATTCGCAA numberwithbothunmapped = 23; TAGAAGCAGGt*AATGG*AAATTCT*CCTGCTAGATACTTCATAACGTTA library { GT*CATTGAGCCAAAAGCGTCCCAAGTCAGTATTGTTGATGAGTCCTCGT libraryname = "sanger3kb_fios"; EM9Q5SV01BCVA3 Singleton AATATGGAGCAATT*c*TTTAAAACCCGTCCTCTCTCAGTGt*ATTTtCA pairdistanceavg = ; ATGTTACAATCTTAGAAAATCACATAGACGCG*AGTTTTTAGTTATGAGT pairdistancedev = ; ACTATTCTATTGAATTTCTATATTCTGAAGGGGTAACTGATTTtATTTTT } TtAAa*t*ACTCTATTAAAATTTTCTTGGGAATTAAACCCTG*ATTTTTC scaffoldmetrics { EM9Q5SV01CKSJH Singleton TACAAGTGCCACAATTGTATATTTAT*CTAAAAATTTATTTTGGATTAAA numberofscaffolds = 86; TCACAAGCGAAATCTATTTTCAAAGAATTGATGTAAAATGGTGTAGTGAC numberofbases = ; ATTTAACTTTTGTTTGATAGAATCACGTATATCATTTTGAGTTATTCCTG avgscaffoldsize = 18218; TATCTACTGACAAATTTACTATAGAATAATTATTTTGAAGGAATGCTTTT N50ScaffoldSize = ; GTTTCTTTAAGGTAATGGTTAAaGTTTtGAATTAATTCCTT largestscaffoldsize = ; EM9Q5SV01A4E99 Assembled contig contig EM9Q5SV01ECMLE Assembled contig contig EM9Q5SV01A835M Assembled contig contig EM9Q5SV01APR37 Assembled contig contig EM9Q5SV01CUR52 Assembled contig contig EM9Q5SV01CMCSJ Assembled contig contig EM9Q5SV01CCIN0 Assembled contig contig EM9Q5SV01BXHGV Assembled contig contig EM9Q5SV01B8QV8 Assembled contig contig EM9Q5SV01B5IA1 Assembled contig contig EM9Q5SV01CMFK1 Assembled contig contig EM9Q5SV01EORKA Assembled contig contig EM9Q5SV01A44JD Assembled contig contig EM9Q5SV01CS1HJ Assembled contig contig EM9Q5SV01A8ZAK Assembled contig contig EM9Q5SV01DGAKL Assembled contig contig EM9Q5SV01AVH7L PartiallyAssembled contig contig EM9Q5SV01AX7FC Assembled contig contig EM9Q5SV01EDHFV Assembled contig contig EM9Q5SV01ANK9N Assembled contig contig EM9Q5SV01AW150 Assembled contig contig EM9Q5SV01AZYCP Assembled contig contig EM9Q5SV01AWB3A Assembled contig contig EM9Q5SV01EABU1 Assembled contig contig EM9Q5SV01AVZ12 Assembled contig contig EM9Q5SV01EVB54 Assembled contig contig EM9Q5SV01CFUI9 Assembled contig contig EM9Q5SV01BH1OF Assembled contig contig EM9Q5SV01CI83R Assembled contig contig EM9Q5SV01CQC2E Assembled contig contig

58 Assembly Analysis

59 Assembly Alignments Forward strand Reverse strand Assembly Reference Alignment of a problematic bacterial assembly. Note double coverage at most reference positions. Further analysis revealed the presence of a second strain whose sequence similarity to reference was lower. Alignment by nucmer. Drawing by mummerplot. Both part of the mummer package (

60 Two Assemblies from 454 Cucumber 27M Titanium reads 3K, 8K, 20K inserts 199M in scaffolds 797K scaffold N50 87K contig N50 Read fates: 65% contig 31% degen 1% singleton Bonobo 250M Titanium reads 3K, 8K, 20K inserts 2.9G in scaffolds 9.6M scaff N50 67K contig N50 Read fates: 88% contig 7% degen 3% singleton Cucumber assembly by CA 5.3. Consortium led by 454 Life Sciences. Bonobo assembly by CA 5.4. Consortium led by Max Planck Institute.

61 Cucumber Reads 3Kbp library 20Kbp library Unpaired library * Read lengths equal the clear ranges reported by sffinfo. For mated reads, the read length includes both mates and the linker. The data was produced at 454 on the XLR Titanium.

62 Cucumber Mates Kbp library Count (log scale) Kbp library Separation (Kbp) * The mate separation was determined by Celera Assembler on data provided by 454.

63 Library Effect Large-insert libraries are overrepresented in degenerates. Large-insert reads were more often rejected for being too short or perfect prefix.

64 Cucumber Comparison Celera Assembler Newbler Contigs, total bases (Mbp) Largest contig (Kbp) Contig N50 (Kbp)* Contig mean (Kbp)* Scaffolds, total span (Mbp)* Largest scaffold span (Mbp) * Not apples-to-apples! N50 depends on assembler s genome size estimate. Mean depends on small-contig threshold. Total span depends on small-scaffold threshold.

65 Cucumber assembly Max N10 N10 N25 N25 N50 N50 N95 N95 Total Combined length count length count length count length count length count length Scaffolds 4,649, ,717, ,506, , ,997 3, ,566,885 Alignments 3,813, ,292, ,098, , ,463 3, ,183,263 Contigs 522, , , ,421 3,121 10,379 7, ,010,660 Alignments of CA assembly to the Newbler assembly approach the scaffold size, which is the maximum.

66 Cucumber Degens A degenerate is Two or more reads in a unitig (high-confidence contig) That could not be promoted to contig Negative A-stat due to short length or high coverage That could not be incorporated into scaffolds Would need 2 consistent mates at least Lots of small degenerates Number of degenerates= 665,311 Combined length= 332 Mbp Equivalent to 165% of scaffold bases! Largest= 21 Kbp. Average length= 499 bp Lots of reads in degenerates Reads in degenerate= 40% of all reads Reads in degenerate and so is mate= 2.9% of reads Non-degen reads with degen mate= 0.3% of reads

67 One scaffold, aligned to itself. Select scaffold with most degen alignments. Scaffold : 7 contigs (5 shown). Length= 96Kbp (first 50K shown). Alignments by nucmer (mummer) reveal no intra-scaffold repeats. Red: forward strand. Green: approximate contig layout.

68 Degenerates aligned to the scaffold. Select degens with fewer than 2 mated reads. Of those, select degens whose reads have pair-wise overlaps to reads in this scaffold. There are 646 such degens, more than for any other scaffold. Align with nucmer, minimum 95% identity. Some degens have multiple alignments. Red: forward strand. Blue: reverse strand. Green: approximate contig layout.

69 Cucumber Improvement i1 i2 i3 i4 Contigs, total bases (Mbp) Largest contig (Kbp) Contig N50 (Kbp) Contig mean (Kbp) Scaffolds, total span (Mbp) Largest scaffold span (Mbp) Degen bases (vs scaffolds) 165% 165% 132% 88% Degen reads 40% 40% 38% 31%

70 Illumina Assemblies The NAT experiment at JCVI Bacterial DNA from E. coli and Y. pestis Test Illumina 100bp PE at various coverage Test addition of 454 fragment or PE Test all assemblers

71 Cumulative Contig Length (bp)

72 Results: Performance Metrics For E. coli Assemblies 1600 %CPU Used x trim 50x trim 100x trim Readset 150x trim CLC illumina Velvet illumina SOAP illumina MIRA illumina Virtual Memory (kb) Log scale EULER illumina x trim 50x trim 100x trim Readset 150x trim CLC illumina Velvet illumina SOAP illumina MIRA illumina EULER illumina RAM (kb) log scale CLC illumina Velvet illumina SOAP illumina MIRA illumina EULER illumina x trim 50x trim 100x trim 150x trim

73 Hands On: Assembly Comparison

74 Assembly Viewers

75 Assembly Error? Green: mate pairs satisfied by the assembly.! Coverage discontinuity.! Prochlorococcus from GOS III metagenomic assembly.! Scaffold generated by Celera Assembler, displayed with Hawkeye viewer.! Investigation by Doug Rusch, JCVI.!

76 Another Viewpoint 100% identity! Color: collection site.! 50% identity! Recruitment discontinuity! corresponds to 16s-23S rrna operon! Prochlorococcus fragment recruitment to scaffold.! GOS reads aligned to GOS III assembly, displayed with JCVI Fragment Recruitment Viewer.! Investigation by Doug Rusch, JCVI.!

77 Chimera Detected By Viewers

78 Introduction to Viewers: Consed

79 Introduction to Viewers: Hawkeye

80 Hands On: Viewing Using Hawkeye

81 Introduction to Viewers: Gap4

82 Introduction to Viewers: Gap5

83 Hands On: Viewing Using Gap4/5

84 Questions

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es Sequencing project Unknown sequence { experimental evidence result read 1 read 4 read 2 read 5 read 3 read 6 read 7 Computational requirements

More information

De Novo Assembly of High-throughput Short Read Sequences

De Novo Assembly of High-throughput Short Read Sequences De Novo Assembly of High-throughput Short Read Sequences Chuming Chen Center for Bioinformatics and Computational Biology (CBCB) University of Delaware NECC Third Skate Genome Annotation Workshop May 23,

More information

Alignment and Assembly

Alignment and Assembly Alignment and Assembly Genome assembly refers to the process of taking a large number of short DNA sequences and putting them back together to create a representation of the original chromosomes from which

More information

AMOS Assembly Validation and Visualization

AMOS Assembly Validation and Visualization AMOS Assembly Validation and Visualization Michael Schatz Center for Bioinformatics and Computational Biology University of Maryland August 13, 2006 University of Hawaii Outline AMOS Validation Pipeline

More information

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club De novo assembly of human genomes with massively parallel short read sequencing Mikk Eelmets Journal Club 06.04.2010 Problem DNA sequencing technologies: Sanger sequencing (500-1000 bp) Next-generation

More information

Mate-pair library data improves genome assembly

Mate-pair library data improves genome assembly De Novo Sequencing on the Ion Torrent PGM APPLICATION NOTE Mate-pair library data improves genome assembly Highly accurate PGM data allows for de Novo Sequencing and Assembly For a draft assembly, generate

More information

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Sequence Assembly and Alignment. Jim Noonan Department of Genetics Sequence Assembly and Alignment Jim Noonan Department of Genetics james.noonan@yale.edu www.yale.edu/noonanlab The assembly problem >>10 9 sequencing reads 36 bp - 1 kb 3 Gb Outline Basic concepts in genome

More information

Next Generation Sequencing Lecture Saarbrücken, 19. March Sequencing Platforms

Next Generation Sequencing Lecture Saarbrücken, 19. March Sequencing Platforms Next Generation Sequencing Lecture Saarbrücken, 19. March 2012 Sequencing Platforms Contents Introduction Sequencing Workflow Platforms Roche 454 ABI SOLiD Illumina Genome Anlayzer / HiSeq Problems Quality

More information

Matthew Tinning Australian Genome Research Facility. July 2012

Matthew Tinning Australian Genome Research Facility. July 2012 Next-Generation Sequencing: an overview of technologies and applications Matthew Tinning Australian Genome Research Facility July 2012 History of Sequencing Where have we been? 1869 Discovery of DNA 1909

More information

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014 Introduction to metagenome assembly Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014 Sequencing specs* Method Read length Accuracy Million reads Time Cost per M 454

More information

DNA Sequencing and Assembly

DNA Sequencing and Assembly DNA Sequencing and Assembly CS 262 Lecture Notes, Winter 2016 February 2nd, 2016 Scribe: Mark Berger Abstract In this lecture, we survey a variety of different sequencing technologies, including their

More information

Contact us for more information and a quotation

Contact us for more information and a quotation GenePool Information Sheet #1 Installed Sequencing Technologies in the GenePool The GenePool offers sequencing service on three platforms: Sanger (dideoxy) sequencing on ABI 3730 instruments Illumina SOLEXA

More information

10/20/2009 Comp 590/Comp Fall

10/20/2009 Comp 590/Comp Fall Lecture 14: DNA Sequencing Study Chapter 8.9 10/20/2009 Comp 590/Comp 790-90 Fall 2009 1 DNA Sequencing Shear DNA into millions of small fragments Read 500 700 nucleotides at a time from the small fragments

More information

Aaron Liston, Oregon State University Botany 2012 Intro to Next Generation Sequencing Workshop

Aaron Liston, Oregon State University Botany 2012 Intro to Next Generation Sequencing Workshop Output (bp) Aaron Liston, Oregon State University Growth in Next-Gen Sequencing Capacity 3.5E+11 2002 2004 2006 2008 2010 3.0E+11 2.5E+11 2.0E+11 1.5E+11 1.0E+11 Adapted from Mardis, 2011, Nature 5.0E+10

More information

CSE182-L16. LW statistics/assembly

CSE182-L16. LW statistics/assembly CSE182-L16 LW statistics/assembly Silly Quiz Who are these people, and what is the occasion? Genome Sequencing and Assembly Sequencing A break at T is shown here. Measuring the lengths using electrophoresis

More information

Outline. The types of Illumina data Methods of assembly Repeats Selecting k-mer size Assembly Tools Assembly Diagnostics Assembly Polishing

Outline. The types of Illumina data Methods of assembly Repeats Selecting k-mer size Assembly Tools Assembly Diagnostics Assembly Polishing Illumina Assembly 1 Outline The types of Illumina data Methods of assembly Repeats Selecting k-mer size Assembly Tools Assembly Diagnostics Assembly Polishing 2 Illumina Sequencing Paired end Illumina

More information

Next generation sequencing techniques" Toma Tebaldi Centre for Integrative Biology University of Trento

Next generation sequencing techniques Toma Tebaldi Centre for Integrative Biology University of Trento Next generation sequencing techniques" Toma Tebaldi Centre for Integrative Biology University of Trento Mattarello September 28, 2009 Sequencing Fundamental task in modern biology read the information

More information

Lecture 14: DNA Sequencing

Lecture 14: DNA Sequencing Lecture 14: DNA Sequencing Study Chapter 8.9 10/17/2013 COMP 465 Fall 2013 1 Shear DNA into millions of small fragments Read 500 700 nucleotides at a time from the small fragments (Sanger method) DNA Sequencing

More information

BIOINFORMATICS ORIGINAL PAPER

BIOINFORMATICS ORIGINAL PAPER BIOINFORMATICS ORIGINAL PAPER Vol. 27 no. 21 2011, pages 2957 2963 doi:10.1093/bioinformatics/btr507 Genome analysis Advance Access publication September 7, 2011 : fast length adjustment of short reads

More information

Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park

Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park Next Generation Sequences & Chloroplast Assembly 8 June, 2012 Jongsun Park Table of Contents 1 History of Sequencing Technologies 2 Genome Assembly Processes With NGS Sequences 3 How to Assembly Chloroplast

More information

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015 Genome Assembly J Fass UCD Genome Center Bioinformatics Core Friday September, 2015 From reads to molecules What s the Problem? How to get the best assemblies for the smallest expense (sequencing) and

More information

De novo genome assembly with next generation sequencing data!! "

De novo genome assembly with next generation sequencing data!! De novo genome assembly with next generation sequencing data!! " Jianbin Wang" HMGP 7620 (CPBS 7620, and BMGN 7620)" Genomics lectures" 2/7/12" Outline" The need for de novo genome assembly! The nature

More information

We begin with a high-level overview of sequencing. There are three stages in this process.

We begin with a high-level overview of sequencing. There are three stages in this process. Lecture 11 Sequence Assembly February 10, 1998 Lecturer: Phil Green Notes: Kavita Garg 11.1. Introduction This is the first of two lectures by Phil Green on Sequence Assembly. Yeast and some of the bacterial

More information

NEXT GENERATION SEQUENCING. Farhat Habib

NEXT GENERATION SEQUENCING. Farhat Habib NEXT GENERATION SEQUENCING HISTORY HISTORY Sanger Dominant for last ~30 years 1000bp longest read Based on primers so not good for repetitive or SNPs sites HISTORY Sanger Dominant for last ~30 years 1000bp

More information

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme Illumina (Solexa) Current market leader Based on sequencing by synthesis Current read length 100-150bp Paired-end easy, longer matepairs harder Error ~0.1% Mismatch errors dominate Throughput: 4 Tbp in

More information

Functional Genomics Research Stream. Research Meetings: November 2 & 3, 2009 Next Generation Sequencing

Functional Genomics Research Stream. Research Meetings: November 2 & 3, 2009 Next Generation Sequencing Functional Genomics Research Stream Research Meetings: November 2 & 3, 2009 Next Generation Sequencing Current Issues Research Meetings: Meet with me this Thursday or Friday. (bring laboratory notebook

More information

Finishing of Fosmid 1042D14. Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae

Finishing of Fosmid 1042D14. Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae Schefkind 1 Adam Schefkind Bio 434W 03/08/2014 Finishing of Fosmid 1042D14 Abstract Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae genomic DNA. Through a comprehensive analysis of forward-

More information

From Infection to Genbank

From Infection to Genbank From Infection to Genbank How a pathogenic bacterium gets its genome to NCBI Torsten Seemann VLSCI - Life Sciences Computation Centre - Genomics Theme - Lab Meeting - Friday 27 April 2012 The steps 1.

More information

The Journey of DNA Sequencing. Chromosomes. What is a genome? Genome size. H. Sunny Sun

The Journey of DNA Sequencing. Chromosomes. What is a genome? Genome size. H. Sunny Sun The Journey of DNA Sequencing H. Sunny Sun What is a genome? Genome is the total genetic complement of a living organism. The nuclear genome comprises approximately 3.2 * 10 9 nucleotides of DNA, divided

More information

A Guide to Consed Michelle Itano, Carolyn Cain, Tien Chusak, Justin Richner, and SCR Elgin.

A Guide to Consed Michelle Itano, Carolyn Cain, Tien Chusak, Justin Richner, and SCR Elgin. 1 A Guide to Consed Michelle Itano, Carolyn Cain, Tien Chusak, Justin Richner, and SCR Elgin. Main Window Figure 1. The Main Window is the starting point when Consed is opened. From here, you can access

More information

Next Generation Sequencing. Tobias Österlund

Next Generation Sequencing. Tobias Österlund Next Generation Sequencing Tobias Österlund tobiaso@chalmers.se NGS part of the course Week 4 Friday 13/2 15.15-17.00 NGS lecture 1: Introduction to NGS, alignment, assembly Week 6 Thursday 26/2 08.00-09.45

More information

Next Generation Sequencing. Jeroen Van Houdt - Leuven 13/10/2017

Next Generation Sequencing. Jeroen Van Houdt - Leuven 13/10/2017 Next Generation Sequencing Jeroen Van Houdt - Leuven 13/10/2017 Landmarks in DNA sequencing 1953 Discovery of DNA double helix structure 1977 A Maxam and W Gilbert "DNA seq by chemical degradation" F Sanger"DNA

More information

Next-Generation Sequencing. Technologies

Next-Generation Sequencing. Technologies Next-Generation Next-Generation Sequencing Technologies Sequencing Technologies Nicholas E. Navin, Ph.D. MD Anderson Cancer Center Dept. Genetics Dept. Bioinformatics Introduction to Bioinformatics GS011062

More information

Third Generation Sequencing

Third Generation Sequencing Third Generation Sequencing By Mohammad Hasan Samiee Aref Medical Genetics Laboratory of Dr. Zeinali History of DNA sequencing 1953 : Discovery of DNA structure by Watson and Crick 1973 : First sequence

More information

Improving Genome Assemblies without Sequencing

Improving Genome Assemblies without Sequencing Improving Genome Assemblies without Sequencing Michael Schatz April 25, 2005 TIGR Bioinformatics Seminar Assembly Pipeline Overview 1. Sequence shotgun reads 2. Call Bases 3. Trim Reads 4. Assemble phred/tracetuner/kb

More information

Assembly and Validation of Large Genomes from Short Reads Michael Schatz. March 16, 2011 Genome Assembly Workshop / Genome 10k

Assembly and Validation of Large Genomes from Short Reads Michael Schatz. March 16, 2011 Genome Assembly Workshop / Genome 10k Assembly and Validation of Large Genomes from Short Reads Michael Schatz March 16, 2011 Genome Assembly Workshop / Genome 10k A Brief Aside 4.7GB / disc ~20 discs / 1G Genome X 10,000 Genomes = 1PB Data

More information

Assembling a Cassava Transcriptome using Galaxy on a High Performance Computing Cluster

Assembling a Cassava Transcriptome using Galaxy on a High Performance Computing Cluster Assembling a Cassava Transcriptome using Galaxy on a High Performance Computing Cluster Aobakwe Matshidiso Supervisor: Prof Chrissie Rey Co-Supervisor: Prof Scott Hazelhurst Next Generation Sequencing

More information

Overview of Next Generation Sequencing technologies. Céline Keime

Overview of Next Generation Sequencing technologies. Céline Keime Overview of Next Generation Sequencing technologies Céline Keime keime@igbmc.fr Next Generation Sequencing < Second generation sequencing < General principle < Sequencing by synthesis - Illumina < Sequencing

More information

Biol 478/595 Intro to Bioinformatics

Biol 478/595 Intro to Bioinformatics Biol 478/595 Intro to Bioinformatics September M 1 Labor Day 4 W 3 MG Database Searching Ch. 6 5 F 5 MG Database Searching Hw1 6 M 8 MG Scoring Matrices Ch 3 and Ch 4 7 W 10 MG Pairwise Alignment 8 F 12

More information

2nd (Next) Generation Sequencing 2/2/2018

2nd (Next) Generation Sequencing 2/2/2018 2nd (Next) Generation Sequencing 2/2/2018 Why do we want to sequence a genome? - To see the sequence (assembly) To validate an experiment (insert or knockout) To compare to another genome and find variations

More information

How is genome sequencing done?

How is genome sequencing done? Click here to view Roche 454 Sequencing Genome Sequence FLX available at www.ssllc.com>> How is genome sequencing done? Using 454 Sequencing on the Genome Sequencer FLX System, DNA from a genome is converted

More information

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter A shotgun introduction to sequence assembly (with Velvet) MCB 247 - Brem, Eisen and Pachter Hot off the press January 27, 2009 06:00 AM Eastern Time llumina Launches Suite of Next-Generation Sequencing

More information

ChIP-seq and RNA-seq

ChIP-seq and RNA-seq ChIP-seq and RNA-seq Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions (ChIPchromatin immunoprecipitation)

More information

Introduction to Next Generation Sequencing

Introduction to Next Generation Sequencing The Sequencing Revolution Introduction to Next Generation Sequencing Dena Leshkowitz,WIS 1 st BIOmics Workshop High throughput Short Read Sequencing Technologies Highly parallel reactions (millions to

More information

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Monday September 15, 2014

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Monday September 15, 2014 High Throughput Sequencing Technologies J Fass UCD Genome Center Bioinformatics Core Monday September 15, 2014 Sequencing Explosion www.genome.gov/sequencingcosts http://t.co/ka5cvghdqo Sequencing Explosion

More information

CSCI2950-C DNA Sequencing and Fragment Assembly

CSCI2950-C DNA Sequencing and Fragment Assembly CSCI2950-C DNA Sequencing and Fragment Assembly Lecture 2: Sept. 7, 2010 http://cs.brown.edu/courses/csci2950-c/ DNA sequencing How we obtain the sequence of nucleotides of a species 5 3 ACGTGACTGAGGACCGTG

More information

Genome Sequencing. I: Methods. MMG 835, SPRING 2016 Eukaryotic Molecular Genetics. George I. Mias

Genome Sequencing. I: Methods. MMG 835, SPRING 2016 Eukaryotic Molecular Genetics. George I. Mias Genome Sequencing I: Methods MMG 835, SPRING 2016 Eukaryotic Molecular Genetics George I. Mias Department of Biochemistry and Molecular Biology gmias@msu.edu Sequencing Methods Cost of Sequencing Wetterstrand

More information

ChIP-seq and RNA-seq. Farhat Habib

ChIP-seq and RNA-seq. Farhat Habib ChIP-seq and RNA-seq Farhat Habib fhabib@iiserpune.ac.in Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions

More information

De novo meta-assembly of ultra-deep sequencing data

De novo meta-assembly of ultra-deep sequencing data De novo meta-assembly of ultra-deep sequencing data Hamid Mirebrahim 1, Timothy J. Close 2 and Stefano Lonardi 1 1 Department of Computer Science and Engineering 2 Department of Botany and Plant Sciences

More information

Lecture 7. Next-generation sequencing technologies

Lecture 7. Next-generation sequencing technologies Lecture 7 Next-generation sequencing technologies Next-generation sequencing technologies General principles of short-read NGS Construct a library of fragments Generate clonal template populations Massively

More information

De Novo and Hybrid Assembly

De Novo and Hybrid Assembly On the PacBio RS Introduction The PacBio RS utilizes SMRT technology to generate both Continuous Long Read ( CLR ) and Circular Consensus Read ( CCS ) data. In this document, we describe sequencing the

More information

BIOINFORMATICS 1 SEQUENCING TECHNOLOGY. DNA story. DNA story. Sequencing: infancy. Sequencing: beginnings 26/10/16. bioinformatic challenges

BIOINFORMATICS 1 SEQUENCING TECHNOLOGY. DNA story. DNA story. Sequencing: infancy. Sequencing: beginnings 26/10/16. bioinformatic challenges BIOINFORMATICS 1 or why biologists need computers SEQUENCING TECHNOLOGY bioinformatic challenges http://www.bioinformatics.uni-muenster.de/teaching/courses-2012/bioinf1/index.hbi Prof. Dr. Wojciech Makałowski"

More information

De novo whole genome assembly

De novo whole genome assembly De novo whole genome assembly Lecture 1 Qi Sun Minghui Wang Bioinformatics Facility Cornell University DNA Sequencing Platforms Illumina sequencing (100 to 300 bp reads) Overlapping reads ~180bp fragment

More information

Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis

Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis 1 Genetic Analysis Phenotype analysis: biological-biochemical analysis Behaviour under specific environmental conditions Behaviour of specific genetic configurations Behaviour of progeny in crosses - Genotype

More information

Chapter 7. DNA Microarrays

Chapter 7. DNA Microarrays Bioinformatics III Structural Bioinformatics and Genome Analysis Chapter 7. DNA Microarrays 7.9 Next Generation Sequencing 454 Sequencing Solexa Illumina Solid TM System Sequencing Process of determining

More information

Next-generation sequencing and quality control: An introduction 2016

Next-generation sequencing and quality control: An introduction 2016 Next-generation sequencing and quality control: An introduction 2016 s.schmeier@massey.ac.nz http://sschmeier.com/bioinf-workshop/ Overview Typical workflow of a genomics experiment Genome versus transcriptome

More information

Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis

Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis 1 Genetic Analysis Phenotype analysis: biological-biochemical analysis Behaviour under specific environmental conditions Behaviour of specific genetic configurations Behaviour of progeny in crosses - Genotype

More information

Genome Sequencing-- Strategies

Genome Sequencing-- Strategies Genome Sequencing-- Strategies Bio 4342 Spring 04 What is a genome? A genome can be defined as the entire DNA content of each nucleated cell in an organism Each organism has one or more chromosomes that

More information

Course summary. Today. PCR Polymerase chain reaction. Obtaining molecular data. Sequencing. DNA sequencing. Genome Projects.

Course summary. Today. PCR Polymerase chain reaction. Obtaining molecular data. Sequencing. DNA sequencing. Genome Projects. Goals Organization Labs Project Reading Course summary DNA sequencing. Genome Projects. Today New DNA sequencing technologies. Obtaining molecular data PCR Typically used in empirical molecular evolution

More information

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Ruth Howe Bio 434W 27 February 2010 Abstract The fourth or dot chromosome of Drosophila species is composed primarily of highly condensed,

More information

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es Sequencing technologies Jose Blanca COMAV institute bioinf.comav.upv.es Outline Sequencing technologies: Sanger 2nd generation sequencing: 3er generation sequencing: 454 Illumina SOLiD Ion Torrent PacBio

More information

Supplementary Figure 1 Schematic view of phasing approach. A sequence-based schematic view of the serial compartmentalization approach.

Supplementary Figure 1 Schematic view of phasing approach. A sequence-based schematic view of the serial compartmentalization approach. Supplementary Figure 1 Schematic view of phasing approach. A sequence-based schematic view of the serial compartmentalization approach. First, barcoded primer sequences are attached to the bead surface

More information

SMRT Analysis Barcoding Overview (v6.0.0)

SMRT Analysis Barcoding Overview (v6.0.0) SMRT Analysis Barcoding Overview (v6.0.0) Introduction This document applies to PacBio RS II and Sequel Systems using SMRT Link v6.0.0. Note: For information on earlier versions of SMRT Link, see the document

More information

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Monday June 16, 2014

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Monday June 16, 2014 High Throughput Sequencing Technologies J Fass UCD Genome Center Bioinformatics Core Monday June 16, 2014 Sequencing Explosion www.genome.gov/sequencingcosts http://t.co/ka5cvghdqo Sequencing Explosion

More information

PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach

PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach Title for Extending Contigs Using SVM and Look Ahead Approach Author(s) Zhu, X; Leung, HCM; Chin, FYL; Yiu, SM; Quan, G; Liu, B; Wang, Y Citation PLoS ONE, 2014, v. 9 n. 12, article no. e114253 Issued

More information

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis Data Basics Josef K Vogt Slides by: Simon Rasmussen 2017 Generalized NGS analysis Sample prep & Sequencing Data size Main data reductive steps SNPs, genes, regions Application Assembly: Compare Raw Pre-

More information

ABSTRACT. Genome Assembly:

ABSTRACT. Genome Assembly: ABSTRACT Title of dissertation: Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms Sergey Koren, Doctor of Philosophy, 2012 Dissertation directed by:

More information

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es Sequencing technologies Jose Blanca COMAV institute bioinf.comav.upv.es Outline Sequencing technologies: Sanger 2nd generation sequencing: 3er generation sequencing: 454 Illumina SOLiD Ion Torrent PacBio

More information

High Throughput Sequencing Technologies. UCD Genome Center Bioinformatics Core Monday 15 June 2015

High Throughput Sequencing Technologies. UCD Genome Center Bioinformatics Core Monday 15 June 2015 High Throughput Sequencing Technologies UCD Genome Center Bioinformatics Core Monday 15 June 2015 Sequencing Explosion www.genome.gov/sequencingcosts http://t.co/ka5cvghdqo Sequencing Explosion 2011 PacBio

More information

CloG: a pipeline for closing gaps in a draft assembly using short reads

CloG: a pipeline for closing gaps in a draft assembly using short reads CloG: a pipeline for closing gaps in a draft assembly using short reads Xing Yang, Daniel Medvin, Giri Narasimhan Bioinformatics Research Group (BioRG) School of Computing and Information Sciences Miami,

More information

CBC Data Therapy. Metagenomics Discussion

CBC Data Therapy. Metagenomics Discussion CBC Data Therapy Metagenomics Discussion General Workflow Microbial sample Generate Metaomic data Process data (QC, etc.) Analysis Marker Genes Extract DNA Amplify with targeted primers Filter errors,

More information

Supplementary Data for Hybrid error correction and de novo assembly of single-molecule sequencing reads

Supplementary Data for Hybrid error correction and de novo assembly of single-molecule sequencing reads Supplementary Data for Hybrid error correction and de novo assembly of single-molecule sequencing reads Online Resources Pre&compiledsourcecodeanddatasetsusedforthispublication: http://www.cbcb.umd.edu/software/pbcr

More information

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR) tru TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR) Anton Bankevich Center for Algorithmic Biotechnology, SPbSU Sequencing costs 1. Sequencing costs do not follow Moore s law

More information

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es Sequencing technologies Jose Blanca COMAV institute bioinf.comav.upv.es Outline Sequencing technologies: Sanger 2nd generation sequencing: 3er generation sequencing: 454 Illumina SOLiD Ion Torrent PacBio

More information

Introduction to Bioinformatics. Genome sequencing & assembly

Introduction to Bioinformatics. Genome sequencing & assembly Introduction to Bioinformatics Genome sequencing & assembly Genome sequencing & assembly p DNA sequencing How do we obtain DNA sequence information from organisms? p Genome assembly What is needed to put

More information

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis -Seq Analysis Quality Control checks Reproducibility Reliability -seq vs Microarray Higher sensitivity and dynamic range Lower technical variation Available for all species Novel transcript identification

More information

Parts of a standard FastQC report

Parts of a standard FastQC report FastQC FastQC, written by Simon Andrews of Babraham Bioinformatics, is a very popular tool used to provide an overview of basic quality control metrics for raw next generation sequencing data. There are

More information

Greene 1. Finishing of DEUG The entire genome of Drosophila eugracilis has recently been sequenced using Roche

Greene 1. Finishing of DEUG The entire genome of Drosophila eugracilis has recently been sequenced using Roche Greene 1 Harley Greene Bio434W Elgin Finishing of DEUG4927002 Abstract The entire genome of Drosophila eugracilis has recently been sequenced using Roche 454 pyrosequencing and Illumina paired-end reads

More information

Introduction to Next Generation Sequencing (NGS)

Introduction to Next Generation Sequencing (NGS) Introduction to Next eneration Sequencing (NS) Simon Rasmussen Assistant Professor enter for Biological Sequence analysis Technical University of Denmark 2012 Today 9.00-9.45: Introduction to NS, How it

More information

Assemblytics: a web analytics tool for the detection of assembly-based variants Maria Nattestad and Michael C. Schatz

Assemblytics: a web analytics tool for the detection of assembly-based variants Maria Nattestad and Michael C. Schatz Assemblytics: a web analytics tool for the detection of assembly-based variants Maria Nattestad and Michael C. Schatz Table of Contents Supplementary Note 1: Unique Anchor Filtering Supplementary Figure

More information

De novo whole genome assembly

De novo whole genome assembly De novo whole genome assembly Qi Sun Bioinformatics Facility Cornell University Sequencing platforms Short reads: o Illumina (150 bp, up to 300 bp) Long reads (>10kb): o PacBio SMRT; o Oxford Nanopore

More information

High throughput sequencing technologies

High throughput sequencing technologies High throughput sequencing technologies and NGS applications Mei-yeh Lu 呂美曄 High Throughput Sequencing Core Manager g g p q g g Academia Sinica 6/30/2011 Outlines Evolution of sequencing technologies Sanger

More information

Next-generation sequencing technologies

Next-generation sequencing technologies Next-generation sequencing technologies NGS applications Illumina sequencing workflow Overview Sequencing by ligation Short-read NGS Sequencing by synthesis Illumina NGS Single-molecule approach Long-read

More information

Next Generation Sequencing (NGS)

Next Generation Sequencing (NGS) Next Generation Sequencing (NGS) Fernando Alvarez Sección Biomatemática, Facultad de Ciencias, UdelaR 1 Uruguay Montevide o 3 TANGO World Champ 1930 1950 (Maraca 4 Next Generation Sequencing module Next

More information

Welcome to the NGS webinar series

Welcome to the NGS webinar series Welcome to the NGS webinar series Webinar 1 NGS: Introduction to technology, and applications NGS Technology Webinar 2 Targeted NGS for Cancer Research NGS in cancer Webinar 3 NGS: Data analysis for genetic

More information

Sequencing techniques

Sequencing techniques Sequencing techniques Workshop on Whole Genome Sequencing and Analysis, 2-4 Oct. 2017 Learning objective: After this lecture, you should be able to account for different techniques for whole genome sequencing

More information

Repetitive DNA sequence assembly

Repetitive DNA sequence assembly Repetitive DNA sequence assembly by Yongqing Jiang Bachelor of IT (Honours) Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy Deakin University November, 2017 Acknowledgements

More information

De novo whole genome assembly

De novo whole genome assembly De novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell University Data generation Sequencing Platforms Short reads: Illumina Long reads: PacBio; Oxford Nanopore Contiging/Scaffolding

More information

Genome Assembly With Next Generation Sequencers

Genome Assembly With Next Generation Sequencers Genome Assembly With Next Generation Sequencers Personal Genomics Institute 3 May, 2011 Jongsun Park Table of Contents 1 Central Dogma and Omics Studies 2 History of Sequencing Technologies 3 Genome Assembly

More information

Transcriptome analysis

Transcriptome analysis Statistical Bioinformatics: Transcriptome analysis Stefan Seemann seemann@rth.dk University of Copenhagen April 11th 2018 Outline: a) How to assess the quality of sequencing reads? b) How to normalize

More information

CISC 889 Bioinformatics (Spring 2004) Lecture 3

CISC 889 Bioinformatics (Spring 2004) Lecture 3 CISC 889 Bioinformatics (Spring 004) Lecture Genome Sequencing Li Liao Computer and Information Sciences University of Delaware Administrative Have you visited The NCBI website? Have you read Hunter s

More information

de novo paired-end short reads assembly

de novo paired-end short reads assembly 1/54 de novo paired-end short reads assembly Rayan Chikhi ENS Cachan Brittany Symbiose, Irisa, France 2/54 THESIS FOCUS Graph theory for assembly models Indexing large sequencing datasets Practical implementation

More information

Next-generation sequencing Technology Overview

Next-generation sequencing Technology Overview Next-generation sequencing Technology Overview UQ Winter School 2018 Christopher Noune, PhD AGRF Melbourne christopher.noune@agrf.org.au What is NGS? Ion Torrent PGM (Thermo-Fisher) MiSeq (Illumina) High-Throughput

More information

Finished (Almost) Sequence of Drosophila littoralis Chromosome 4 Fosmid Clone XAAA73. Seth Bloom Biology 4342 March 7, 2004

Finished (Almost) Sequence of Drosophila littoralis Chromosome 4 Fosmid Clone XAAA73. Seth Bloom Biology 4342 March 7, 2004 Finished (Almost) Sequence of Drosophila littoralis Chromosome 4 Fosmid Clone XAAA73 Seth Bloom Biology 4342 March 7, 2004 Summary: I successfully sequenced Drosophila littoralis fosmid clone XAAA73. The

More information

Incorporating Molecular ID Technology. Accel-NGS 2S MID Indexing Kits

Incorporating Molecular ID Technology. Accel-NGS 2S MID Indexing Kits Incorporating Molecular ID Technology Accel-NGS 2S MID Indexing Kits Molecular Identifiers (MIDs) MIDs are indices used to label unique library molecules MIDs can assess duplicate molecules in sequencing

More information

Unfortunately, plate data was not available to generate an initial low coverage

Unfortunately, plate data was not available to generate an initial low coverage Finishing Report Carolyn Cain Fosmid: XBAA-30G19 Fosmid XBAA-30G19 was challenging to assemble, but ultimately came together in a satisfactory final assembly. The initial problems in my assembly were gaps,

More information

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Tuesday December 16, 2014

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Tuesday December 16, 2014 High Throughput Sequencing Technologies J Fass UCD Genome Center Bioinformatics Core Tuesday December 16, 2014 Sequencing Explosion www.genome.gov/sequencingcosts http://t.co/ka5cvghdqo Sequencing Explosion

More information

Next- gen sequencing. STAMPS 2015 Hilary G. Morrison Joe Vineis, Nora Downey, Be>e Hecox- Lea, Kim Finnegan

Next- gen sequencing. STAMPS 2015 Hilary G. Morrison Joe Vineis, Nora Downey, Be>e Hecox- Lea, Kim Finnegan Next- gen sequencing STAMPS 2015 Hilary G. Morrison Joe Vineis, Nora Downey, Be>e Hecox- Lea, Kim Finnegan QuesIons What is the difference between standard and next- gen sequencing? How is next- gen sequencing

More information

The goal of this project was to prepare the DEUG contig which covers the

The goal of this project was to prepare the DEUG contig which covers the Prakash 1 Jaya Prakash Dr. Elgin, Dr. Shaffer Biology 434W 10 February 2017 Finishing of DEUG4927010 Abstract The goal of this project was to prepare the DEUG4927010 contig which covers the terminal 99,279

More information

NOW GENERATION SEQUENCING. Monday, December 5, 11

NOW GENERATION SEQUENCING. Monday, December 5, 11 NOW GENERATION SEQUENCING 1 SEQUENCING TIMELINE 1953: Structure of DNA 1975: Sanger method for sequencing 1985: Human Genome Sequencing Project begins 1990s: Clinical sequencing begins 1998: NHGRI $1000

More information