Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory

Size: px
Start display at page:

Download "Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory"

Transcription

1 Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Title Glomus intraradices: Initial Whole-Genome Shotgun Sequencing and Assembly Results Permalink Author Shapiro, Harris Publication Date escholarship.org Powered by the California Digital Library University of California

2 Glomus intraradices Initial Whole-Genome Shotgun Sequencing and Assembly Results Harris Shapiro DOE Joint Genome Institute This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and by the University of California, Lawrence Livermore National Laboratory under Contract No. W Eng-48, Lawrence Berkeley National Laboratory under contract No. DE-AC02-05CH11231 and Los Alamos National Laboratory under contract No. W-7405-ENG-36.

3 Overview Review of the Whole-Genome Shotgun (WGS) assembly procedure WGS library Quality Control (QC) G.intraradices test assembly Polymorphism estimates Data set anomalies Where to go from here COST 8.38 Meeting, Dijon, 3/6/2005 2

4 Outline of the Assembly Process: JAZZ, the JGI In-House Assembler Spores Extract DNA Pair-aligned reads Convert to graph: Reads = Nodes Edges = Pair alignments Shred & insert DNA 3 kb 8 kb 40 kb Clone libraries Forward & reverse sequencing Traverse graph to generate contigs and scaffolds End sequences (reads) with approximately known separations Read pairing information helps span COST 8.38 Meeting, Dijon, 3/6/2005 sections without good alignments 3

5 So What Could Go Wrong? Repeat Elements A R B R C Shred Assemble??? Polymorphism Haplotype 1 Haplotype 2?? Shred Assemble?? Stricter assembly parameters can distinguish (some) repeats, but make it more likely that haplotypes will assembly separately. COST 8.38 Meeting, Dijon, 3/6/2005 4

6 WGS Library Quality Control Identification of insertless clones Trimming of vector and low quality sequence Examination of the GC content distribution BLAST analysis COST 8.38 Meeting, Dijon, 3/6/2005 5

7 GC Content Comparison: Initial WGS Libraries 0.25 G.intraradices, Old WGS Libraries, GC Content Comparison AHZO (3 kb, 7,398 reads) AHZP (8 kb, 13,759 reads) AHZS (40 kb, 2,907 reads) 0.2 Fraction of Reads GC Content COST 8.38 Meeting, Dijon, 3/6/2005 6

8 GC Content Comparison: Second WGS Libraries 0.12 G.intraradices, New WGS Libraries, GC Content Comparison BCTU (3 kb, 7,228 reads) BCTW (8 kb, 7,548 reads) ATSX (40 kb, 1,479 reads) Fraction of Reads GC Content COST 8.38 Meeting, Dijon, 3/6/2005 7

9 Test Assembly Specification All 6 WGS libraries were included: ~28.7 MB of trimmed sequence, of which perhaps 4 7 MB consisted of various contaminants Genome size estimate: ~15 MB (Hijri & Sanders, 2004) Estimated depth: 1 1.5x; set to 1.0 for the assembly Repeats: Maximum copy number of 5 times the estimated depth (5) before they can t be used to seed an alignment Polymorphism/repeat divergence: used the default value of ~3% different for a neutral alignment COST 8.38 Meeting, Dijon, 3/6/2005 8

10 Summary of Test Assembly Results 615 scaffolds, containing 704 KB of scaffold sequence (582 KB contig sequence = 17.3% gap). After removing short and redundant scaffolds, 165 were left, with 318 KB of scaffold sequence (197 KB contig sequence = 38.1% gap) Filtered scaffold set was easily partitioned: Prokaryotic contaminant: GC content > 0.60 Cloning artifacts: GC content between 0.50 and 0.60 (with one exception) Potential G.intraradices scaffolds: GC content < 0.50 (with one exception) EST alignments supported the identification of the low-gc scaffolds COST 8.38 Meeting, Dijon, 3/6/2005 9

11 Test Assembly Statistics Scaffold Set Potential G.intraradices Cloning Artifact Prokaryotic Contamination Number of Scaffolds Scaffold Sequence Total (KB) Contig Sequence Total (KB) COST 8.38 Meeting, Dijon, 3/6/

12 Test Assembly: BLAST Analysis G.intraradices, 4/25/2005 Assembly, Filtered Scaffolds, Revised BLAST Results Filtered Scaffolds Prokaryotic-Only, Cloning Artifacts Prokaryotic-Only, Other 0.6 GC Content Mean Scaffold Sequence Depth (Excluding Gaps) COST 8.38 Meeting, Dijon, 3/6/

13 Test Assembly: EST Analysis 0.8 G.intraradices, 4/25/2005 Filtered Scaffolds, Alignments to cdna Sequences Filtered Scaffolds cdna Hits GC Content Mean Scaffold Sequence Depth (Excluding Gaps) COST 8.38 Meeting, Dijon, 3/6/

14 Polymorphism Estimation Procedure Generation of reference sequence: draft & finished subcloned fosmids Alignment of WGS & EST reads to the reference sequence Identification of good alignments Reads with unique alignments to the reference fosmids Greater than 97% of the read covered by the alignment Polymorphism calculations % ID of the good alignments (Number of matches/alignment footprint) Mismatch % of the good alignments (Number of mismatches/read length) COST 8.38 Meeting, Dijon, 3/6/

15 Subcloned Fosmid Creation First batch: 15 randomly selected fosmids from the AHZS library. Consisted almost entirely of cloning artifacts. Second batch: 10 targeted fosmids from the AHZS library. Due to cloning artifacts and sequencing problems, this yielded 5 finished fosmids, containing about 181 KB of reference sequence. Third batch: 10 targeted fosmids from the ATSX library. All 10 yielded draft sequences. COST 8.38 Meeting, Dijon, 3/6/

16 Sample Finished Fosmid WGS Alignments COST 8.38 Meeting, Dijon, 3/6/

17 Sample Draft Fosmid WGS Alignments COST 8.38 Meeting, Dijon, 3/6/

18 Summary of the WGS Polymorphism Results Polymorphism Estimate Data Set Number of Reads % ID Method Mismatch Method AHZO % +/- 4.9% 3.5% +/- 2.9% AHZP % +/- 4.2% 2.6% +/- 2.7% BCTU % +/- 6.1% 2.3% +/- 2.5% Overall % +/- 5.1% N/A COST 8.38 Meeting, Dijon, 3/6/

19 Details of the WGS Polymorphism Results With the % ID Method: All three WGS libraries had their main polymorphism peak at 0%- 1%. All three WGS libraries had a secondary peak at 7%-8%. AHZO may have had an additional secondary peak at 5%. With the Mismatch Method: All three WGS libraries had their main polymorphism peak at 0%- 1%. AHZO and BCTU still had secondary peaks at 7% - 8%. AHZO still had a secondary peak at 5%. AHZP no longer had any secondary peaks, instead having its mismatch rate decline smoothly to a background value. COST 8.38 Meeting, Dijon, 3/6/

20 WGS Data Set Issues Only produced plausible alignments to 2 of the 10 subcloned fosmids in the latest batch Very non-uniform coverage of some of the draft and finished fosmids Potential causes: Mis-assembled or finished fosmids Chimeric fosmid clones Undetected non-glomus contamination Library bias Statistical artifact COST 8.38 Meeting, Dijon, 3/6/

21 Sample Finished Fosmid EST Alignments COST 8.38 Meeting, Dijon, 3/6/

22 Sample Draft Fosmid EST Alignments COST 8.38 Meeting, Dijon, 3/6/

23 EST Polymorphism Results Very few EST sequences aligned to the subcloned fosmid sequences. Only 1 of the 5 finished fosmids had any alignments Only 2 of the 10 draft fosmids had plausible alignments The longest alignments all seemed to be missing sections of the ESTs, on the order of hundreds of bases. Due to the small number of alignments and the anomalies associated with them, it was not possible to estimate the polymorphism rate using this data set. COST 8.38 Meeting, Dijon, 3/6/

24 EST Data Set Issues Only produced plausible alignments to 2 of the 10 subcloned fosmids in the latest batch Very small number of alignments overall Alignments that were present did not cover large portions of the ESTs Potential causes: Mis-assembled or finished fosmids Chimeric fosmid clones Undetected non-glomus contamination Library bias Spurious alignments The haplotypes actually differ greatly from each other Statistical artifact COST 8.38 Meeting, Dijon, 3/6/

25 Where Do We Go From Here? Investigation of the WGS and EST subcloned fosmid anomalies Creation of a third round of WGS libraries Further attempts at WGS assembly Requires the production of a good set of WGS libraries Analysis of the possible repeats, which is complicated by high polymorphism Initially strict parameters to force the haplotypes to assemble separately Distribution of polymorphic elements might require a relaxedparameter assembly, to try to force the haplotypes to assemble together Pre-screening of repeat elements may allow some sidestepping of the repeat/polymorphism trade-off COST 8.38 Meeting, Dijon, 3/6/

26 Acknowledgements Joint Genome Institute Library Creation: Eileen Dalin, Chris Detter, Paul Richardson Production Sequencing: Tijana Glavina, Miranda Harmon-Smith, Susan Lucas Genome Assembly: Nik Putnam, Jarrod Chapman, Isaac Ho, Dan Rokhsar ESTs: Erika Lindquist, Peter Brokstein Subcloned Fosmids: Jane Grimwood Peter Lammers & his group Ian Sanders & his group This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program and by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng- 48, Lawrence Berkeley National Laboratory under contract No. DE-AC03-76SF00098 and Los Alamos National Laboratory under contract No. W-7405-ENG-36. COST 8.38 Meeting, Dijon, 3/6/

27 AHZO: ATYW (Finished Fosmid) Alignments COST 8.38 Meeting, Dijon, 3/6/

28 AHZO: ATYY (Finished Fosmid) Alignments COST 8.38 Meeting, Dijon, 3/6/

29 AHZO: ATYZ (Finished Fosmid) Alignments COST 8.38 Meeting, Dijon, 3/6/

30 AHZO: ATZC (Finished Fosmid) Alignments COST 8.38 Meeting, Dijon, 3/6/

Genome Sequencing-- Strategies

Genome Sequencing-- Strategies Genome Sequencing-- Strategies Bio 4342 Spring 04 What is a genome? A genome can be defined as the entire DNA content of each nucleated cell in an organism Each organism has one or more chromosomes that

More information

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Sequence Assembly and Alignment. Jim Noonan Department of Genetics Sequence Assembly and Alignment Jim Noonan Department of Genetics james.noonan@yale.edu www.yale.edu/noonanlab The assembly problem >>10 9 sequencing reads 36 bp - 1 kb 3 Gb Outline Basic concepts in genome

More information

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es Sequencing project Unknown sequence { experimental evidence result read 1 read 4 read 2 read 5 read 3 read 6 read 7 Computational requirements

More information

How to Characterize Rust Isolates: Genome Sequencing, Molecular Markers and Differentials

How to Characterize Rust Isolates: Genome Sequencing, Molecular Markers and Differentials How to Characterize Rust Isolates: Genome Sequencing, Molecular Markers and Differentials Dr. Reid D. Frederick USDA-Agricultural Research Service Foreign Disease-Weed Science Research Unit Fort Detrick,

More information

Genomics AGRY Michael Gribskov Hock 331

Genomics AGRY Michael Gribskov Hock 331 Genomics AGRY 60000 Michael Gribskov gribskov@purdue.edu Hock 331 Computing Essentials Resources In this course we will assemble and annotate both genomic and transcriptomic sequence assemblies We will

More information

The Diploid Genome Sequence of an Individual Human

The Diploid Genome Sequence of an Individual Human The Diploid Genome Sequence of an Individual Human Maido Remm Journal Club 12.02.2008 Outline Background (history, assembling strategies) Who was sequenced in previous projects Genome variations in J.

More information

Unfortunately, plate data was not available to generate an initial low coverage

Unfortunately, plate data was not available to generate an initial low coverage Finishing Report Carolyn Cain Fosmid: XBAA-30G19 Fosmid XBAA-30G19 was challenging to assemble, but ultimately came together in a satisfactory final assembly. The initial problems in my assembly were gaps,

More information

Biol 478/595 Intro to Bioinformatics

Biol 478/595 Intro to Bioinformatics Biol 478/595 Intro to Bioinformatics September M 1 Labor Day 4 W 3 MG Database Searching Ch. 6 5 F 5 MG Database Searching Hw1 6 M 8 MG Scoring Matrices Ch 3 and Ch 4 7 W 10 MG Pairwise Alignment 8 F 12

More information

CSE182-L16. LW statistics/assembly

CSE182-L16. LW statistics/assembly CSE182-L16 LW statistics/assembly Silly Quiz Who are these people, and what is the occasion? Genome Sequencing and Assembly Sequencing A break at T is shown here. Measuring the lengths using electrophoresis

More information

Genome Assembly of the Obligate Crassulacean Acid Metabolism (CAM) Species Kalanchoë laxiflora

Genome Assembly of the Obligate Crassulacean Acid Metabolism (CAM) Species Kalanchoë laxiflora Genome Assembly of the Obligate Crassulacean Acid Metabolism (CAM) Species Kalanchoë laxiflora Jerry Jenkins 1, Xiaohan Yang 2, Hengfu Yin 2, Gerald A. Tuskan 2, John C. Cushman 3, Won Cheol Yim 3, Jason

More information

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Ruth Howe Bio 434W 27 February 2010 Abstract The fourth or dot chromosome of Drosophila species is composed primarily of highly condensed,

More information

Genome Projects. Part III. Assembly and sequencing of human genomes

Genome Projects. Part III. Assembly and sequencing of human genomes Genome Projects Part III Assembly and sequencing of human genomes All current genome sequencing strategies are clone-based. 1. ordered clone sequencing e.g., C. elegans well suited for repetitive sequences

More information

Course summary. Today. PCR Polymerase chain reaction. Obtaining molecular data. Sequencing. DNA sequencing. Genome Projects.

Course summary. Today. PCR Polymerase chain reaction. Obtaining molecular data. Sequencing. DNA sequencing. Genome Projects. Goals Organization Labs Project Reading Course summary DNA sequencing. Genome Projects. Today New DNA sequencing technologies. Obtaining molecular data PCR Typically used in empirical molecular evolution

More information

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR) tru TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR) Anton Bankevich Center for Algorithmic Biotechnology, SPbSU Sequencing costs 1. Sequencing costs do not follow Moore s law

More information

Alignment and Assembly

Alignment and Assembly Alignment and Assembly Genome assembly refers to the process of taking a large number of short DNA sequences and putting them back together to create a representation of the original chromosomes from which

More information

Finishing Drosophila grimshawi Fosmid Clone DGA23F17. Kenneth Smith Biology 434W Professor Elgin February 20, 2009

Finishing Drosophila grimshawi Fosmid Clone DGA23F17. Kenneth Smith Biology 434W Professor Elgin February 20, 2009 Finishing Drosophila grimshawi Fosmid Clone DGA23F17 Kenneth Smith Biology 434W Professor Elgin February 20, 2009 Smith 2 Abstract: My fosmid, DGA23F17, did not present too many challenges when I finished

More information

Direct determination of diploid genome sequences. Supplemental material: contents

Direct determination of diploid genome sequences. Supplemental material: contents Direct determination of diploid genome sequences Neil I. Weisenfeld, Vijay Kumar, Preyas Shah, Deanna M. Church, David B. Jaffe Supplemental material: contents Supplemental Note 1. Comparison of performance

More information

Contact us for more information and a quotation

Contact us for more information and a quotation GenePool Information Sheet #1 Installed Sequencing Technologies in the GenePool The GenePool offers sequencing service on three platforms: Sanger (dideoxy) sequencing on ABI 3730 instruments Illumina SOLEXA

More information

SUPPLEMENTARY MATERIAL FOR THE PAPER: RASCAF: IMPROVING GENOME ASSEMBLY WITH RNA-SEQ DATA

SUPPLEMENTARY MATERIAL FOR THE PAPER: RASCAF: IMPROVING GENOME ASSEMBLY WITH RNA-SEQ DATA SUPPLEMENTARY MATERIAL FOR THE PAPER: RASCAF: IMPROVING GENOME ASSEMBLY WITH RNA-SEQ DATA Authors: Li Song, Dhruv S. Shankar, Liliana Florea Table of contents: Figure S1. Methods finding contig connections

More information

Finishing of Fosmid 1042D14. Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae

Finishing of Fosmid 1042D14. Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae Schefkind 1 Adam Schefkind Bio 434W 03/08/2014 Finishing of Fosmid 1042D14 Abstract Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae genomic DNA. Through a comprehensive analysis of forward-

More information

Mapping strategies for sequence reads

Mapping strategies for sequence reads Mapping strategies for sequence reads Ernest Turro University of Cambridge 21 Oct 2013 Quantification A basic aim in genomics is working out the contents of a biological sample. 1. What distinct elements

More information

Improving Genome Assemblies without Sequencing

Improving Genome Assemblies without Sequencing Improving Genome Assemblies without Sequencing Michael Schatz April 25, 2005 TIGR Bioinformatics Seminar Assembly Pipeline Overview 1. Sequence shotgun reads 2. Call Bases 3. Trim Reads 4. Assemble phred/tracetuner/kb

More information

Finishing Drosophila Ananassae Fosmid 2728G16

Finishing Drosophila Ananassae Fosmid 2728G16 Finishing Drosophila Ananassae Fosmid 2728G16 Kyle Jung March 8, 2013 Bio434W Professor Elgin Page 1 Abstract For my finishing project, I chose to finish fosmid 2728G16. This fosmid carries a segment of

More information

Molecular Biology: DNA sequencing

Molecular Biology: DNA sequencing Molecular Biology: DNA sequencing Author: Prof Marinda Oosthuizen Licensed under a Creative Commons Attribution license. SEQUENCING OF LARGE TEMPLATES As we have seen, we can obtain up to 800 nucleotides

More information

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014 Introduction to metagenome assembly Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014 Sequencing specs* Method Read length Accuracy Million reads Time Cost per M 454

More information

Sundaram DGA43A19 Page 1. Finishing Drosophila grimshawi Fosmid: DGA43A19 Varun Sundaram 2/16/09

Sundaram DGA43A19 Page 1. Finishing Drosophila grimshawi Fosmid: DGA43A19 Varun Sundaram 2/16/09 Sundaram DGA43A19 Page 1 Finishing Drosophila grimshawi Fosmid: DGA43A19 Varun Sundaram 2/16/09 Sundaram DGA43A19 Page 2 Abstract My project focused on the fosmid clone DGA43A19. The main problems with

More information

De Novo and Hybrid Assembly

De Novo and Hybrid Assembly On the PacBio RS Introduction The PacBio RS utilizes SMRT technology to generate both Continuous Long Read ( CLR ) and Circular Consensus Read ( CCS ) data. In this document, we describe sequencing the

More information

De novo meta-assembly of ultra-deep sequencing data

De novo meta-assembly of ultra-deep sequencing data De novo meta-assembly of ultra-deep sequencing data Hamid Mirebrahim 1, Timothy J. Close 2 and Stefano Lonardi 1 1 Department of Computer Science and Engineering 2 Department of Botany and Plant Sciences

More information

Mate-pair library data improves genome assembly

Mate-pair library data improves genome assembly De Novo Sequencing on the Ion Torrent PGM APPLICATION NOTE Mate-pair library data improves genome assembly Highly accurate PGM data allows for de Novo Sequencing and Assembly For a draft assembly, generate

More information

Finished (Almost) Sequence of Drosophila littoralis Chromosome 4 Fosmid Clone XAAA73. Seth Bloom Biology 4342 March 7, 2004

Finished (Almost) Sequence of Drosophila littoralis Chromosome 4 Fosmid Clone XAAA73. Seth Bloom Biology 4342 March 7, 2004 Finished (Almost) Sequence of Drosophila littoralis Chromosome 4 Fosmid Clone XAAA73 Seth Bloom Biology 4342 March 7, 2004 Summary: I successfully sequenced Drosophila littoralis fosmid clone XAAA73. The

More information

3) This diagram represents: (Indicate all correct answers)

3) This diagram represents: (Indicate all correct answers) Functional Genomics Midterm II (self-questions) 2/4/05 1) One of the obstacles in whole genome assembly is dealing with the repeated portions of DNA within the genome. How do repeats cause complications

More information

Illumina Read QC. UCD Genome Center Bioinformatics Core Monday 29 August 2016

Illumina Read QC. UCD Genome Center Bioinformatics Core Monday 29 August 2016 Illumina Read QC UCD Genome Center Bioinformatics Core Monday 29 August 2016 QC should be interactive Error modes Each technology has unique error modes, depending on the physico-chemical processes involved

More information

NGS developments in tomato genome sequencing

NGS developments in tomato genome sequencing NGS developments in tomato genome sequencing 16-02-2012, Sandra Smit TATGTTTTGGAAAACATTGCATGCGGAATTGGGTACTAGGTTGGACCTTAGTACC GCGTTCCATCCTCAGACCGATGGTCAGTCTGAGAGAACGATTCAAGTGTTGGAAG ATATGCTTCGTGCATGTGTGATAGAGTTTGGTGGCCATTGGGATAGCTTCTTACC

More information

10/20/2009 Comp 590/Comp Fall

10/20/2009 Comp 590/Comp Fall Lecture 14: DNA Sequencing Study Chapter 8.9 10/20/2009 Comp 590/Comp 790-90 Fall 2009 1 DNA Sequencing Shear DNA into millions of small fragments Read 500 700 nucleotides at a time from the small fragments

More information

Finishing Drosophila ananassae Fosmid 2410F24

Finishing Drosophila ananassae Fosmid 2410F24 Nick Spies Research Explorations in Genomics Finishing Report Elgin, Shaffer and Leung 23 February 2013 Abstract: Finishing Drosophila ananassae Fosmid 2410F24 Finishing Drosophila ananassae fosmid clone

More information

Transcriptome Assembly, Functional Annotation (and a few other related thoughts)

Transcriptome Assembly, Functional Annotation (and a few other related thoughts) Transcriptome Assembly, Functional Annotation (and a few other related thoughts) Monica Britton, Ph.D. Sr. Bioinformatics Analyst June 23, 2017 Differential Gene Expression Generalized Workflow File Types

More information

NEXT GENERATION SEQUENCING. Farhat Habib

NEXT GENERATION SEQUENCING. Farhat Habib NEXT GENERATION SEQUENCING HISTORY HISTORY Sanger Dominant for last ~30 years 1000bp longest read Based on primers so not good for repetitive or SNPs sites HISTORY Sanger Dominant for last ~30 years 1000bp

More information

The Genome Analysis Centre. Building Excellence in Genomics and Computational Bioscience

The Genome Analysis Centre. Building Excellence in Genomics and Computational Bioscience Building Excellence in Genomics and Computational Bioscience Wheat genome sequencing: an update from TGAC Sequencing Technology Development now Plant & Microbial Genomics Group Leader Matthew Clark matt.clark@tgac.ac.uk

More information

AMOS Assembly Validation and Visualization

AMOS Assembly Validation and Visualization AMOS Assembly Validation and Visualization Michael Schatz Center for Bioinformatics and Computational Biology University of Maryland August 13, 2006 University of Hawaii Outline AMOS Validation Pipeline

More information

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter A shotgun introduction to sequence assembly (with Velvet) MCB 247 - Brem, Eisen and Pachter Hot off the press January 27, 2009 06:00 AM Eastern Time llumina Launches Suite of Next-Generation Sequencing

More information

De Novo Assembly of High-throughput Short Read Sequences

De Novo Assembly of High-throughput Short Read Sequences De Novo Assembly of High-throughput Short Read Sequences Chuming Chen Center for Bioinformatics and Computational Biology (CBCB) University of Delaware NECC Third Skate Genome Annotation Workshop May 23,

More information

Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation

Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation Assembly of the Human Genome Daniel Huson Informatics Research Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation

More information

BENG 183 Trey Ideker. Genome Assembly and Physical Mapping

BENG 183 Trey Ideker. Genome Assembly and Physical Mapping BENG 183 Trey Ideker Genome Assembly and Physical Mapping Reasons for sequencing Complete genome sequencing!!! Resequencing (Confirmatory) E.g., short regions containing single nucleotide polymorphisms

More information

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Number and length distributions of the inferred fosmids.

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Number and length distributions of the inferred fosmids. Supplementary Figure 1 Number and length distributions of the inferred fosmids. Fosmid were inferred by mapping each pool s sequence reads to hg19. We retained only those reads that mapped to within a

More information

Read Quality Assessment & Improvement. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

Read Quality Assessment & Improvement. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016 Read Quality Assessment & Improvement UCD Genome Center Bioinformatics Core Tuesday 14 June 2016 QA&I should be interactive Error modes Each technology has unique error modes, depending on the physico-chemical

More information

N50 must die!? Genome assembly workshop, Santa Cruz, 3/15/11

N50 must die!? Genome assembly workshop, Santa Cruz, 3/15/11 N50 must die!? Genome assembly workshop, Santa Cruz, 3/15/11 twitter: @assemblathon web: assemblathon.org Should N50 die in its role as a frequently used measure of genome assembly quality? Are there other

More information

Metagenomic 3C, full length 16S amplicon sequencing on Illumina, and the diabetic skin microbiome

Metagenomic 3C, full length 16S amplicon sequencing on Illumina, and the diabetic skin microbiome Also: Sunaina Melissa Gardiner UTS Catherine Burke UTS Michael Liu UTS Chris Beitel UTS, UC Davis Matt DeMaere UTS Metagenomic 3C, full length 16S amplicon sequencing on Illumina, and the diabetic skin

More information

Annotation Practice Activity [Based on materials from the GEP Summer 2010 Workshop] Special thanks to Chris Shaffer for document review Parts A-G

Annotation Practice Activity [Based on materials from the GEP Summer 2010 Workshop] Special thanks to Chris Shaffer for document review Parts A-G Annotation Practice Activity [Based on materials from the GEP Summer 2010 Workshop] Special thanks to Chris Shaffer for document review Parts A-G Introduction: A genome is the total genetic content of

More information

Targeted Sequencing Using Droplet-Based Microfluidics. Keith Brown Director, Sales

Targeted Sequencing Using Droplet-Based Microfluidics. Keith Brown Director, Sales Targeted Sequencing Using Droplet-Based Microfluidics Keith Brown Director, Sales brownk@raindancetech.com Who we are: is a Provider of Microdroplet-based Solutions The Company s RainStorm TM Technology

More information

A Guide to Consed Michelle Itano, Carolyn Cain, Tien Chusak, Justin Richner, and SCR Elgin.

A Guide to Consed Michelle Itano, Carolyn Cain, Tien Chusak, Justin Richner, and SCR Elgin. 1 A Guide to Consed Michelle Itano, Carolyn Cain, Tien Chusak, Justin Richner, and SCR Elgin. Main Window Figure 1. The Main Window is the starting point when Consed is opened. From here, you can access

More information

ChIP-seq and RNA-seq

ChIP-seq and RNA-seq ChIP-seq and RNA-seq Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions (ChIPchromatin immunoprecipitation)

More information

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014 1 short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014 2 Genomathica Assembler Mathematica notebook for genome assembly simulation Assembler can be found at:

More information

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

Genome Assembly Using de Bruijn Graphs. Biostatistics 666 Genome Assembly Using de Bruijn Graphs Biostatistics 666 Previously: Reference Based Analyses Individual short reads are aligned to reference Genotypes generated by examining reads overlapping each position

More information

Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing (HaploSeq)

Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing (HaploSeq) Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing (HaploSeq) Lyon Lab Journal Clubs Han Fang 01/28/2014 Lyon Lab Journal Clubs 1 Lyon Lab Journal Clubs 2 Existing genome

More information

Quality Control of High Throughput EST & SAGE Sequence Production

Quality Control of High Throughput EST & SAGE Sequence Production European Spotfire Users Conference - Versailles, April 25-26, 2002 Quality Control of High Throughput EST & SAGE Sequence Production Marcel de Leeuw IT director Summary I. Application field II. Why quality

More information

1 Abstract. 2 Introduction. 3 Requirements. 4 Procedure

1 Abstract. 2 Introduction. 3 Requirements. 4 Procedure 1 Abstract 2 Introduction This SOP describes the procedure for creating the reference genomes database used for HMP WGS read mapping. The database is comprised of all archaeal, bacterial, lower eukaryote

More information

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments BLAST 100 times faster than dynamic programming. Good for database searches. Derive a list of words of length w from query (e.g., 3 for protein, 11 for DNA) High-scoring words are compared with database

More information

RNA-Sequencing analysis

RNA-Sequencing analysis RNA-Sequencing analysis Markus Kreuz 25. 04. 2012 Institut für Medizinische Informatik, Statistik und Epidemiologie Content: Biological background Overview transcriptomics RNA-Seq RNA-Seq technology Challenges

More information

Genome evolution on the allotetraploid Xenopus laevis

Genome evolution on the allotetraploid Xenopus laevis Genome evolution on the allotetraploid Xenopus laevis Taejoon Kwon Department of Biomedical Engineering, School of Life Sciences Ulsan National Institute of Science & Technology (UNIST) Xenopus Bioinformatics

More information

Parts of a standard FastQC report

Parts of a standard FastQC report FastQC FastQC, written by Simon Andrews of Babraham Bioinformatics, is a very popular tool used to provide an overview of basic quality control metrics for raw next generation sequencing data. There are

More information

Annotating Fosmid 14p24 of D. Virilis chromosome 4

Annotating Fosmid 14p24 of D. Virilis chromosome 4 Lo 1 Annotating Fosmid 14p24 of D. Virilis chromosome 4 Lo, Louis April 20, 2006 Annotation Report Introduction In the first half of Research Explorations in Genomics I finished a 38kb fragment of chromosome

More information

Bioinformatics for Genomics

Bioinformatics for Genomics Bioinformatics for Genomics It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material. When I was young my Father

More information

Introduction to Plant Genomics and Online Resources. Manish Raizada University of Guelph

Introduction to Plant Genomics and Online Resources. Manish Raizada University of Guelph Introduction to Plant Genomics and Online Resources Manish Raizada University of Guelph Genomics Glossary http://www.genomenewsnetwork.org/articles/06_00/sequence_primer.shtml Annotation Adding pertinent

More information

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009 UHT Sequencing Course Large-scale genotyping Christian Iseli January 2009 Overview Introduction Examples Base calling method and parameters Reads filtering Reads classification Detailed alignment Alignments

More information

The Ensembl Database. Dott.ssa Inga Prokopenko. Corso di Genomica

The Ensembl Database. Dott.ssa Inga Prokopenko. Corso di Genomica The Ensembl Database Dott.ssa Inga Prokopenko Corso di Genomica 1 www.ensembl.org Lecture 7.1 2 What is Ensembl? Public annotation of mammalian and other genomes Open source software Relational database

More information

Sequencing the genomes of Nicotiana sylvestris and Nicotiana tomentosiformis Nicolas Sierro

Sequencing the genomes of Nicotiana sylvestris and Nicotiana tomentosiformis Nicolas Sierro Sequencing the genomes of Nicotiana sylvestris and Nicotiana tomentosiformis Nicolas Sierro Philip Morris International R&D, Philip Morris Products S.A., Neuchatel, Switzerland Introduction Nicotiana sylvestris

More information

ChIP-seq and RNA-seq. Farhat Habib

ChIP-seq and RNA-seq. Farhat Habib ChIP-seq and RNA-seq Farhat Habib fhabib@iiserpune.ac.in Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions

More information

Supplementary Figure 1. Design of the control microarray. a, Genomic DNA from the

Supplementary Figure 1. Design of the control microarray. a, Genomic DNA from the Supplementary Information Supplementary Figures Supplementary Figure 1. Design of the control microarray. a, Genomic DNA from the strain M8 of S. ruber and a fosmid containing the S. ruber M8 virus M8CR4

More information

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions Joshua N. Burton 1, Andrew Adey 1, Rupali P. Patwardhan 1, Ruolan Qiu 1, Jacob O. Kitzman 1, Jay Shendure 1 1 Department

More information

ABSTRACT. We present a reliable, easy to implement algorithm to generate a set of highly

ABSTRACT. We present a reliable, easy to implement algorithm to generate a set of highly ABSTRACT Title of dissertation: IMPROVING GENOME ASSEMBLY Cevat Ustun, Doctor of Philosophy, 2005 Dissertation directed by: Professor Jim Yorke and Brian Hunt Department of Physics and Math We present

More information

High-Throughput Bioinformatics: Re-sequencing and de novo assembly. Elena Czeizler

High-Throughput Bioinformatics: Re-sequencing and de novo assembly. Elena Czeizler High-Throughput Bioinformatics: Re-sequencing and de novo assembly Elena Czeizler 13.11.2015 Sequencing data Current sequencing technologies produce large amounts of data: short reads The outputted sequences

More information

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014 Single Nucleotide Variant Analysis H3ABioNet May 14, 2014 Outline What are SNPs and SNVs? How do we identify them? How do we call them? SAMTools GATK VCF File Format Let s call variants! Single Nucleotide

More information

Lecture 14: DNA Sequencing

Lecture 14: DNA Sequencing Lecture 14: DNA Sequencing Study Chapter 8.9 10/17/2013 COMP 465 Fall 2013 1 Shear DNA into millions of small fragments Read 500 700 nucleotides at a time from the small fragments (Sanger method) DNA Sequencing

More information

Introduction to Sequencher. Tom Randall Center for Bioinformatics

Introduction to Sequencher. Tom Randall Center for Bioinformatics Introduction to Sequencher Tom Randall Center for Bioinformatics tarandal@email.unc.edu Introduction Importing, viewing and manipulating chromatographs Trimming chromatographs Assembly into contigs Editing

More information

DNA Sequencing and Assembly

DNA Sequencing and Assembly DNA Sequencing and Assembly CS 262 Lecture Notes, Winter 2016 February 2nd, 2016 Scribe: Mark Berger Abstract In this lecture, we survey a variety of different sequencing technologies, including their

More information

Whole Genome Sequence Data Quality Control and Validation

Whole Genome Sequence Data Quality Control and Validation Whole Genome Sequence Data Quality Control and Validation GoSeqIt ApS / Ved Klædebo 9 / 2970 Hørsholm VAT No. DK37842524 / Phone +45 26 97 90 82 / Web: www.goseqit.com / mail: mail@goseqit.com Table of

More information

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Read Complexity

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Read Complexity Supplementary Figure 1 Read Complexity A) Density plot showing the percentage of read length masked by the dust program, which identifies low-complexity sequence (simple repeats). Scrappie outputs a significantly

More information

Supplementary Figure 1 Genotyping by Sequencing (GBS) pipeline used in this study to genotype maize inbred lines. The 14,129 maize inbred lines were

Supplementary Figure 1 Genotyping by Sequencing (GBS) pipeline used in this study to genotype maize inbred lines. The 14,129 maize inbred lines were Supplementary Figure 1 Genotyping by Sequencing (GBS) pipeline used in this study to genotype maize inbred lines. The 14,129 maize inbred lines were processed following GBS experimental design 1 and bioinformatics

More information

Introduction to DNA-Sequencing

Introduction to DNA-Sequencing informatics.sydney.edu.au sih.info@sydney.edu.au The Sydney Informatics Hub provides support, training, and advice on research data, analyses and computing. Talk to us about your computing infrastructure,

More information

CBC Data Therapy. Metagenomics Discussion

CBC Data Therapy. Metagenomics Discussion CBC Data Therapy Metagenomics Discussion General Workflow Microbial sample Generate Metaomic data Process data (QC, etc.) Analysis Marker Genes Extract DNA Amplify with targeted primers Filter errors,

More information

Exploiting novel rice baseline datasets: WGS, BAC-based platinum genome sequencing and full-length transcriptomics

Exploiting novel rice baseline datasets: WGS, BAC-based platinum genome sequencing and full-length transcriptomics Exploiting novel rice baseline datasets: WGS, BAC-based platinum genome sequencing and full-length transcriptomics Dario Copetti, PhD Arizona Genomics Institute The University of Arizona International

More information

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis -Seq Analysis Quality Control checks Reproducibility Reliability -seq vs Microarray Higher sensitivity and dynamic range Lower technical variation Available for all species Novel transcript identification

More information

Supplementary Data for Hybrid error correction and de novo assembly of single-molecule sequencing reads

Supplementary Data for Hybrid error correction and de novo assembly of single-molecule sequencing reads Supplementary Data for Hybrid error correction and de novo assembly of single-molecule sequencing reads Online Resources Pre&compiledsourcecodeanddatasetsusedforthispublication: http://www.cbcb.umd.edu/software/pbcr

More information

Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park

Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park Next Generation Sequences & Chloroplast Assembly 8 June, 2012 Jongsun Park Table of Contents 1 History of Sequencing Technologies 2 Genome Assembly Processes With NGS Sequences 3 How to Assembly Chloroplast

More information

Genome sequencing in Senecio squalidus

Genome sequencing in Senecio squalidus Genome sequencing in Senecio squalidus Outline of project A new NERC funded grant, the genomic basis of adaptation and species divergence in Senecio in collaboration with Richard Abbott and Dmitry Filatov

More information

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department

More information

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS. !! www.clutchprep.com CONCEPT: OVERVIEW OF GENOMICS Genomics is the study of genomes in their entirety Bioinformatics is the analysis of the information content of genomes - Genes, regulatory sequences,

More information

De Novo Repeats Construction: Methods and Applications

De Novo Repeats Construction: Methods and Applications University of Connecticut DigitalCommons@UConn Doctoral Dissertations University of Connecticut Graduate School 5-5-2017 De Novo Repeats Construction: Methods and Applications Chong Chu chong.chu@engr.uconn.edu

More information

DNA sequencing. Course Info

DNA sequencing. Course Info DNA sequencing EECS 458 CWRU Fall 2004 Readings: Pevzner Ch1-4 Adams, Fields & Venter (ISBN:0127170103) Serafim Batzoglou s slides Course Info Instructor: Jing Li 509 Olin Bldg Phone: X0356 Email: jingli@eecs.cwru.edu

More information

Chapter 2: Access to Information

Chapter 2: Access to Information Chapter 2: Access to Information Outline Introduction to biological databases Centralized databases store DNA sequences Contents of DNA, RNA, and protein databases Central bioinformatics resources: NCBI

More information

Finishing Drosophila Grimshawi Fosmid Clone DGA19A15. Matthew Kwong Bio4342 Professor Elgin February 23, 2010

Finishing Drosophila Grimshawi Fosmid Clone DGA19A15. Matthew Kwong Bio4342 Professor Elgin February 23, 2010 Finishing Drosophila Grimshawi Fosmid Clone DGA19A15 Matthew Kwong Bio4342 Professor Elgin February 23, 2010 1 Abstract The primary aim of the Bio4342 course, Research Explorations and Genomics, is to

More information

Next Genera*on Sequencing II: Personal Genomics. Jim Noonan Department of Gene*cs

Next Genera*on Sequencing II: Personal Genomics. Jim Noonan Department of Gene*cs Next Genera*on Sequencing II: Personal Genomics Jim Noonan Department of Gene*cs Personal genome sequencing Iden*fying the gene*c basis of phenotypic diversity among humans Gene*c risk factors for disease

More information

Next-generation sequencing and quality control: An introduction 2016

Next-generation sequencing and quality control: An introduction 2016 Next-generation sequencing and quality control: An introduction 2016 s.schmeier@massey.ac.nz http://sschmeier.com/bioinf-workshop/ Overview Typical workflow of a genomics experiment Genome versus transcriptome

More information

Draft 3 Annotation of DGA06H06, Contig 1 Jeannette Wong Bio4342W 27 April 2009

Draft 3 Annotation of DGA06H06, Contig 1 Jeannette Wong Bio4342W 27 April 2009 Page 1 Draft 3 Annotation of DGA06H06, Contig 1 Jeannette Wong Bio4342W 27 April 2009 Page 2 Introduction: Annotation is the process of analyzing the genomic sequence of an organism. Besides identifying

More information

Theory and Application of Multiple Sequence Alignments

Theory and Application of Multiple Sequence Alignments Theory and Application of Multiple Sequence Alignments a.k.a What is a Multiple Sequence Alignment, How to Make One, and What to Do With It Brett Pickett, PhD History Structure of DNA discovered (1953)

More information

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club De novo assembly of human genomes with massively parallel short read sequencing Mikk Eelmets Journal Club 06.04.2010 Problem DNA sequencing technologies: Sanger sequencing (500-1000 bp) Next-generation

More information

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Transcriptomics analysis with RNA seq: an overview Frederik Coppens Transcriptomics analysis with RNA seq: an overview Frederik Coppens Platforms Applications Analysis Quantification RNA content Platforms Platforms Short (few hundred bases) Long reads (multiple kilobases)

More information

The Basics of Understanding Whole Genome Next Generation Sequence Data

The Basics of Understanding Whole Genome Next Generation Sequence Data The Basics of Understanding Whole Genome Next Generation Sequence Data Heather Carleton-Romer, MPH, Ph.D. ASM-CDC Infectious Disease and Public Health Microbiology Postdoctoral Fellow PulseNet USA Next

More information

Using the Potato Genome Sequence! Robin Buell! Michigan State University! Department of Plant Biology! August 15, 2010!

Using the Potato Genome Sequence! Robin Buell! Michigan State University! Department of Plant Biology! August 15, 2010! Using the Potato Genome Sequence! Robin Buell! Michigan State University! Department of Plant Biology! August 15, 2010! buell@msu.edu! 1 Whole Genome Shotgun Sequencing 2 New Technologies Revolutionize

More information

Supplementary Table 1. Summary of whole genome shotgun sequence used for genome assembly

Supplementary Table 1. Summary of whole genome shotgun sequence used for genome assembly Supplementary Tables Supplementary Table 1. Summary of whole genome shotgun sequence used for genome assembly Library Read length Raw data Filtered data insert size (bp) * Total Sequence depth Total Sequence

More information