Genome Sequencing-- Strategies

Bio 4342, Spring 04

What is a genome?
- A genome can be defined as the entire DNA content of each nucleated cell in an organism.
- Each organism has one or more chromosomes that contain all of its genetic information.
- Humans, for example, have a genome that is encoded on 46 chromosomes, organized into 23 pairs. One chromosome of each pair is inherited from the mother and one from the father.
- One pair of chromosomes determines sex (the "sex chromosomes"); the other 22 pairs are called autosomes.
- The object of the Human Genome Project was to determine the entire DNA sequence of each of these long DNA molecules (chromosomes), and to locate and identify each of the human genes.

Genome sequencing process
- Library creation
- Production sequencing
- Assembly
- Prefinishing (x 2)
- Finishing

Factors that determine sequencing strategy
- Genome size
- Chromosomal structure
- Repeat content and character
- Polymorphism/inbreeding
- Desired product (draft, finished)
- Supporting information (physical or genetic maps, closely related sequenced genomes, EST sequences)

Genome size
- Genome size determines:
  - Number and types of subclones
  - Number of sequence reads
  - Compute power/algorithm for sequence read assembly
- In general, genome size and complexity scale with the size of the organism

Chromosomal structure
- Centromeres and telomeres
- Pairing at meiosis: even numbers??
- Numbers and types/extent of sequence duplications

Repeat content and character
- The length and degree of degeneracy of repeats are important to know
- Cot-based methods can help to characterize the repeat classes in a genome
- Some newer strategies eliminate repeat content from libraries via Cot subtraction

Polymorphism/inbreeding
- The extent to which an organism is inbred determines the degree of polymorphism in the genome
- A high degree of polymorphism can complicate assembly, depending upon the assembly algorithm used
- A preliminary assessment of polymorphism can determine the sequencing strategy
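As a rough illustration of how genome size drives the number of sequence reads, the sketch below uses the standard coverage relation (reads = coverage x genome size / read length) and the Lander-Waterman approximation that a fraction of about e^(-coverage) of the bases is left unsampled. The function names and the example figures (a 3 Gb genome, 500 bp reads, 8x coverage) are illustrative assumptions, not values taken from these slides.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Number of reads required to reach the requested average coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def expected_unsampled_fraction(coverage):
    """Lander-Waterman estimate of the fraction of bases hit by no read.

    Assumes reads land uniformly at random, which repeats and cloning
    bias violate in practice.
    """
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical project: 3 Gb genome, 500 bp reads, 8x coverage.
    print(reads_needed(3_000_000_000, 500, 8))
    print(expected_unsampled_fraction(8))
```

Doubling the genome size doubles the read count at a fixed coverage, which is why the slide ties genome size to both the number of reads and the compute needed for assembly.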

The desired end product determines strategy
- "Finished" means completed to contiguity and high quality, with no or few gaps in the sequence
- "Comparative draft" means ordered and oriented contigs with gaps that are sized
- "Draft" means assembled into contigs (and supercontigs if possible), without order and orientation

Supporting information
- Genetic maps typically provide context in terms of simple sequence repeats that occur near genes
- Physical maps provide a context for the sequence assembly to scaffold onto, or provide an ordered series of clones for clone-based sequencing
- The availability of genome sequence for a closely related organism can provide some support for assembly validation
- ESTs are partial gene sequences obtained by cloning and sequencing mRNA populations

General considerations
- Timeline
- Desired end product
- Library making capacity/ability
- $$ (cost)
- Algorithms available for sequence assembly

Common sequencing strategies
- Whole genome shotgun (WGS)
- Clone-by-clone (physical map required)
- Hybrid approach (WGS and clone-by-clone)
- Map-assisted WGS (physical map required)

Whole genome sequencing strategy
- Sequencing by whole genome shotgun using several subclone insert sizes/types
- Large insert clone end-sequences are used for increased contig alignment and scaffolding
- Anchoring of assembled contigs to the genome is accomplished through computational identification of known markers

[Figure: Whole genome shotgun. Genome -> shotgun (paired ends) -> assemble sequence reads -> large insert clones scaffold]

WGS strategy
- Advantages include few libraries and rapid accumulation of sequence information
- Disadvantages include the lack of genome-scale confirmation of order/organization, including the assembly of repeats
- WGS is best suited for low complexity, small genomes (e.g. bacterial)

WGS assembly algorithms
- Arachne
- PCAP
- Apollo
- Celera assembler
- Etc.
- All WGS algorithms deal poorly with repeated regions
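To make the scaffolding idea concrete, here is a minimal sketch of how paired end-sequences can link contigs: if the two ends of one clone land in different contigs, those contigs are probably neighbors. The function name, the input layout, and the two-link threshold are assumptions chosen for illustration; real assemblers also check the orientation and spacing of each pair, which this sketch ignores.

```python
from collections import Counter


def contig_links(pair_placements, min_links=2):
    """Count clone end-pairs that join two different contigs.

    pair_placements: iterable of (contig_of_end_1, contig_of_end_2),
    one tuple per clone whose two end reads were both placed.
    Returns {(contig_x, contig_y): link_count} for pairs of contigs
    supported by at least `min_links` clones.
    """
    counts = Counter()
    for contig_a, contig_b in pair_placements:
        if contig_a == contig_b:
            continue  # both ends in one contig: no scaffolding information
        counts[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_links}


if __name__ == "__main__":
    placements = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3")]
    print(contig_links(placements))
```

Requiring more than one supporting clone is a simple guard against chimeric clones and misplaced reads in repeats, the weak point the slide notes for all WGS algorithms.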

Clone-by-clone strategy
- A clone-by-clone strategy involves the initial creation of a physical map of large-insert clones (BACs or fosmids)
- Physical maps are built by generating restriction digests of individual clones in excess (typically 15x coverage), then using computer-based algorithms to assemble contigs of related clones
- A minimal tile path is then generated, to provide minimally overlapping large insert clones that span the genome
- These clones then form the basis of sequencing efforts, including the creation of shotgun sequencing libraries from the individual tile path clones
- Finishing is then done on a large insert clone basis, with computer assembly of finished clones to provide whole chromosome sequence

Bacterial Artificial Chromosomes (BACs)
BACs are pieces of DNA called vectors that contain specific sequences:
- a BAC can be put into a bacterial cell, such as E. coli
- the BAC DNA is replicated and copies go to daughter cells during cell division
- we try to put ~100,000 base long pieces of DNA into BACs
- BAC clones containing genomic DNA pieces represent a manageable size of DNA for mapping studies: they can be amplified, isolated and characterized by restriction digestion to determine what sequences they share

BAC fingerprints
How are BAC fingerprints used?
[Figure: BAC fingerprint gel, with bands labeled 29,950 bp and 540 bp]
- We use them to create a fingerprint map
- A fingerprint map is a set of fingerprints (restriction digests of BACs) that are assembled into contigs
- Contigs are clusters of related BAC clones (they share component pieces, judging by their fingerprints)
- Together, the BAC clones in a contig represent the genomic region from which the clones were derived

FACT: The human and mouse genomes each required over 300,000 BAC fingerprints!
- Digested BAC DNA is electrophoresed on 1.2% agarose gels for 8 hours at 3 V/cm
- Each gel contains 96 sample and 25 marker lanes

How did we put all those together?
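The notion that two BAC clones "share component pieces, judging by their fingerprints" can be sketched as a count of restriction fragments whose sizes agree within the gel's measurement error. This is a simplified illustration, not the scoring used by any particular mapping program; the function name, the 10 bp tolerance, and the example fragment sizes are assumptions.

```python
def shared_bands(fingerprint_a, fingerprint_b, tolerance_bp=10):
    """Count fragments present in both fingerprints.

    Each fingerprint is a list of fragment sizes in base pairs.
    Two fragments are treated as the same band when their sizes
    differ by at most `tolerance_bp`; each fragment is matched once.
    """
    a = sorted(fingerprint_a)
    b = sorted(fingerprint_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance_bp:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


if __name__ == "__main__":
    # Hypothetical fingerprints for two clones.
    clone_1 = [540, 1200, 3300, 29950]
    clone_2 = [545, 1190, 5000, 29950]
    print(shared_bands(clone_1, clone_2))
```

Clones that share many bands are candidates for overlap. A real map also has to ask how likely that many matches would be by chance, since large genomes produce many fragments of similar size.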

Computer-aided BAC fingerprint assembly

Fingerprint contig assembly
- Individual gel lanes are identified
- Peaks are used to represent the sizes of the fragments
- A computer program called FPC is used to look at each BAC clone fingerprint and determine relationships
- Using FPC, overlapping clones with common restriction fragments are identified
- The process is repeated to assemble the contig until no further clones can be incorporated
- Manual editing of contigs is usually required

[Figure: A contig from the human BAC fingerprint map (clones A to G)]

What is the end result of fingerprint mapping?
1. Most BAC clones have been assembled into contigs.
2. The length (in base pairs) of the contigs is roughly equal to the anticipated size of the genome.
3. The BAC fingerprint map can be used to select a set of BAC clones that overlap one another to a minimal extent. These so-called minimal tiling path BACs will be used for the next phase of genomic mapping.
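Selecting minimal tiling path BACs from a finished contig can be viewed as an interval-covering problem. The sketch below uses a simple greedy rule: from the clones that reach the region already covered, always take the one that extends farthest. It assumes each clone has already been given start and end coordinates within the contig; the function name and the coordinates are hypothetical, and a production pipeline would also insist on a minimum overlap between neighbors so that the joins can be verified.

```python
def minimal_tiling_path(clones):
    """Pick a small set of overlapping clones that spans a contig.

    clones: list of (name, start, end) in contig map coordinates.
    Returns the clone names in left-to-right order.
    Raises ValueError if the clones leave a gap.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    contig_end = max(end for _, _, end in ordered)
    covered_to = ordered[0][1]
    path = []
    i = 0
    while covered_to < contig_end:
        best = None
        # Examine every clone that starts inside the covered region.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError("gap in clone coverage at %d" % covered_to)
        path.append(best[0])
        covered_to = best[2]
    return path


if __name__ == "__main__":
    contig = [("A", 0, 100), ("B", 20, 150), ("C", 90, 200),
              ("D", 140, 260), ("E", 190, 300)]
    print(minimal_tiling_path(contig))
```

In this example clones B and D are skipped because A, C and E already span the contig, which is the saving in sequencing effort that the tiling path is meant to deliver.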

Clone-by-clone strategy
- The advantage is provided by the physical map: an independent means of confirming sequence assembly in advance of sequencing
- A well developed algorithm for clone-based assembly is freely available (phrap)
- Disadvantages include the large numbers and types of libraries required, and the need to wait for the physical map to be ~complete before selecting the minimal tile path for sequencing
- Mainly appropriate for large complex genomes with a high degree of polymorphism

Hybrid approach
- Assemble WGS reads and add to BAC shotgun reads

Source: Eric D. Green, "Strategies for the systematic sequencing of complex genomes", Nature Reviews Genetics, Volume 2, August 2001, 573

Mouse/Rat genome sequencing
- No need to follow the same path as the human genome
- Two main components:
  - Whole genome shotgun
    - high genome coverage
    - rapid genome survey (aid human annotation)
  - BAC-by-BAC
    - BACs selected from fingerprint database
    - low sequence coverage
    - BAC end-sequences of good quality

Hybrid approach strategy
- Advantages include lots of early data from WGS, time to build the physical map, and a predetermined genome assembly
- Disadvantages include the need to generate many different libraries/types, few/poor assembly algorithms available, and expense
- Mainly applied to large, more complex genomes

Map-assisted WGS strategy

[Figure: Maplink Viewer]

- A physical map of large insert clones is constructed concurrently with WGS sequencing of variable insert size subclones
- End sequences of mapped clones enable linking of map contigs to sequence assembly contigs; supercontigs result
- Linkage enables the organization of finishing into discrete units (supercontigs) of known order

Map-assisted WGS strategy
- Ultimately, this strategy seeks to organize the sequence assembly using the map, and to complete the map using the sequence assembly
- Gap closure of the map and sequence can be accomplished by identifying gap-spanning clones, shotgun sequencing them, and assembling the completed clone sequence into the assembly
- This is a newer strategy, and algorithms are currently being developed or modified to enable it

Conclusions
- Many factors contribute to selecting a genome sequencing strategy
- Strategies and the algorithms to enable them are constantly being developed and refined
- Generating the production-style data for a genome project is rarely the rate-limiting step; finishing and gap closure are
- Having a physical map is key to being able to verify a genome assembly, but newer strategies entwine the sequence and map building processes
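One way to picture how end sequences of mapped clones organize the sequence assembly: each mapped clone has a position in the physical map, and its end sequence lands in some sequence contig, so the contigs can be ordered by the map positions of the clones that hit them. The sketch below is a toy version under that assumption; the function name, the input layout, and the use of the median position are illustrative choices, not a description of any published tool.

```python
from collections import defaultdict
from statistics import median


def order_contigs_by_map(end_hits):
    """Order sequence contigs along the physical map.

    end_hits: iterable of (map_position, sequence_contig), one entry
    per clone end sequence that was placed in a sequence contig.
    Returns contig names sorted by the median map position of the
    clone ends they contain.
    """
    positions = defaultdict(list)
    for map_position, contig in end_hits:
        positions[contig].append(map_position)
    return sorted(positions, key=lambda contig: median(positions[contig]))


if __name__ == "__main__":
    hits = [(10, "s2"), (12, "s2"), (3, "s1"), (5, "s1"), (30, "s3")]
    print(order_contigs_by_map(hits))
```

Using the median rather than the mean keeps a single misplaced end sequence from dragging a contig out of order. Contigs with no mapped clone ends stay unplaced, and those are the gaps that gap-spanning clones are meant to close.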