Overview of second-generation sequencing-based methodologies for identifying DNA polymorphisms and for large-scale genotyping

Orzenil Bonfim da Silva Junior, Embrapa Recursos Genéticos e Biotecnologia. MCBio: Computational models for establishing means and methodological procedures for data analysis in bioinformatics. PA 3: Systematization and application of analysis models for genome-scale association experiments.

SNP interrogation and detection methods. Locus-specific genotyping assays are designed to capture information about a given position in the genome and typically require specific oligonucleotides. With the rapid decrease in sequencing costs, entire genomes could simply be re-sequenced, but the cost of sequencing a genome and the complexity of assembling it are still too high.

Cost per Raw Megabase of DNA Sequence. The cost of sequencing has dropped substantially, but library construction still dominates the cost. Genome complexity can be reduced, and DNA samples can be pooled.

SNP interrogation and detection methods. Finding variation in a sample-level working dataset is not sufficient to generalize to the population level, for two reasons: 1. The variation could be specific to the individual, not generic to the population. 2. The variation could be due to artifacts (library preparation, sequencing errors, analytical errors). Hence large-scale replication is needed to statistically validate the finding and to distinguish real variation from sequencing artifacts. Pooling is key to sequencing at scale at a reasonable cost.

Genotyping by Sequencing. Restriction digestion: Diversity Arrays Technology (DArT), Restriction site Associated DNA (RAD-Seq), GBS (Buckler Lab.). Sequence capture: selective primer PCR after Nextera tagmentation (nextRAD) or capture probes. Low-coverage WGS data from pooled samples.

DArT (Jaccoud et al. 2001. Nucleic Acids Res. 29(4)). Proved robust to genome size and ploidy-level differences among approximately 60 organisms, including "orphan crops". Combines genome complexity reduction methods that enrich for genic regions with a highly parallel assay readout on a number of "open-access" microarray platforms. Enabled a number of applications in which allelic frequencies can be estimated, reflecting the level of DNA sequence variation in the tested loci.

DArT polymorphism and variant test. A single DArT assay tests for polymorphism at tens of thousands of genomic loci, with the final number of markers reported reflecting the level of DNA sequence variation in the tested loci. SNP interrogation in DArT is mediated by the high fidelity of restriction enzymes rather than by primer annealing. The method performs well in polyploid species such as wheat, banana or sugarcane, and does not require any existing DNA-sequence information.

Microarray-based DArT. Captures a defined set of fragments from a genomic DNA sample, generated by restriction-enzyme digestion (a genomic representation). SNP and InDel polymorphisms at or between restriction-enzyme sites determine whether or not an individual fragment is captured in the representation of a particular genotype (a DArT marker).
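The presence/absence logic of a DArT marker can be illustrated with a toy in silico digestion. This is a minimal sketch, not the DArT protocol: the two sequences are hypothetical, only PstI (recognition sequence CTGCAG) is used, fragments are cut at the first base of each site rather than at the enzyme's real cleavage position, and size selection or amplification effects are ignored.

```python
PSTI_SITE = "CTGCAG"  # PstI recognition sequence


def site_positions(sequence, site=PSTI_SITE):
    """Return the start index of every occurrence of `site` in `sequence`."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def representation(sequence, site=PSTI_SITE):
    """Set of fragments flanked by two restriction sites (toy model)."""
    cuts = site_positions(sequence, site)
    return {sequence[a:b] for a, b in zip(cuts, cuts[1:])}


def dart_scores(genotypes):
    """Score each fragment as present (1) or absent (0) in every genotype."""
    reps = {name: representation(seq) for name, seq in genotypes.items()}
    all_fragments = set().union(*reps.values())
    return {
        fragment: {name: int(fragment in rep) for name, rep in reps.items()}
        for fragment in sorted(all_fragments)
    }


# Hypothetical sequences: genotype_b carries a SNP (G to T) in the middle site.
genotypes = {
    "genotype_a": "AAAACTGCAGGATTACACTGCAGTTTTCTGCAGCC",
    "genotype_b": "AAAACTGCAGGATTACACTGCATTTTTCTGCAGCC",
}
scores = dart_scores(genotypes)
for fragment, calls in scores.items():
    print(fragment, calls)
```

The SNP removes one restriction site in genotype_b, so the two short fragments seen in genotype_a are replaced by a single longer one. Each fragment therefore behaves as a dominant (present/absent) marker.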

Microarray-based DArT. DArT markers in a mixture of genomic representations from a pool of individuals covering the genetic diversity of the species are cloned into a vector that is introduced into E. coli to form a library. A selection of clones is arranged in a plate format with wells, amplified and spotted onto glass slides using a microarrayer to form a genotyping array.

Microarray-based DArT. Genotyping arrays are hybridised with genomic 'representations' of individual DNA samples prepared using the same complexity reduction method. Individual 'representations' are labelled with one fluorescent label, while the vector fragment is labelled with another fluorescent label to act as a reference. A marker is polymorphic if its relative hybridisation intensity across the genotyping arrays falls into distinct clusters. Hybridisation intensities and genotypic data are analysed with the DArTsoft software.
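The idea of a marker falling into distinct intensity clusters can be sketched with a simple one-dimensional split. This is an illustration only and is not the DArTsoft algorithm, which the slides do not describe. The input is assumed to be one log ratio of sample signal to reference signal per individual, and the thresholds `min_gap` and `min_cluster_size` are hypothetical.

```python
def split_at_largest_gap(values):
    """Split sorted values into two groups at the largest gap between neighbours.

    Returns (threshold, gap), where threshold is the midpoint of the gap.
    """
    ordered = sorted(values)
    gap, index = max(
        (ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)
    )
    threshold = (ordered[index] + ordered[index + 1]) / 2
    return threshold, gap


def is_polymorphic(log_ratios, min_gap=1.0, min_cluster_size=2):
    """Call a marker polymorphic if its log ratios form two separated clusters."""
    threshold, gap = split_at_largest_gap(log_ratios)
    low = [v for v in log_ratios if v < threshold]
    high = [v for v in log_ratios if v >= threshold]
    return gap >= min_gap and min(len(low), len(high)) >= min_cluster_size


# Hypothetical log ratios for three markers across a handful of individuals.
two_clusters = [-2.1, -1.9, -2.0, 1.8, 2.0, 2.2]
one_cluster = [0.1, 0.2, 0.0, 0.15, 0.05]
single_outlier = [0.0, 0.1, 0.05, 0.02, 5.0]

print(is_polymorphic(two_clusters))    # two well separated groups
print(is_polymorphic(one_cluster))     # no clear separation
print(is_polymorphic(single_outlier))  # one group is too small to trust
```

Requiring a minimum cluster size keeps a single aberrant spot from being scored as a polymorphism.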

DArT hybridisation across the array (http://www.diversityarrays.com). Each individual representation (target) will only hybridise to matching fragments on the genotyping array, thereby displaying a unique hybridisation pattern.

Eucalyptus DArT-array: development. Testing of several genome complexity reduction methods identified the PstI/TaqI method as the most effective. 18 genomic libraries were developed from PstI/TaqI representations of 64 different Eucalyptus species. 23,808 cloned DNA fragments were screened and 13,300 (56%) were found to be polymorphic among 284 individuals. The operational DArT array carries 7,680 DNA clones. All clones have been sequenced and made publicly available (Sansaloni et al. Plant Methods. 2010; 6: 16).

Eucalyptus DArT-array: development Sansaloni et al. Plant Methods. 2010; 6: 16

Eucalyptus DArT-array: validation and replication. 1,152 clones were also developed from a genomic library of BRASUZ1. Polymorphism test: 190 individuals with targets in full replication yielded 5,653 polymorphic markers (73.6%); average call rate and reproducibility were 93.7% and 99.7%, respectively. Linkage mapping test: 94 samples in full replication, including samples from six mapping pedigrees (15-16 samples each), yielded 2,211 polymorphic markers per pedigree on average.
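Call rate and reproducibility can be computed from replicated marker scores as sketched below. The definitions are assumptions, since the slides do not spell them out: call rate is taken as the fraction of non-missing scores, and reproducibility as the fraction of replicate pairs, with both scores called, that agree. The scores are made up for illustration.

```python
def call_rate(scores):
    """Fraction of scores that are not missing (missing is encoded as None)."""
    called = [s for s in scores if s is not None]
    return len(called) / len(scores)


def reproducibility(replicate_1, replicate_2):
    """Fraction of replicate pairs, both called, that give the same score."""
    pairs = [
        (a, b)
        for a, b in zip(replicate_1, replicate_2)
        if a is not None and b is not None
    ]
    agreeing = sum(a == b for a, b in pairs)
    return agreeing / len(pairs)


# Hypothetical presence (1) / absence (0) scores for one marker in eight
# samples, each assayed twice.
replicate_1 = [1, 0, 1, None, 1, 0, 1, 1]
replicate_2 = [1, 0, 1, 1, 1, 0, None, 0]

print(f"call rate: {call_rate(replicate_1 + replicate_2):.3f}")
print(f"reproducibility: {reproducibility(replicate_1, replicate_2):.3f}")
```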

Eucalyptus DArT-array: linkage mapping Sansaloni et al. Plant Methods. 2010; 6: 16

(1) Complexity reduction (PstI/TaqI) and PstI adapter ligation. (2) Labelling with fluorescence (Cy3/Cy5). (3) Hybridisation of the targets to the slide carrying the DArT probes. (4) Wash, scan and analyse with DArTsoft. (5) Production of the DArT score table. The DArT array yielded polymorphic markers.

NGS-based DArT. Combines DArT as a robust genome complexity reduction method with optimized barcoded representations of individual DNA samples for NGS. The PstI-site-specific adapter is tagged with up to 96 different barcodes, so that a plate of DNA samples can be encoded and run within a single lane on an Illumina GAIIx. The PstI adapter also includes a sequencing primer, so that the tags generated always read into the genomic fragments from the PstI sites. The analytical pipeline developed by DArT PL produces "DArT score" tables and "SNP" tables.
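Barcoded reads from a pooled lane have to be assigned back to samples before scoring. The sketch below shows one simple way to do this. It is not the DArT PL pipeline: the read layout (barcode immediately followed by the PstI overhang remnant TGCAG), the barcodes and the sample names are all assumptions made for illustration, and only exact barcode matches are accepted.

```python
PSTI_REMNANT = "TGCAG"  # assumed start of the genomic part of each read


def demultiplex(reads, barcodes):
    """Assign reads to samples by exact barcode match.

    `barcodes` maps barcode sequence to sample name. A read is accepted only
    if the barcode is followed by the expected restriction-site remnant; the
    barcode is trimmed off and the remnant is kept.
    """
    by_sample = {sample: [] for sample in barcodes.values()}
    unassigned = []
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode + PSTI_REMNANT):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            unassigned.append(read)
    return by_sample, unassigned


# Hypothetical barcodes and reads.
barcodes = {"ACGT": "sample_A", "TTAGC": "sample_B"}
reads = [
    "ACGTTGCAGAAACCC",   # sample_A
    "TTAGCTGCAGGGGTTT",  # sample_B
    "ACGTAAAAAAAA",      # barcode present but no PstI remnant
    "GGGGTGCAGAAA",      # unknown barcode
]
by_sample, unassigned = demultiplex(reads, barcodes)
print(by_sample)
print(unassigned)
```

Real barcode sets are usually designed so that a sequencing error in the barcode can be detected or corrected; that step is left out here.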

NGS-based DArT: marker segregation. A segregating population of 89 individuals was derived from the intra-specific cross BRASUZ1 x M4D31. Correct parentage of all individuals was certified by microsatellite genotyping. DNA samples of parents and progeny were processed for the conventional array-based DArT genotyping.

NGS-based DArT: linkage mapping. 148 million reads (76 bp) were generated. 2,835 polymorphic DArT markers were obtained, and an additional 3,341 SNPs were confidently genotyped. A total of 1,390 markers (1,065 DArT-NGS, 318 DArT markers and 7 SSR) were positioned on 10 chromosome scaffolds in the framework map.

Complexity reduction methods: PstI_ad/TaqI/HpaII_ad and PstI_ad/TaqI/HhaI_ad. PstI adaptor added with different barcodes and sequencing primers. Illumina GAIIx single-end sequencing, up to 96 samples per lane. FASTQ files (single-end reads, 76 bp). Alignment of sequences to the Eucalyptus reference genome. DArT NGS dominant polymorphic markers plus putatively scorable SNPs.

RAD-Seq (Baird et al. 2008. PLoS One. 3(10):e3376). A genomic representation of each individual DNA sample is generated with restriction enzymes, and adapters are ligated to the enzyme-cut fragments. Genomic representations from multiple individuals are pooled together and all fragments are randomly sheared. RAD tags may be present or absent in specific individuals depending on the presence or absence of the restriction site (dominant markers). Polymorphic positions detected within the aligned tags provide additional co-dominant SNP markers.

The process of RADSeq A-D: digestion with RE, shearing and adapter ligation. Davey JW & Blaxter ML. 2010. Briefings in Functional Genomics 9 (5-6)

The process of RADSeq E-G: PCR Amplification, Illumina Sequencing and demultiplexing

RAD-Seq genotyping. Produces stochastic count data and requires sensitive analysis to develop or genotype markers accurately. The data are subject to bias from restriction fragments, restriction site heterozygosity and PCR GC content. RAD loci affected by different sources of bias can be excluded or processed for accurate genotyping.

RAD-Seq: advantages. In principle it is unbiased with respect to many population genetics statistics (it avoids known issues of ascertainment bias in marker sets). Paired-end sequencing has been used in attempts to reduce GC bias. Read 2 sequences up- or downstream of a particular restriction site can be assembled into 300- to 600-bp contigs (which allows investigation of gene content). It typically produces thousands to tens of thousands of markers.

RAD-Seq: disadvantages. The accuracy of automatic analysis tools is not yet clear (Davey et al. June 2013). The vast majority of publicly available RAD-Seq data are derived from populations with no reference genome or sequence variation information, making it difficult to validate RAD-Seq marker sets in any depth.

RAD-Seq analysis. Typically proceeds by applying quality thresholds or likelihood ratio tests at multiple levels: filtering by read coverage, by observed patterns of heterozygosity (excessively high observed heterozygosity and deviations from Hardy-Weinberg proportions) or by segregation distortion (linkage mapping).
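One of the filters named above, deviation from Hardy-Weinberg proportions, can be sketched with a chi-square goodness-of-fit test on genotype counts. This is a generic textbook test rather than the procedure of any particular RAD-Seq tool; the significance threshold `alpha` and the genotype counts are hypothetical. The p-value uses the identity that, for one degree of freedom, the chi-square survival function equals erfc(sqrt(x / 2)).

```python
import math


def hwe_test(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of genotype counts against Hardy-Weinberg.

    Returns (chi_square, p_value, observed_het, expected_het).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi_square = sum(
        (o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0
    )
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value, n_ab / n, 2 * p * q


def keep_locus(n_aa, n_ab, n_bb, alpha=0.001):
    """Keep a locus unless it deviates from Hardy-Weinberg at level alpha."""
    _, p_value, _, _ = hwe_test(n_aa, n_ab, n_bb)
    return p_value >= alpha


# Hypothetical genotype counts at two loci in 100 individuals.
print(hwe_test(25, 50, 25))   # matches Hardy-Weinberg proportions exactly
print(hwe_test(0, 100, 0))    # every individual heterozygous
print(keep_locus(25, 50, 25), keep_locus(0, 100, 0))
```

A locus where every individual is heterozygous is a typical sign of paralogous sequences collapsed into one locus, which is why excess heterozygosity is used as a filter.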

RAD-Seq: analysis challenges. There is substantial variation in read depth, beyond the expectation that read depth per RAD locus would cluster around a single mean with variance approximating a Gaussian distribution, even at high coverage (this makes site-error modelling difficult). The lack of a per-site error model makes it hard to tell a real SNP apart from an error, because bases in the targeted region have different error rates.

RAD-Seq analysis solutions. While full statistical modeling of the effects biasing RAD data is not available, simple filters can be applied to discard the most affected RAD loci. If a reference genome of reasonable quality is available, GATK should be able to call accurate genotypes at almost all loci, even those with severely skewed read depths. In the absence of a reference genome, it may be possible to genotype RAD loci at heterozygous restriction sites accurately based on simultaneous assembly and genotype calling (see Cortex-Assembler).

Eucalyptus RAD-Seq. RAD-Seq of a moderate set of individuals of two contrasting species to discover highly informative SNPs. Assess the potential of RAD for direct genotyping-by-sequencing in Eucalyptus.

Eucalyptus RAD-Seq: sequencing design. Genomic representations of DNA samples of 18 unrelated trees for each of the two species (E. grandis and E. globulus) were generated using PstI. 6 sequencing bulks with six individuals per bulk gave a theoretical coverage of 5X per individual per species (~30X per bulked sample per species). A high-coverage (~30X) genomic representation of Brasuz1 was generated following the same restriction-based method. 76 bp single-end sequencing on a GAIIx [2 bulked samples per lane (= 3 lanes) + 1 lane for Brasuz1].

RAD Counter, University of Edinburgh, UK https://www.wiki.ed.ac.uk/display/radsequencing/home

Eucalyptus RAD-Seq: results. RAD Counter estimated 86,083 expected PstI sites. The estimate is close to the one derived directly from in silico digestion of the Brasuz1 genome (99,656 PstI sites). 74,258 RAD loci were generated across the genome. Distribution of the Brasuz1 PstI RAD tags: 73% of the PstI tags had a total coverage > 30X, and the remaining tags (27%) had at least 10X coverage.
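Counting restriction sites by in silico digestion amounts to scanning the genome sequence for the recognition motif. The sketch below is a minimal illustration on a toy sequence, not the Brasuz1 genome and not the RAD Counter tool. It relies on the PstI site CTGCAG being palindromic, so scanning one strand finds every site, and it assumes that each site can give rise to two RAD tags, one reading in each direction.

```python
def count_sites(sequence, site="CTGCAG"):
    """Count occurrences of a restriction site in a DNA sequence.

    The sequence is upper-cased first so that soft-masked (lower-case)
    regions are included.
    """
    sequence = sequence.upper()
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count


def max_rad_tags(n_sites):
    """Upper bound on RAD tags: one tag on each side of every site."""
    return 2 * n_sites


# Toy sequence with three PstI sites, one of them in lower case.
toy_genome = "aaCTGCAGttctgcagCTGCAG"
n_sites = count_sites(toy_genome)
print(n_sites, max_rad_tags(n_sites))
```

For a real genome the same function would be applied to each chromosome or scaffold sequence and the counts summed.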

Eucalyptus RAD-Seq: results. Out of the 99,656 PstI restriction sites predicted in silico, RAD successfully sampled 71,467 (72%), and 49,496 of the sites (49.7%) yielded sequence tags in both directions from the restriction site. 90.24% of the RAD tags mapped successfully to the Brasuz1 genome after BQSR (novoalign + GATK).

Eucalyptus RAD-Seq: results. Polymorphism test: 58,397 polymorphic markers (MQ > 20; DP > 15; no missing calls). 3,501 SNPs were simultaneously polymorphic in the two species. Polymorphic markers were placed in only 7,671 of the 74,258 RAD loci sampled, with an average of 2.24 SNPs per locus.

Sequence capture (RAPiD Target Seq) (Neves L et al. 2013. The Plant Journal 75(1)). Sequences specific target regions of the genome by capturing them. Capture probes are selected and designed to hybridize to unique, specific regions of interest. Capture probes are derived from assemblies of EST or RNA-Seq data, and capture efficiency is high for probes that do not overlap multiple exons.

Sequence capture (RAPiD Target Seq). Sequencing gives more flanking sequence for SNP identification and gene annotation. A pilot test was designed with 200 samples for high-coverage genotyping. 25,000 probes were derived from ssRNA-seq combined with high-coverage WGS sequencing (30X).

Sequence capture (nextRAD) (Johnson E & Etter P, unpublished). Relies on selective-primer PCR that only amplifies DNA fragments created by Nextera tagmentation that start with a particular sequence. This focuses the reads on particular loci throughout the genome. The researcher controls the frequency of the loci through the length and composition of the selective primer.

Sequence capture (nextRAD). Gives only one read per locus, instead of two divergent reads at a cut site, which is less redundant. Sequencing starts after the primer site, giving more flanking sequence for SNP identification. Run modes: a low-coverage scan at low (3X) or very low (<1X) coverage, or high coverage (25X) to get full genotypes of heterozygous loci. Prices are as low as $49/sample for up to 75,000 loci (requires a minimum of 380 samples).

Sequence capture (nextRAD). We are now analyzing real data shared by the company that developed the method. A pilot test was designed with 400 samples for high-coverage genotyping.

Low coverage WGS sequencing. Samples the whole genome of individuals to obtain maximal information about population genetic parameters. Divides the sequencing effort maximally among individuals, obtaining approximately one read per locus and individual. Bayesian population models support inference from lower coverage than is required for simple likelihood models.

Low coverage WGS sequencing. Major drawback: analyses require genetic parameters for individuals, i.e., inference of population genetic parameters (allele frequencies) from the observed sequence reads at loci, rather than relying only on the multiple steps of data cleaning, assembly and variant detection in the same data.

Low coverage WGS sequencing. The sequencing design must sample larger numbers of individuals, and analytical steps should accept the resulting lower sequence coverage at each site, to maximize the information obtained for populations. Analytical steps should use explicit multilocus models for population parameters. Sequence reads for each individual (i) and locus (j) should be modeled from the genotype (i.e. as independent stochastic samples), with a site-error model for possible sequence errors.
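The modelling idea above, reads as independent stochastic samples from a genotype with a site-error term, can be sketched for a single biallelic locus. This is a generic binomial genotype-likelihood model written for illustration and is not claimed to be the exact model of the cited work. The error rate, the Hardy-Weinberg prior and the example counts are assumptions.

```python
from math import comb


def alt_read_probability(genotype, error_rate):
    """Probability that one read shows the alternative allele.

    `genotype` is the number of alternative alleles (0, 1 or 2). A read
    samples one of the two alleles at random and is miscalled with
    probability `error_rate`.
    """
    alt_fraction = genotype / 2
    return alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate


def genotype_likelihoods(alt_reads, total_reads, error_rate=0.01):
    """Binomial likelihood of the read counts under each genotype."""
    likelihoods = []
    for genotype in (0, 1, 2):
        p = alt_read_probability(genotype, error_rate)
        likelihoods.append(
            comb(total_reads, alt_reads)
            * p ** alt_reads
            * (1 - p) ** (total_reads - alt_reads)
        )
    return likelihoods


def genotype_posteriors(alt_reads, total_reads, allele_freq, error_rate=0.01):
    """Posterior genotype probabilities with a Hardy-Weinberg prior."""
    f = allele_freq
    prior = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]
    joint = [
        lik * pr
        for lik, pr in zip(
            genotype_likelihoods(alt_reads, total_reads, error_rate), prior
        )
    ]
    total = sum(joint)
    return [j / total for j in joint]


# One individual, one locus, a single read showing the alternative allele.
print(genotype_likelihoods(alt_reads=1, total_reads=1))
print(genotype_posteriors(alt_reads=1, total_reads=1, allele_freq=0.5))
```

With a single read the genotype stays uncertain (heterozygote and alternative homozygote are almost equally probable), which is why low-coverage designs carry the uncertainty forward into the population-level estimate instead of making hard genotype calls.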

Low coverage WGS sequencing. Simulations should be used as a basis for analysing the trade-off between the number of sampled individuals and the depth of sequence coverage that can be achieved for a finite sequencing effort. Simulations that included stochastic variation around the expected sequence coverage led to estimates of allele frequencies very similar to those from simulations that used fixed, equal coverage among all individuals (Buerkle et al. Mol. Ecol. 2013 Jun;22(11)).
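The trade-off between number of individuals and depth can be explored with a small simulation. The sketch below is illustrative and does not reproduce the cited simulations: it uses fixed, equal coverage per individual, Hardy-Weinberg genotypes, a simple pooled estimator (the fraction of alternative reads across all reads) instead of a Bayesian model, and hypothetical values for the true allele frequency, error rate and designs. All three designs spend the same 200 reads on the locus.

```python
import random


def estimate_allele_frequency(true_freq, n_individuals, depth, error_rate, rng):
    """Simulate reads at one locus and return the pooled alt-read fraction."""
    alt_reads = 0
    total_reads = 0
    for _ in range(n_individuals):
        # Diploid genotype: number of alternative alleles (0, 1 or 2).
        genotype = (rng.random() < true_freq) + (rng.random() < true_freq)
        p_alt = (genotype / 2) * (1 - error_rate) + (1 - genotype / 2) * error_rate
        for _ in range(depth):
            alt_reads += rng.random() < p_alt
            total_reads += 1
    return alt_reads / total_reads


def rmse(true_freq, n_individuals, depth, error_rate=0.01, replicates=300, seed=1):
    """Root mean squared error of the estimator over repeated simulations."""
    rng = random.Random(seed)
    squared_errors = [
        (estimate_allele_frequency(true_freq, n_individuals, depth, error_rate, rng)
         - true_freq) ** 2
        for _ in range(replicates)
    ]
    return (sum(squared_errors) / replicates) ** 0.5


# Same sequencing effort (200 reads per locus) split in three ways.
designs = [(200, 1), (20, 10), (5, 40)]  # (individuals, depth per individual)
results = {design: rmse(0.3, *design) for design in designs}
for (n_individuals, depth), error in results.items():
    print(f"{n_individuals:>3} individuals at {depth:>2}X: RMSE = {error:.3f}")
```

Under these assumptions the error of the allele-frequency estimate grows as the same effort is concentrated on fewer individuals, in line with the recommendation to sample many individuals at low coverage.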

Low coverage WGS sequencing. Cost-saving measures: if inferences are to be made on populations (e.g., allele frequencies and derived statistics or parameters), little information is lost by labeling all individuals in a pool with the same barcode. In applications where researchers need to recover information on allelic states in individuals (e.g., linkage disequilibria among loci), pooling will be undesirable.

Eucalyptus low coverage WGS sequencing. Don't miss our next class!

Molecular Ecology Special Issue: Genotyping by Sequencing in Ecological and Conservation Genomics. Volume 22, Issue 11, June 2013

Eucalyptus Genomic Selection Project. Acknowledgments: Dario Grattapaglia (Leader), Marcos Resende, Marcio Resende Jr., Roberto Togawa, Orzenil Silva-Junior. DArT Array Development Team: Carolina Sansaloni, Cesar Petroli, Danielle Faria. University of Tasmania: Rene Vaillancourt, Dorothy Steane. University of Pretoria: Zander Myburg. Karina Zamprogno, Alexandre Missiaggia, Elizabete Takahashi. Funding: Brazilian Ministry of Science and Technology (FINEP, CNPq), EMBRAPA competitive grants, BIOTEC Mercosur, FAP-DF, forest companies.