Overview of methodologies based on second-generation sequencing for the identification of DNA polymorphisms and large-scale genotyping. Orzenil Bonfim da Silva Junior, Embrapa Recursos Genéticos e Biotecnologia. MCBio: Computational models for establishing methodological means and procedures for data analysis in bioinformatics. PA 3: Systematization and application of analysis models for genome-scale association experiments.
SNP interrogation and detection methods. Locus-specific genotyping assays are designed to capture information about one given position in the genome, and they typically require specific oligonucleotides. With the rapid decrease in sequencing costs, one could in principle simply re-sequence entire genomes, but the cost of sequencing a whole genome and the complexity of assembling it are still too high.
Cost per Raw Megabase of DNA Sequence. The cost of sequencing has dropped substantially, but library construction still dominates the total cost. Genome complexity can be reduced! DNA samples can be pooled!
SNP interrogation and detection methods. Finding variation in a sample-level working dataset is not sufficient to generalize to the population level, for two reasons: 1. The variation could be specific to the individual rather than general to the population. 2. The variation could be due to artifacts (library preparation, sequencing errors, analytical errors). Hence large-scale replication is needed to validate the finding statistically and to distinguish real variation from sequencing artifacts. Pooling is key to sequencing at scale at a reasonable cost.
Genotyping by Sequencing. 1. Restriction digestion: Diversity Arrays Technology (DArT), Restriction site Associated DNA (RAD-Seq), GBS (Buckler Lab.). 2. Sequence capture: selective primer PCR for Nextera tagmentation (nextrad) or capture probes. 3. Low-coverage WGS data from pooled samples.
DArT (Jaccoud et al. 2001. Nucleic Acids Res. 29(4)). Proved robust to genome size and ploidy-level differences among approximately 60 organisms, including "orphan crops". Combines genome complexity reduction methods that enrich for genic regions with a highly parallel assay readout on a number of "open-access" microarray platforms. Enabled a number of applications in which allelic frequencies can be estimated, reflecting the level of DNA sequence variation in the tested loci.
DArT polymorphism and variant test. A single DArT assay tests tens of thousands of genomic loci for polymorphism, and the final number of markers reported reflects the level of DNA sequence variation in the tested loci. SNP interrogation in DArT is mediated by the high fidelity of restriction enzymes rather than by primer annealing. DArT performs well in polyploid species such as wheat, banana or sugarcane, and does not require any existing DNA sequence information.
Microarray-based DArT. Captures a defined set of fragments from a genomic DNA sample, generated by restriction-enzyme digestion (a genomic representation). SNP and InDel polymorphisms at or between restriction-enzyme sites determine whether or not an individual fragment is captured in the representation of a particular genotype (a DArT marker).
Microarray-based DArT. DArT markers in a mixture of genomic representations from a pool of individuals covering the genetic diversity of the species are cloned into a vector, which is introduced into E. coli to form a library. A selection of clones is arrayed in multi-well plates, amplified, and spotted onto glass slides with a microarrayer to form a genotyping array.
Microarray-based DArT. Genotyping arrays are hybridised with genomic 'representations' of individual DNA samples prepared using the same complexity reduction method. Individual 'representations' are labelled with one fluorescent label, while the vector fragment is labelled with another fluorescent label to act as a reference. A marker is polymorphic if its relative hybridisation intensities across the genotyping arrays fall into distinct clusters. Hybridisation intensities and genotypic data are analysed with the DArTsoft software.
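The clustering criterion can be illustrated with a minimal sketch. It is not the DArTsoft algorithm, which the slides do not describe: it assumes that each clone has one sample-label and one reference-label intensity per sample, takes the log2 ratio, splits the ratios with a plain 1-D 2-means loop, and calls the clone polymorphic when the gap between clusters exceeds a hypothetical threshold `min_gap`. The function names are illustrative.

```python
import math

def two_means_1d(values, iterations=50):
    """Split 1-D values into a low and a high cluster with a plain 2-means loop."""
    lo, hi = min(values), max(values)
    low, high = list(values), []
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high:  # all values identical: a single cluster
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):  # assignments are stable
            break
        lo, hi = new_lo, new_hi
    return low, high

def score_clone(target, reference, min_gap=1.0):
    """Return (is_polymorphic, scores) for one clone across samples.

    target, reference: per-sample intensities of the sample label and of the
    reference (vector) label. min_gap is an assumed threshold in log2 units.
    """
    ratios = [math.log2(t / r) for t, r in zip(target, reference)]
    low, high = two_means_1d(ratios)
    if not high or min(high) - max(low) < min_gap:
        return False, [None] * len(ratios)  # no clear clusters: not scored
    cut = (max(low) + min(high)) / 2
    return True, [1 if r > cut else 0 for r in ratios]
```

A clone whose ratios form two well-separated groups is scored 0/1 per sample (a dominant marker); a clone with a single group is reported as not polymorphic.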
DArT hybridisation across the array (http://www.diversityarrays.com). Each individual representation (target) hybridises only to matching fragments on the genotyping array, thereby displaying a unique hybridisation pattern.
Eucalyptus DArT-array: development. Testing of several genome complexity reduction methods identified the PstI/TaqI method as the most effective. 18 genomic libraries were developed from PstI/TaqI representations of 64 different Eucalyptus species. 23,808 cloned DNA fragments were screened and 13,300 (56%) were found to be polymorphic among 284 individuals. The operational DArT array carries 7,680 DNA clones. All clones have been sequenced and made publicly available (Sansaloni et al. Plant Methods. 2010; 6: 16).
Eucalyptus DArT-array: development Sansaloni et al. Plant Methods. 2010; 6: 16
Eucalyptus DArT-array: validation and replication. An additional 1,152 clones were developed from a genomic library of BRASUZ1. Polymorphism test: 190 individuals with targets in full replication yielded 5,653 polymorphic markers (73.6%); average call rate and reproducibility were 93.7% and 99.7%, respectively. Linkage mapping test: 94 samples in full replication, including samples from six mapping pedigrees (15-16 samples each), yielded 2,211 polymorphic markers per pedigree on average.
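The two quality metrics can be sketched as follows. The slides do not define them, so this sketch assumes that call rate is the fraction of non-missing scores and that reproducibility is the fraction of replicate pairs, both called, that agree. Scores are 0/1 with `None` for a missing call; the function names are illustrative.

```python
def call_rate(scores):
    """Fraction of scores that are not missing (None)."""
    called = sum(1 for s in scores if s is not None)
    return called / len(scores)

def reproducibility(replicate_a, replicate_b):
    """Fraction of replicate pairs with identical scores.

    Pairs in which either replicate is missing are ignored.
    """
    pairs = [(a, b) for a, b in zip(replicate_a, replicate_b)
             if a is not None and b is not None]
    if not pairs:
        return float("nan")
    return sum(1 for a, b in pairs if a == b) / len(pairs)

# One marker scored twice on the same five samples.
rep_a = [1, 0, None, 1, 1]
rep_b = [1, 0, 1, None, 0]
marker_call_rate = call_rate(rep_a + rep_b)
marker_reproducibility = reproducibility(rep_a, rep_b)
```

Averaging these per-marker values over all markers gives array-level figures comparable in kind to the 93.7% and 99.7% reported on the slide.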
Eucalyptus DArT-array: linkage mapping Sansaloni et al. Plant Methods. 2010; 6: 16
1. Complexity reduction (PstI/TaqI) and PstI adapter ligation. 2. Labelling with fluorescence (Cy3/Cy5). 3. Hybridisation of the targets to the slide carrying the DArT probes. 4. Washing, scanning and analysis with DArTsoft. 5. Production of the DArT score table. The DArT array yielded polymorphic markers.
NGS-based DArT. Combines DArT as a robust genome complexity reduction method with an optimized barcoded representation of individual DNA samples for NGS. The PstI-site-specific adapter is tagged with up to 96 different barcodes, so that a full plate of DNA samples can be encoded and run within a single lane on an Illumina GAIIx. The PstI adapter also includes a sequencing primer, so that the tags generated always read into the genomic fragments from the PstI sites. The analytical pipeline developed by DArT PL produces "DArT score" tables and "SNP" tables.
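Barcode-based sample assignment can be sketched as below. This is not the DArT PL pipeline: it assumes, for illustration, that each read starts with the sample barcode followed by the remnant of the PstI site (`TGCAG`, the part of CTGCAG left after the cut) and that barcodes must match exactly. The barcode sequences and sample names are made up.

```python
PSTI_REMNANT = "TGCAG"  # bases of the PstI site (CTGCA^G) expected after the barcode

def demultiplex(reads, barcodes):
    """Assign reads to samples by exact barcode + PstI remnant match.

    reads: iterable of read sequences.
    barcodes: dict mapping barcode sequence -> sample name (hypothetical).
    Returns (reads per sample with the barcode trimmed, unassigned reads).
    """
    by_sample = {sample: [] for sample in barcodes.values()}
    unassigned = []
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    ordered = sorted(barcodes, key=len, reverse=True)
    for read in reads:
        for barcode in ordered:
            if read.startswith(barcode + PSTI_REMNANT):
                by_sample[barcodes[barcode]].append(read[len(barcode):])
                break
        else:
            unassigned.append(read)
    return by_sample, unassigned

barcodes = {"ACGT": "sample_01", "TTGCA": "sample_02"}
reads = ["ACGTTGCAGAAAC", "TTGCATGCAGCCC", "GGGGTGCAGAAA", "ACGTAAAAA"]
by_sample, unassigned = demultiplex(reads, barcodes)
```

Requiring the restriction-site remnant right after the barcode discards reads that do not start at a PstI site; a production tool would also tolerate sequencing errors in the barcode.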
NGS-based DArT: marker segregation. A segregating population of 89 individuals was derived from the intra-specific cross BRASUZ1 x M4D31. Correct parentage of all individuals was confirmed by microsatellite genotyping. DNA samples of parents and progeny were processed for conventional array-based DArT genotyping.
NGS-based DArT: linkage mapping. 148 million reads (76 bp) were generated, yielding 2,835 polymorphic DArT markers; an additional 3,341 SNPs were confidently genotyped. A total of 1,390 markers (1,065 DArT-NGS markers, 318 DArT markers and 7 SSRs) were positioned on 10 chromosome scaffolds in the framework map.
1. Complexity reduction methods: PstI_ad/TaqI/HpaII_ad and PstI_ad/TaqI/HhaI_ad, with PstI adaptors carrying different barcodes and sequencing primers. 2. Illumina GAIIx single-end sequencing, up to 96 samples per lane. 3. FASTQ files (single-end reads, 76 bp). 4. Alignment of the sequences to the Eucalyptus reference genome. 5. DArT-NGS dominant polymorphic markers plus putatively scorable SNPs.
RAD-Seq (Baird et al. 2008. PLoS One. 3(10):e3376). A genomic representation of each individual DNA sample is generated with restriction enzymes, and adapters are ligated to the enzyme-cut fragments. Genomic representations from multiple individuals are pooled together and all fragments are randomly sheared. RAD tags may be present or absent in specific individuals depending on the presence or absence of the restriction-enzyme site (dominant markers). Polymorphic positions detected within the aligned tags provide additional co-dominant SNP markers.
The process of RADSeq, steps A-D: restriction enzyme (RE) digestion, shearing and adapter ligation (Davey JW & Blaxter ML. 2010. Briefings in Functional Genomics 9(5-6)).
The process of RADSeq, steps E-G: PCR amplification, Illumina sequencing and demultiplexing.
RAD-Seq. Genotyping produces stochastic count data and requires sensitive analysis to develop or genotype markers accurately. The data are biased by restriction fragment length, restriction site heterozygosity and PCR GC content. RAD loci affected by the different sources of bias can be excluded or processed for accurate genotyping.
RAD-Seq: advantages. In principle it is unbiased with respect to many population genetics statistics (it avoids the known ascertainment bias of predefined marker sets). Paired-end sequencing has been used in attempts to reduce GC bias. Read 2 sequences up- or downstream of a particular restriction site can be assembled into 300- to 600-bp contigs, which allows investigation of gene content. It typically produces thousands to tens of thousands of markers.
RAD-Seq: disadvantages. The accuracy of automatic analysis tools is not yet clear (Davey et al. June 2013). The vast majority of publicly available RAD-Seq data are derived from populations with no reference genome or sequence variation information, making it difficult to validate RAD-Seq marker sets in any depth.
RAD-Seq analysis. Analysis typically proceeds by applying quality thresholds or likelihood ratio tests at multiple levels: filtering by read coverage, by patterns of heterozygosity (excessively high observed heterozygosity and deviations from Hardy-Weinberg proportions) or by segregation distortion (linkage mapping).
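A locus filter of this kind can be sketched with genotype counts per locus. The thresholds (`min_depth`, `max_het`, `alpha`) are illustrative assumptions, not values from the slides, and the Hardy-Weinberg check is a plain chi-square test with one degree of freedom, whose tail probability equals `erfc(sqrt(chi2 / 2))`.

```python
import math

def hwe_test(n_aa, n_ab, n_bb):
    """Observed heterozygosity, chi-square statistic and p-value (1 d.f.)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped individuals at this locus")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return n_ab / n, chi2, p_value

def keep_locus(n_aa, n_ab, n_bb, depth,
               min_depth=15, max_het=0.7, alpha=0.001):
    """Apply coverage, heterozygosity and Hardy-Weinberg filters in turn."""
    if depth < min_depth:
        return False
    het, _, p_value = hwe_test(n_aa, n_ab, n_bb)
    return het <= max_het and p_value >= alpha
```

A locus where every individual appears heterozygous, as happens when paralogous regions collapse into one RAD locus, fails both the heterozygosity and the Hardy-Weinberg filter.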
RAD-Seq: analysis challenges. Read depth varies substantially, well beyond the expectation that depth per RAD locus clusters around a single mean with approximately Gaussian variance, even at high coverage; this makes site-error modelling difficult. Without a per-site error model it is hard to tell a real SNP apart from an error, because bases in the targeted region have different error rates.
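One simple way to see the excess variation is the variance-to-mean ratio of read depth across loci, which is close to 1 if depth behaves like independent Poisson sampling around a single mean. The sketch below computes that ratio and flags loci whose depth is far from the median; the cut-offs `low` and `high` and the locus names are illustrative assumptions.

```python
import statistics

def dispersion_index(depths):
    """Variance-to-mean ratio of read depth; about 1 under Poisson sampling."""
    return statistics.variance(depths) / statistics.mean(depths)

def flag_extreme_loci(depth_by_locus, low=0.5, high=2.0):
    """Names of loci with depth below low*median or above high*median."""
    median = statistics.median(depth_by_locus.values())
    return sorted(locus for locus, depth in depth_by_locus.items()
                  if depth < low * median or depth > high * median)

depth_by_locus = {"locus_a": 30, "locus_b": 28, "locus_c": 5,
                  "locus_d": 90, "locus_e": 31}
overdispersion = dispersion_index(list(depth_by_locus.values()))
extreme = flag_extreme_loci(depth_by_locus)
```

A ratio far above 1 indicates that a single-mean depth model is inadequate, and the flagged loci are candidates for exclusion or for separate treatment.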
RAD-Seq analysis solutions. While full statistical modelling of the effects biasing RAD data is not available, simple filters can be applied to discard the most affected RAD loci. If a reference genome of reasonable quality is available, GATK should be able to call accurate genotypes at almost all loci, even those with severely skewed read depths. In the absence of a reference genome, it may be possible to genotype RAD loci at heterozygous restriction sites accurately through simultaneous assembly and genotype calling (see Cortex-Assembler).
Eucalyptus RAD-Seq. RAD-Seq of a moderate set of individuals of two contrasting species, to discover highly informative SNPs and to assess the potential of RAD for direct genotyping-by-sequencing in Eucalyptus.
Eucalyptus RAD-Seq: sequencing design. Genomic representations of DNA samples from 18 unrelated trees of each of the two species (E. grandis and E. globulus) were generated using PstI. 6 sequencing bulks with six individuals per bulk, giving a theoretical coverage of 5X per individual of each species (~30X per bulked sample). A high-coverage (~30X) genomic representation of Brasuz1 was generated following the same restriction-based method. 76-bp single-end sequencing on a GAIIx [2 bulked samples multiplexed per lane (= 3 lanes) + 1 lane for Brasuz1].
RAD Counter, University of Edinburgh, UK https://www.wiki.ed.ac.uk/display/radsequencing/home
Eucalyptus RAD-Seq: results. RAD Counter estimated 86,083 expected PstI sites. This estimate is close to the one derived directly from in silico digestion of the Brasuz1 genome (99,656 PstI sites). 74,258 RAD loci were generated across the genome. Distribution of the Brasuz1 PstI RAD tags: 73% of the PstI tags had a total coverage > 30X, and the remaining tags (27%) had at least 10X coverage.
Eucalyptus RAD-Seq: results. Out of the 99,656 PstI restriction sites predicted in silico, RAD successfully sampled 71,467 (72%), and 49,496 of the sites (49.7%) yielded sequence tags in both directions out of the restriction site. 90.24% of the RAD tags mapped successfully to the Brasuz1 genome after BQSR (novoalign + GATK).
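An in silico digestion amounts to locating every occurrence of the enzyme's recognition sequence in the reference. The sketch below does this for PstI (CTGCAG, which is its own reverse complement, so scanning one strand is enough) and recomputes the sampling fractions from the counts on the slide. The toy sequence and function names are illustrative; a real genome would be read from a FASTA file.

```python
import re

PSTI_SITE = "CTGCAG"  # palindromic, so one strand covers both orientations

def find_sites(sequence, site=PSTI_SITE):
    """0-based start positions of every (possibly overlapping) site."""
    pattern = re.compile(f"(?={re.escape(site)})")
    return [match.start() for match in pattern.finditer(sequence.upper())]

def fraction(count, total):
    """Share of predicted sites, rounded to three decimals."""
    return round(count / total, 3)

toy_sequence = "aaCTGCAGttttCTGCAGg"
positions = find_sites(toy_sequence)

predicted_sites = 99_656   # in silico PstI sites in Brasuz1 (from the slide)
sampled = fraction(71_467, predicted_sites)
both_directions = fraction(49_496, predicted_sites)
```

Each site can yield two RAD tags, one reading in each direction, which is why the sites sampled in both directions are counted separately.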
Eucalyptus RAD-Seq: results. Polymorphism test: 58,397 polymorphic markers (MQ > 20; DP > 15; no missing calls). 3,501 SNPs were simultaneously polymorphic in the two species. Polymorphic markers were placed in only 7,671 of the 74,258 RAD loci sampled, with an average of 2.24 SNPs per locus.
Sequence capture (RAPiD Target Seq) (Neves L et al. 2013. The Plant Journal 75(1)). Targets sequence-specific regions of the genome by capturing them. Capture probes are selected and designed to hybridize to unique, specific regions of interest. Capture probes are derived from assemblies of EST or RNA-Seq data, and capture efficiency is high for probes that do not overlap multiple exons.
Sequence capture (RAPiD Target Seq). Sequencing gives more flanking sequence for SNP identification and gene annotation. A pilot test was designed including 200 samples for high-coverage genotyping. 25,000 probes were derived from ssRNA-seq combined with high-coverage (30X) WGS sequencing.
Sequence capture (nextrad) (Johnson E & Etter P, not published). Relies on selective primer PCR that amplifies only those DNA fragments created by Nextera tagmentation that start with a particular sequence. This focuses the reads on particular loci throughout the genome. The researcher controls the frequency of the loci through the length and composition of the selective primer.
Sequence capture (nextrad). Gives only one read per locus, instead of two divergent reads at a cut site, which is less redundant. Sequencing starts after the primer site, giving more flanking sequence for SNP identification. Run modes: low-coverage scan, sequencing at low (3X) or very low (<1X) coverage, or high coverage (25X) to obtain full genotypes at heterozygous loci. Prices as low as $49/sample for up to 75,000 loci (requires a minimum of 380 samples).
Sequence capture (nextrad). We are now analyzing real data shared by the company that developed the method. A pilot test was designed including 400 samples for high-coverage genotyping.
Low coverage WGS sequencing. Samples the whole genome of individuals to obtain maximal information about population genetic parameters. Divides the sequencing effort maximally among individuals, obtaining approximately one read per locus and individual. Bayesian population models support inference from lower coverage than is required for simple likelihood models.
Low coverage WGS sequencing. Major drawback: genotypes of individuals cannot be called with confidence, so analyses must infer population genetic parameters (allele frequencies) from the observed sequence reads at each locus, rather than relying only on the usual multiple steps of data cleaning, assembly and variant detection in the same data.
Low coverage WGS sequencing. The sequencing design must sample larger numbers of individuals, and the analytical steps should accept the resulting lower sequence coverage at each site, to maximize the information obtained for populations. The analytical steps should use explicit multilocus models for population parameters. Sequence reads for each individual (i) and locus (j) should be modeled as independent stochastic samples from the genotype, with a site-error model for possible sequence errors.
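A single-locus version of such a model can be sketched as follows. It is a simplification, not the Bayesian multilocus model the slides refer to: reads are treated as binomial draws given the genotype, with one symmetric error rate `error` (assumed greater than 0), genotypes are assumed to follow Hardy-Weinberg proportions, and the allele frequency is estimated by expectation-maximisation. All names are illustrative.

```python
from math import comb

def genotype_likelihoods(n_ref, n_alt, error=0.01):
    """P(reads | genotype) for 0, 1 and 2 copies of the alternative allele."""
    n = n_ref + n_alt
    likelihoods = []
    for copies in (0, 1, 2):
        # Probability that one read shows the alternative allele.
        p_alt = (copies / 2) * (1 - error) + (1 - copies / 2) * error
        likelihoods.append(comb(n, n_alt) * p_alt ** n_alt * (1 - p_alt) ** n_ref)
    return likelihoods

def em_allele_frequency(read_counts, error=0.01, iterations=100, tol=1e-8):
    """Estimate the alternative-allele frequency from per-individual reads.

    read_counts: list of (n_ref, n_alt) tuples, one per individual.
    """
    p = 0.5
    for _ in range(iterations):
        expected_copies = 0.0
        prior = ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)  # Hardy-Weinberg
        for n_ref, n_alt in read_counts:
            likelihoods = genotype_likelihoods(n_ref, n_alt, error)
            posterior = [l * pr for l, pr in zip(likelihoods, prior)]
            total = sum(posterior)
            expected_copies += (posterior[1] + 2 * posterior[2]) / total
        new_p = expected_copies / (2 * len(read_counts))
        if abs(new_p - p) < tol:
            return new_p
        p = new_p
    return p
```

Because the genotype of each individual is integrated out rather than called, the estimate remains usable when most individuals contribute a single read at the locus.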
Low coverage WGS sequencing. Simulations should be used as a basis for analysing the trade-off between the number of sampled individuals and the depth of sequence coverage that can be achieved for a finite sequencing effort. Simulations that included stochastic variation around the expected sequence coverage led to estimates of allele frequencies very similar to those from simulations with fixed, equal coverage among all individuals (Buerkle et al. Mol. Ecol 2013 Jun;22(11)).
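The trade-off can be explored with a small simulation. This sketch is not the simulation of the cited paper: it uses fixed, equal coverage, a single locus, Hardy-Weinberg genotypes, a symmetric read error, and the plain fraction of alternative reads as the allele-frequency estimator. For a fixed budget of reads it compares the root-mean-square error of designs that differ in the number of individuals and the depth per individual.

```python
import random

def simulate_estimate(true_p, n_individuals, depth, error, rng):
    """Fraction of alternative reads across all individuals in one replicate."""
    alt_reads = 0
    for _ in range(n_individuals):
        # Diploid genotype: two independent allele draws.
        copies = int(rng.random() < true_p) + int(rng.random() < true_p)
        p_alt = (copies / 2) * (1 - error) + (1 - copies / 2) * error
        alt_reads += sum(1 for _ in range(depth) if rng.random() < p_alt)
    return alt_reads / (n_individuals * depth)

def rmse(true_p, n_individuals, depth, error=0.01, replicates=2000, seed=1):
    """Root-mean-square error of the estimator over simulated replicates."""
    rng = random.Random(seed)
    squared = 0.0
    for _ in range(replicates):
        estimate = simulate_estimate(true_p, n_individuals, depth, error, rng)
        squared += (estimate - true_p) ** 2
    return (squared / replicates) ** 0.5

# Same budget of 100 reads at the locus, split two ways.
wide_design = rmse(0.3, n_individuals=100, depth=1)
deep_design = rmse(0.3, n_individuals=10, depth=10)
```

Under these assumptions the design with many individuals at one read each gives the smaller error, because sampling more chromosomes reduces the variance more than re-reading the same ones.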
Low coverage WGS sequencing. Cost-saving measures: if inferences are to be made on populations (e.g., allele frequencies and derived statistics or parameters), little information is lost by labeling all individuals in a pool with the same barcode. In applications where researchers need to recover information on allelic states in individuals (e.g., linkage disequilibria among loci), pooling will be undesirable.
Eucalyptus low coverage WGS sequencing. Don't miss our next class!
Molecular Ecology Special Issue: Genotyping by Sequencing in Ecological and Conservation Genomics. Volume 22, Issue 11, June 2013.
Eucalyptus Genomic Selection Project. Acknowledgments: Dario Grattapaglia (Leader), Marcos Resende, Marcio Resende Jr., Roberto Togawa, Orzenil Silva-Junior. DArT Array Development Team: Carolina Sansaloni, Cesar Petroli, Danielle Faria. University of Tasmania: Rene Vaillancourt, Dorothy Steane. University of Pretoria: Zander Myburg. Karina Zamprogno, Alexandre Missiaggia, Elizabete Takahashi. Funding: Brazilian Ministry of Science and Technology (FINEP, CNPq), EMBRAPA competitive grants, BIOTEC Mercosur, FAP-DF, forest companies.