Introduction to GBS. Hueber Yann, Alexis Dereeper, Gautier Sarah, François Sabot, Vincent Ranwez, Jean-François Dufayard 02/11/2015

1 Introduction to GBS. Hueber Yann, Alexis Dereeper, Gautier Sarah, François Sabot, Vincent Ranwez, Jean-François Dufayard 02/11/2015

2 Index: Definition; Methodologies; The RADseq (single, paired-end) example; Bioinformatic implications; Applications (pros, cons); Pipelines

3 Definition: GBS = Genotyping by Sequencing. Genotyping of thousands of variants (SNPs, indels) for many individuals. Reduction of genome complexity → use of restriction enzymes. Use of NGS technologies (Illumina, etc.).

4 Why? Hundreds of individuals sequenced simultaneously (multiplexing individuals on the same lane). Thousands/billions of markers. Genome sampling → attractive prices.

5 Complexity reduction: With one (or several) restriction enzymes. Large choice of enzymes: recognition sites of different sizes; sensitivity to methylation (target genic regions, avoid repeated regions). Figure labels: restriction site, gDNA.
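
The idea of sampling the genome at restriction sites can be illustrated with an in-silico digest. The sketch below is a toy under stated assumptions: the helper `digest`, the toy sequence, and the choice of the PstI recognition site (CTGCAG, cut after CTGCA) are illustrative and do not come from the slides.

```python
def digest(sequence, site="CTGCAG", cut_offset=5):
    """Cut `sequence` at every occurrence of `site` and return the fragments.

    `cut_offset` is the position of the cut inside the site
    (5 means CTGCA^G, the PstI pattern assumed here).
    Only the given strand is scanned, which is enough for a palindromic site.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # trailing fragment after the last cut
    return fragments


# Hypothetical toy sequence containing two recognition sites.
toy_genome = "AAAACTGCAGTTTTTTCTGCAGCC"
fragments = digest(toy_genome)
fragment_lengths = [len(f) for f in fragments]
```

Only the fragment ends next to a cut are sequenced, so an enzyme with a longer (rarer) recognition site produces fewer loci to sequence per individual.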

6 «GBS» methodologies: Preparation of samples. Restriction site associated DNA markers (RAD); double digest RAD (ddRAD); genotyping by sequencing (GBS); reduced-representation library (RRL).

7 RAD: Digestion. Digestion: fragmentation of gDNA with a restriction enzyme. Workflow steps: Digestion, Ligation, Pooling, Random shear, Size selection, Ligation.

8 RAD: Ligation. Ligation: add a common adapter + barcode. Barcode = sequence of 4 to 8 bases that identifies the individual. Workflow steps: Digestion, Ligation, Pooling, Random shear, Size selection, Ligation. Figure labels: adapter 1, barcode, restriction site, gDNA.

9 RAD: Pooling, random shear, size selection. Workflow steps: Digestion, Ligation, Pooling, Random shear, Size selection, Ligation.

10 RAD: Ligation and PCR. Workflow steps: Digestion, Ligation, Pooling, Random shear, Size selection, Ligation, PCR. Figure labels: adapter 1, barcode, restriction site, gDNA, adapter 2.

11 RAD: single-end vs paired-end. Figure labels: single-end, paired-end, gDNA, restriction site, forward read, reverse read. SE: < 300 bp; PE: 300 bp to 500 bp.

12 RAD: paired-end contig

13 RAD: paired-end contig

14 RAD: single-end vs paired-end

15 FASTQ file: example (first 2 reads). Labels: Read indiv 1, Read indiv. Read header: ST1085:185:C30RAACXX:6:1101:2648:2087 1:N:0: Sequence: TGCTTTGCAGCGTGATAAAGGTTTGCCAGAGAAGCTGCAGGCTCGCTCTCCTGGCGAATC Read header: ST1085:185:C30RAACXX:6:1101:2614:2089 1:N:0: Sequence: ATAGATTGCAGCTGCCACTGCCGCAGCTGCCTCCCCTTCTCCTCTTCCTCGCTTCTTCCC Separator and quality string: + ?@@DFFFDFHHGH>EGGIDEHIGIDGI>?DBB9DGGADFBBF@GGH4BAH@G@FBDCAEF Figure labels: barcode, restriction site, gDNA.
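
A FASTQ record is four lines: a header starting with `@`, the sequence, a `+` separator, and a quality string of the same length as the sequence. The following is a minimal parsing sketch, not a replacement for a dedicated library. The record used is made up for illustration, and the Phred+33 quality encoding is an assumption.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def phred33(quality):
    """Convert a quality string to integer scores, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality]


# Hypothetical record: read name, bases and qualities are invented.
toy_file = io.StringIO("@read1\nTGCTTTGCAGCG\n+\nIIIIIIIIII##\n")
records = list(read_fastq(toy_file))
scores = phred33(records[0][2])
```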

16 Quality control: Filter reads by quality/length. Remove common adapters. Keep reads with no sequencing errors in the barcode + cutting site? If paired data: keep read 1 and read 2 in the same order in the files fastq1 and fastq2. Check quality with FastQC → Tools: cutadapt, Trimmomatic, etc.
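
The read-level checks listed above can be sketched as a single predicate. Everything specific here is an assumption for illustration: the function names, the thresholds (`min_length`, `min_mean_quality`), the Phred+33 encoding, the barcodes, and the `TGCAG` cut-site remnant expected right after the barcode. In practice, tools such as cutadapt or Trimmomatic would do this work.

```python
def mean_quality(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_qc(sequence, quality, barcodes, cut_site="TGCAG",
              min_length=30, min_mean_quality=20):
    """Return True if the read is long enough, of sufficient mean quality,
    and starts with an exact known barcode followed by the cut-site remnant."""
    if len(sequence) < min_length:
        return False
    if mean_quality(quality) < min_mean_quality:
        return False
    return any(sequence.startswith(bc + cut_site) for bc in barcodes)


# Hypothetical barcodes and reads.
barcodes = ["TGCTT", "ATAGAT"]
good_read = "TGCTT" + "TGCAG" + "A" * 25
bad_barcode_read = "TGATT" + "TGCAG" + "A" * 25

good_ok = passes_qc(good_read, "I" * 35, barcodes)
bad_barcode_ok = passes_qc(bad_barcode_read, "I" * 35, barcodes)
low_quality_ok = passes_qc(good_read, "#" * 35, barcodes)
```

Requiring an exact barcode + cut site is the strict option raised as a question on the slide; it discards reads with a single error in those positions.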

17 Demultiplexing: Obtain a unique FASTQ file for each individual. Inputs: FASTQ (reads corresponding to n individuals) and a barcode file (correspondence between individual NAME <--> barcode). Tools: fastx_splitter.pl, TASSEL, STACKS. Outputs: FASTQ (indiv 1), FASTQ (indiv 2), FASTQ (indiv n). Barcodes are removed, but restriction sites are not!
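
A minimal demultiplexing sketch, assuming exact barcode matches at the start of each read. The function name, the barcode-to-sample mapping, and the reads are hypothetical; dedicated tools are more robust, for example with respect to sequencing errors in the barcode, which this sketch does not handle.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Assign reads to samples by their leading barcode.

    reads: iterable of (name, sequence, quality).
    Returns (reads_by_sample, unassigned). The barcode is trimmed from the
    sequence and the quality string; the restriction site that follows is kept.
    """
    reads_by_sample = defaultdict(list)
    unassigned = []
    # Try longer barcodes first so that a short barcode that is a prefix
    # of a longer one does not capture the longer one's reads.
    barcodes = sorted(barcode_to_sample, key=len, reverse=True)
    for name, sequence, quality in reads:
        for barcode in barcodes:
            if sequence.startswith(barcode):
                n = len(barcode)
                sample = barcode_to_sample[barcode]
                reads_by_sample[sample].append((name, sequence[n:], quality[n:]))
                break
        else:
            unassigned.append((name, sequence, quality))
    return dict(reads_by_sample), unassigned


# Hypothetical barcode file content and reads.
barcode_to_sample = {"TGCTT": "indiv1", "ATAGAT": "indiv2"}
reads = [
    ("r1", "TGCTTTGCAGCGTG", "I" * 14),
    ("r2", "ATAGATTGCAGCTG", "I" * 14),
    ("r3", "GGGGGTGCAGAAAA", "I" * 14),
]
by_sample, unassigned = demultiplex(reads, barcode_to_sample)
```

After trimming, every assigned read starts with the restriction-site remnant (`TGCAG` in this toy data), which matches the note that barcodes are removed but restriction sites are not.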

18 Applications: Linkage/QTL mapping; Population genomics; Marker discovery; Phylogenetics/geography; Genome assembly

19 Applications: Filtering pipeline applied to raw variants (SNPs, short indels) called on 106 accessions of Musa using the single-end GBS methodology, to obtain highly reliable markers for Genome-Wide Association Studies (GWAS).
Start: raw variants (SNPs, short indels)
1) Remove individuals with missing data > 50 %
2) Discard markers with one or more missing genotypes
3) Remove non-polymorphic markers
4) Keep only biallelic markers
5) Remove markers whose Fis (inbreeding coefficient) score falls outside the normal range of the Gaussian distribution (in our case, below -0.8)
6) Keep markers with minor allele frequency (MAF) 5 %
7) Set to missing the genotypes at positions with read depth < 10
8) Discard markers with > 9 missing genotypes
End: analysis-ready variants
Variant counts shown along the pipeline, in order: 148,108; 46,418; 22,456; 21,769; 5,544.
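
The marker-level filters above can be sketched on a simple genotype encoding. This is an illustration, not the authors' pipeline: genotypes are assumed to be coded as alternate-allele counts (0, 1, 2) or `None` when missing, which already implies biallelic markers; Fis is computed with the usual definition 1 - Hobs/Hexp, which the slide does not spell out; and the MAF threshold is assumed to mean "at least 5 %". All function names are hypothetical.

```python
def mask_low_depth(genotypes, depths, min_depth=10):
    """Set genotypes with read depth below `min_depth` to missing (None)."""
    return [g if (g is not None and d >= min_depth) else None
            for g, d in zip(genotypes, depths)]


def allele_stats(genotypes):
    """Return call count, MAF and Fis for one biallelic marker, or None if no calls."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return None
    p_alt = sum(called) / (2 * n)                  # alternate allele frequency
    maf = min(p_alt, 1 - p_alt)
    h_obs = sum(1 for g in called if g == 1) / n   # observed heterozygosity
    h_exp = 2 * p_alt * (1 - p_alt)                # expected under Hardy-Weinberg
    fis = 1 - h_obs / h_exp if h_exp > 0 else None
    return {"n_called": n, "maf": maf, "fis": fis}


def keep_marker(genotypes, max_missing=0, min_maf=0.05, min_fis=-0.8):
    """Apply the missing-data, polymorphism, MAF and Fis filters to one marker."""
    stats = allele_stats(genotypes)
    if stats is None:
        return False
    if len(genotypes) - stats["n_called"] > max_missing:
        return False
    if stats["maf"] == 0 or stats["maf"] < min_maf:  # monomorphic or too rare
        return False
    return stats["fis"] >= min_fis


# Hypothetical markers across 8 individuals.
balanced = [0, 0, 1, 1, 2, 2, 0, 1]
all_heterozygous = [1] * 8       # Fis = -1, typical of collapsed paralogs
monomorphic = [0] * 8
with_missing = [0, 1, None, 2, 0, 1, 2, 0]

kept = [keep_marker(m) for m in
        (balanced, all_heterozygous, monomorphic, with_missing)]
```

Calling `keep_marker(genotypes, max_missing=9)` after `mask_low_depth` would correspond to the last two steps of the list, where low-depth genotypes are set to missing and markers with too many missing genotypes are then discarded.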

20 Applications: Phylogenetic trees generated with markers from a) GBS (3257 SNPs) and b) RAD sequencing (12880 SNPs) on 11 Musa diploids. Tip labels: banksii, balbisiana, microcarpa, ney poovan (cv), pisang jari buaya (cv), burmannicoïdes, tomolo (cv), burmannica, siamea, zebrina, pisang mas (cv). Genome-group labels shown: BB, AB, AA.

21 Pros: Simple and fast to implement. No need for large quantities of DNA (100 ng/individual). Applicable to any species (with or without a reference). Flexible: more or fewer markers depending on multiplexing and coverage. Existing analysis pipelines.

22 Cons: The bigger the library, the higher the amount of missing data. Polymorphisms in restriction sites. Structural variations between individuals. Heterogeneity in the quality and quantity of DNA. Repeated sequences.

23 Costs (e.g. GBS library, 96 samples)

24 Pipeline TASSEL GBS (Cornell): variant detection pipeline. Tag = unique read sequence.
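
The notion of a tag as a unique read sequence can be illustrated by collapsing identical reads and counting them. This is only a sketch of the idea and not TASSEL's implementation: the function name, the fixed `tag_length`, and the `min_count` filter are assumptions made for the example.

```python
from collections import Counter


def count_tags(sequences, tag_length=64, min_count=1):
    """Trim reads to `tag_length`, collapse identical ones, and count them.

    Reads shorter than `tag_length` are skipped. Returns {tag: read count}
    for tags seen at least `min_count` times.
    """
    tags = Counter(seq[:tag_length] for seq in sequences
                   if len(seq) >= tag_length)
    return {tag: n for tag, n in tags.items() if n >= min_count}


# Hypothetical demultiplexed reads, with a short tag length for readability.
reads = ["TGCAGAAAAC", "TGCAGAAAAG", "TGCAGTTTTT", "TGCAG"]
tag_counts = count_tags(reads, tag_length=8)
```

Working on distinct tags rather than on every read reduces the number of sequences that have to be aligned or compared in later steps.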

25 TASSEL software (v 5.0)

26 Pipeline STACKS

27 Pipeline STACKS: SNP detection; genetic mapping; mini-contig construction (paired data); population genomics (with or without reference)

28 Bibliography
Davey J.W., Hohenlohe P.A., Etter P.D., Boone J.Q., Catchen J.M., Blaxter M.L. (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nature Reviews Genetics 12(7).
Baird N.A., Etter P.D., Atwood T.S., Currey M.C., Shiver A.L., Lewis Z.A., Selker E.U., Cresko W.A., Johnson E.A. (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE 3: e3376.
Bradbury P.J., Zhang Z., Kroon D.E., Casstevens T.M., Ramdoss Y., Buckler E.S. (2007) TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23.
Catchen J., Hohenlohe P., Bassham S., Amores A., Cresko W. Stacks: an analysis tool set for population genomics. Molecular Ecology.
Gharbi K. RAD sequencing: next-generation tools for an old problem (workshop, Rennes, 30/01/2014).