RADseq Data Analysis Workshop 3 February 2017

1 RADseq Data Analysis Workshop 3 February 2017

2 Introduction to Galaxy (thanks to Simon Gladman for slides)

3 What is Galaxy? A web-based scalable workflow platform for genomic analysis Designed for biologists to work with their own data Has an App store for bioinformatics tools Retains histories of analyses Reproducible and sharable analyses

4 What is Galaxy? Analysis History Tools Menu

5 How it works Web interface The bit you interact with Compute Cluster In the Cloud Runs the jobs Web server Head node Sends jobs to the cluster Collates stuff

6 How do I use it? Point your browser at the web interface. Login or register as a user Upload your data Do analysis Interpret results Happy days!

7 What you get.. Your own space to work* Your data saved Your analysis history saved Re-analyse at whim Share data selectively *Not infinite space!

8 Getting data in Four main methods: Upload data from your computer Upload data from remote computer Upload data from a public dataset Import a shared dataset

9 Using tools Tool list on the left Grouped in categories Can be searched Tool interface in centre pane Fill in form with data, parameters Click EXECUTE Output will appear on right Admins can add more tools Tool list Tool interface EXECUTE button

10 Your data / tool output Data list on the right Uploaded/imported data Output from tools Analysis is saved collectively as a History You can have multiple histories You can copy, share and delete histories

11 File Traffic I'm waiting I'm running I'm done

12 Viewing data / tool output View data in centre Click on the eye symbol on a data file Its contents appear in the centre

13 What is Stacks? A set of software tools for processing RAD-seq data

14 Why Stacks in Galaxy? to learn the steps and data formats involved in RAD-seq data processing to avoid command-line programming to easily share Stacks Galaxy workflows or histories with collaborators

15 Let's get started

16 An introduction to RADseq data analysis using the STACKS pipeline! Dr. Sonika Tyagi Bioinformatics Supervisor, AGRF EMBL-ABR Training Coordination

17 Outline Introduction to the Illumina Data Quality Control STACKS workflow Demultiplexing Sample QC

18 Technology key points RAD-Seq is a fractional genome sequencing strategy, designed to interrogate anywhere from 0.1 to 10% of a selected genome. By attaching a series of adapters to the resulting DNA fragments, large numbers of genetic variations such as SNPs can be readily identified from analysis of next-generation DNA sequence data.

19 Illumina Sequencing

20 Sequencing by Synthesis The first sequencing cycle begins by adding four labelled reversible terminators, primers and DNA polymerase Extend by 1 base Extend with A, C, G, T terminators Image determines the base Reverse termination cleavage step removes fluorescent dyes and prepares terminal group for next base Repeat Illumina four-colour cyclic reversible termination (CRT) method

21 Sequencing by Synthesis

22 Illumina 1. Ligate adapters to random DNA fragments 2. Attach DNA to surface of flow cell channel 3. Bridge amplification 4. Double stranded DNA 5. Denature double strand 6. Complete amplification

23 23/02/17 Sequencing

24 Technology key points Reads (FASTQ)

25 Illumina Data Format and Quality Control

26 FASTA format
>Seq_ID XYZ
AGTGTAGCGATGCAT
Multi FASTA
>seq_1
AGCAGCTAGTCA
>seq_2
AGTCAGTC
>seq_3
CACATGCTAGC
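As a minimal sketch of how the multi-FASTA layout above can be read, the following Python splits records on the `>` header lines. The function name `parse_fasta` is hypothetical, and taking the first word after `>` as the ID is an assumption for this illustration.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence ID to its sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: take the first word after '>' as the ID.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the current record
            # (a sequence may span several lines).
            records[current_id] += line
    return records


multi_fasta = """>seq_1
AGCAGCTAGTCA
>seq_2
AGTCAGTC
>seq_3
CACATGCTAGC
"""

records = parse_fasta(multi_fasta)
for seq_id, sequence in records.items():
    print(seq_id, len(sequence))
```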

27 FASTQ format TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCT + efcfffffcfeefffcffffffddf`feed]`]_ba_^ [YBBBBBBBBBBRTT\]][B Multi FASTQ (millions of reads per sample) Compressed: .bz, .gz

28 FASTQ Sequence TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCT + efcfffffcfeefffcffffffddf`feed]`]_ba_^ [YBBBBBBBBBBRTT\]][B

29 FASTQ ID - TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCT + efcfffffcfeefffcffffffddf`feed]`]_ba_^ [YBBBBBBBBBBRTT\]][B

30 FASTQ Quality TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCT + efcfffffcfeefffcffffffddf`feed]`]_ba_^ [YBBBBBBBBBBRTT\]][B
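The three parts highlighted on these slides (ID, sequence, quality) map onto the standard four-line FASTQ record: an `@` header, the sequence, a `+` separator and the quality string. The sketch below reads such records one at a time; `read_fastq` is a hypothetical helper, and the two toy reads are invented for illustration because the ID line is not shown in the transcript.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from an open FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Toy input (invented reads, not workshop data).
example = io.StringIO(
    "@read_1\nTTAATTGGTA\n+\nefcfffffcf\n"
    "@read_2\nAGATNTTACC\n+\nddf]BBBBBB\n"
)
reads = list(read_fastq(example))
print(len(reads), "reads parsed")
```

For a gzip-compressed file, `gzip.open(path, "rt")` returns a text handle that can be passed to the same function.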

31 TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCT + efcfffffcfeefffcffffffddf`feed]`]_ba_^ [YBBBBBBBBBBRTT\]][B Each base has an associated Phred quality score (0-40)

32 TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCT + efcfffffcfeefffcffffffddf`feed]`]_ba_^ [YBBBBBBBBBBRTT\]][B Each base has an associated Phred quality score (0-40)

33 Phred Scores and Error Probability Phred Quality Score Error Probability Accuracy of the base call (%)
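The relationship behind this table is the standard Phred definition, Q = -10 log10(P), where P is the probability that the base call is wrong. A short sketch of the conversion in both directions (function names are hypothetical):

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    accuracy = (1 - p) * 100
    print(f"Q{q}: error probability {p:g}, accuracy {accuracy:g}%")
```

Q20 corresponds to an error probability of 1 in 100, which is the 99% accuracy threshold quoted later for the FastQC box plot.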

34 TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCT + efcfffffcfeefffcffffffddf`feed]`]_ba_^ [YBBBBBBBBBBRTT\]][B Each base has an associated Phred quality score (0-40) Represented by a single printable ASCII character !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh Printable ASCII characters

35 TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCT + efcfffffcfeefffcffffffddf`feed]`]_ba_^ [YBBBBBBBBBBRTT\]][B Each base has an associated Phred quality score (0-40) Represented by a single printable ASCII character Decimal values !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh
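Decoding works by taking each character's decimal ASCII value and subtracting the encoding offset. A minimal sketch (the helper name `decode_quality` is hypothetical):

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality_string]


# With a +33 offset, '!' is Q0 and 'I' is Q40.
print(decode_quality("!I", offset=33))

# With a +64 offset, 'e' (ASCII 101) is Q37 and 'B' (ASCII 66) is Q2.
print(decode_quality("efcfB", offset=64))
```

The lower-case letters in the example read on this slide have ASCII values above 96, which would give scores over 60 with a +33 offset, so that read looks consistent with a +64 encoding.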

36 FASTQ Encoding/Offset Sanger (Phred+33) Illumina 1.8+ (Phred+33) Solexa (Phred+64) Illumina 1.3+ (Phred+64) Illumina 1.5+ (Phred+64)

37 FASTQ Encoding/Offset Sanger (Phred+33) Illumina 1.8+ (Phred+33) Problem area Solexa (Phred+64) Illumina 1.3+ (Phred+64) Illumina 1.5+ (Phred+64)
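Because the same characters mean different scores under the two offsets, it helps to check which encoding a file uses before any quality filtering. The sketch below is a common heuristic, not a definitive test: characters below ASCII 59 can only occur with a +33 offset, while characters above ASCII 74 ('J') would exceed Q41 under +33 and therefore suggest +64. The function name and the thresholds as coded are assumptions for illustration.

```python
def guess_phred_offset(quality_strings):
    """Guess the quality offset (33 or 64) from observed characters.

    Returns None when the observed range fits both encodings.
    """
    codes = [ord(c) for quality in quality_strings for c in quality]
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Too low for any +64 encoding (Solexa+64 starts at ASCII 59).
        return 33
    if highest > 74:
        # Above 'J': would be more than Q41 under +33.
        return 64
    return None


print(guess_phred_offset(["efcfffffcf", "BBBBRTT"]))
print(guess_phred_offset(["IIII#5"]))
```

In practice many reads should be sampled, since a handful of high-quality reads may fall entirely in the ambiguous range.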

38 Quality score visualisation (box plot) Q20 X axis: base position Y axis: quality score (0-40) Q20 = 99% accuracy FastQC:

39 Check Quality Scores FastQC: http://

40 Read Trimming and Adapter clipping 1) Fixed length trimming Remove all bases before/after a given base Quick, computationally easy Throws away good data and shortens reads 2) Quality-based trimming Remove bases following a "poor" quality base Retains more good data Throws away good data if there is a single spurious "poor" quality base
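The two trimming strategies can be sketched in a few lines. This is an illustration under stated assumptions (hypothetical function names, a Q20 threshold, a +33 offset), not the behaviour of any particular trimming tool.

```python
def trim_fixed(sequence, quality, start=0, end=None):
    """Fixed-length trimming: keep only bases in positions start..end."""
    return sequence[start:end], quality[start:end]


def trim_quality(sequence, quality, threshold=20, offset=33):
    """Quality-based trimming: cut at the first base scoring below threshold."""
    for position, character in enumerate(quality):
        if ord(character) - offset < threshold:
            return sequence[:position], quality[:position]
    return sequence, quality


sequence = "ACGTACGTAC"
quality = "IIII#IIIII"  # one spurious low-quality base ('#' = Q2) at position 5

print(trim_fixed(sequence, quality, end=8))
print(trim_quality(sequence, quality))
```

The second call shows the drawback listed on the slide: a single poor base in the middle of the read causes the five good bases after it to be discarded as well.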

41 Main Bioinformatics pipelines
STACKS Website: http://catchenlab.life.illinois.edu/stacks/ mbRAD, ddRAD, ezRAD & 2bRAD? STACKS does not handle indels, so any locus near an indel is lost. STACKS does not call SNPs from paired-end reads natively, and does especially poorly with paired-end fragments that are not of a random length (e.g., ddRAD and ezRAD).
dDocent Website: https://ddocent.wordpress.com/ddocent-pipeline-user-guide/ ddRAD & ezRAD
PyRAD Website: http://dereneaton.com/software/pyrad/ mbRAD, ddRAD, PE-ddRAD, GBS, PE-GBS, EzRAD, PE-EzRAD, 2B-RAD; uses an alignment-clustering method (vsearch)
2bRAD (Wang et al 2012) de novo: https://github.com/z0on/2brad_denovo With reference genome: https://github.com/z0on/2brad_gatk 2bRAD

42 Stacks workflow Step 1: Assessing the overall quality of the data. Fixed-length trimming may be needed. Step 2: process_radtags Demultiplexing: assigning reads to individual samples using the unique barcodes in the reads. Required clean-up is done as part of this process.

43 Stacks workflow Step 2: process_radtags 1. Sample barcodes are not Illumina indexes; they are included in the sequence, which requires inline-barcode demultiplexing. (Note: PE reads only have the barcode in the 1st read.) 2. All reads are expected to have the restriction enzyme site signature after the barcode sequence. 3. Barcode length variation may need a separate instance for each barcode size.
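The logic of inline-barcode demultiplexing can be sketched as follows. This is an illustration of the idea only, not how the Stacks demultiplexer is implemented: it requires an exact barcode match followed immediately by the restriction-site remnant. The function name, the toy reads and the cut-site sequence `TGCAG` are all placeholders; the two barcodes and sample names are taken from the workshop sample sheet.

```python
def demultiplex(reads, barcodes, cut_site):
    """Assign reads to samples by inline barcode.

    reads    -- iterable of (read_id, sequence, quality)
    barcodes -- dict mapping barcode sequence to sample name
    cut_site -- restriction-site remnant expected right after the barcode
    """
    by_sample = {sample: [] for sample in barcodes.values()}
    discarded = []
    # Try longer barcodes first so a short barcode cannot shadow a long one.
    ordered = sorted(barcodes, key=len, reverse=True)
    for read_id, sequence, quality in reads:
        for barcode in ordered:
            n = len(barcode)
            if sequence.startswith(barcode) and sequence.startswith(cut_site, n):
                # Strip the barcode; keep the cut site as part of the read.
                by_sample[barcodes[barcode]].append(
                    (read_id, sequence[n:], quality[n:])
                )
                break
        else:
            discarded.append(read_id)
    return by_sample, discarded


barcodes = {"AACT": "20000_ddRAD_AGRF", "CCTA": "20001_ddRAD_AGRF"}
reads = [
    ("r1", "AACTTGCAGGATTACA", "IIIIIIIIIIIIIIII"),
    ("r2", "CCTATGCAGCCGT", "IIIIIIIIIIIII"),
    ("r3", "GGGGTGCAGAAAA", "IIIIIIIIIIIII"),  # unknown barcode
    ("r4", "AACTAAAAATTTT", "IIIIIIIIIIIII"),  # barcode but no cut site
]
by_sample, discarded = demultiplex(reads, barcodes, cut_site="TGCAG")
print({sample: len(found) for sample, found in by_sample.items()}, discarded)
```

Sorting barcodes by length is one way to cope with mixed barcode sizes in a single pass; the slide's suggestion of a separate run per barcode length is the alternative.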

44 Example data from 131N
Raw Data: lib-1_abcdeffxx_gccaat_l1_r1.fastq.gz
SampleSheet:
Barcode  Sample-ID         UsedForTagCatalogue
AACT     20000_ddRAD_AGRF  Y
CCTA     20001_ddRAD_AGRF  Y
TTAC     28368_ddRAD_AGRF  Y
AGGC     28369_ddRAD_AGRF  Y
GCAT     28370_ddRAD_AGRF  Y
TAGA     28371_ddRAD_AGRF  Y
Demultiplexed fastq: 20000_ddRAD_AGRF_AACT.fq.gz, 20001_ddRAD_AGRF_CCTA.fq.gz
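A small sketch of how a sample sheet like this one could be turned into the demultiplexed file names shown on the slide (`<Sample-ID>_<Barcode>.fq.gz`). The whitespace-delimited layout and the helper names are assumptions; a real sample sheet may be tab- or comma-separated.

```python
SAMPLE_SHEET = """Barcode Sample-ID UsedForTagCatalogue
AACT 20000_ddRAD_AGRF Y
CCTA 20001_ddRAD_AGRF Y
TTAC 28368_ddRAD_AGRF Y
AGGC 28369_ddRAD_AGRF Y
GCAT 28370_ddRAD_AGRF Y
TAGA 28371_ddRAD_AGRF Y
"""


def parse_sample_sheet(text):
    """Return one dict per sample, keyed by the column names in the header."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def output_name(row):
    """File name for a sample's demultiplexed reads."""
    return f"{row['Sample-ID']}_{row['Barcode']}.fq.gz"


samples = parse_sample_sheet(SAMPLE_SHEET)
names = [output_name(row) for row in samples]
for name in names:
    print(name)
```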

45 Stacks workflow: Step 2: process_radtags (QC) Total number of reads Number of tags (stacks) per sample Number of common tags (typed across all samples) Average depth (reads NN) per tags
Table columns: N Tags in Catalogue, Average Tag Depth, SD of Depth, Sample Tags, Total Reads
,168 51, ,824 61, ,820 59, ,725 29, ,351 36, ,250 38,

46 QC: demultiplexed stats

47 QC: Yields by barcode length

48 Analysis of GBS reporting 1. (QC) Total number of reads Number of tags (stacks) per sample Number of common tags (typed across all samples) Average depth (reads NN) per tags 2. Results with variant calls Consensus sequence Number of samples genotyped Observed alleles Individual genotypes Allele coverage Other things in third-party tool formats: VCF, structure, phylip, etc.

49 Talk part 2

50 Analysis of GBS reporting 1. (QC) Total number of reads Number of tags (stacks) per sample Number of common tags (typed across all samples) Average depth (reads NN) per tags 2. Results with variant calls Consensus sequence Number of samples genotyped Observed alleles Individual genotypes Allele coverage Other things in third-party tool formats: VCF, structure, phylip, etc.

51 Acknowledgements AGRF NGS lab: Jafar Jabbari, Matthew Tinning and team AGRF Genotyping Lab: Melinda Zino and team EMBL-ABR: Vicky Schneider, Phillipa Griffin