RADseq Data Analysis Workshop 3 February 2017


Introduction to Galaxy (thanks to Simon Gladman for slides)

What is Galaxy?
- A web-based, scalable workflow platform for genomic analysis
- Designed for biologists to work with their own data
- Has an "app store" for bioinformatics tools
- Retains histories of analyses
- Reproducible and sharable analyses

What is Galaxy? [Screenshot of the Galaxy interface, labelled: Tools Menu, Analysis, History]

How it works
- Web interface: the bit you interact with
- Compute cluster (in the cloud): runs the jobs
- Web server (head node): sends jobs to the cluster and collates the results

How do I use it?
- Point your browser at the web interface
- Log in or register as a user
- Upload your data
- Do analysis
- Interpret results
- Happy days!

What you get
- Your own space to work*
- Your data saved
- Your analysis history saved
- Re-analyse at whim
- Share data selectively
*Not infinite space!

Getting data in
Four main methods:
- Upload data from your computer
- Upload data from a remote computer
- Upload data from a public dataset
- Import a shared dataset

Using tools
- Tool list on the left
  - Grouped in categories
  - Can be searched
- Tool interface in the centre pane
  - Fill in the form with data and parameters
  - Click EXECUTE
- Output will appear on the right
- Admins can add more tools
[Screenshot labels: tool list, tool interface, EXECUTE button]

Your data / tool output
- Data list on the right
  - Uploaded/imported data
  - Output from tools
- Analysis is saved collectively as a History
- You can have multiple histories
- You can copy, share and delete histories

File Traffic
- I'm waiting
- I'm running
- I'm done

Viewing data / tool output
- View data in the centre
- Click on the eye symbol on a data file
- Its contents appear in the centre

What is Stacks? A set of software tools for processing RAD-seq data: http://catchenlab.life.illinois.edu/stacks/

Why Stacks in Galaxy?
- To learn the steps and data formats involved in RAD-seq data processing
- To avoid command-line programming
- To easily share Stacks Galaxy workflows or histories with collaborators

Let's get started

An introduction to RADseq data analysis using the STACKS pipeline
Dr. Sonika Tyagi
Bioinformatics Supervisor, AGRF
EMBL-ABR Training Coordination

Outline
- Introduction to the Illumina data
- Quality control
- STACKS workflow
- Demultiplexing
- Sample QC

Technology key points
RAD-Seq is a fractional genome sequencing strategy, designed to interrogate anywhere from 0.1 to 10% of a selected genome. By attaching a series of adapters to the resulting DNA fragments, large numbers of genetic variations such as SNPs can be readily identified from analysis of next-generation DNA sequence data.

Illumina Sequencing

Sequencing by Synthesis
- The first sequencing cycle begins by adding four labelled reversible terminators, primers and DNA polymerase
- Extend by 1 base: extend with A, C, G, T terminators
- Image determines the base
- Reverse termination: the cleavage step removes the fluorescent dyes and prepares the terminal group for the next base
- Repeat
Illumina four-colour cyclic reversible termination (CRT) method

Sequencing by Synthesis

Illumina
1. Ligate adapters to random DNA fragments
2. Attach DNA to the surface of the flow cell channel
3. Bridge amplification
4. Double-stranded DNA
5. Denature the double strand
6. Complete amplification

23/02/17 Sequencing

Technology key points: Reads (FASTQ)

Illumina Data Format and Quality Control

FASTA format

    >Seq_ID XYZ
    AGTGTAGCGATGCAT

Multi FASTA

    >seq_1
    AGCAGCTAGTCA
    >seq_2
    AGTCAGTC
    >seq_3
    CACATGCTAGC
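To make the layout concrete, here is a minimal Python sketch that reads multi-FASTA text into a dictionary. The function name `parse_fasta` is an illustrative choice, not part of any tool mentioned in these slides, and the sketch assumes that the first word after `>` is the sequence ID.

```python
def parse_fasta(text):
    """Return {sequence_id: sequence} from FASTA-formatted text."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: take the first word after '>' as the ID (assumption)
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: may be wrapped over several lines, so append
            records[current_id] += line
    return records


multi_fasta = """>seq_1
AGCAGCTAGTCA
>seq_2
AGTCAGTC
>seq_3
CACATGCTAGC"""

records = parse_fasta(multi_fasta)
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```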

FASTQ format

    @HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
    TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCT
    +
    efcfffffcfeefffcffffffddf`feed]`]_ba_^ [YBBBBBBBBBBRTT\]][B

- Line 1: ID
- Line 2: sequence string
- Line 3: "+", optionally followed by the ID again
- Line 4: quality string
- Each base has an associated Phred quality score (0-40)
- Multi FASTQ: millions of reads per sample
- Compressed: .bz, .gz
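The four-line record structure can be sketched as a small Python reader. This is an illustration only: `read_fastq` and the two short reads are hypothetical, and the sketch assumes the sequence and quality strings are not wrapped over several lines.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from an open FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], sequence, quality


# Two made-up reads; the second repeats the optional ID on the '+' line
example = io.StringIO(
    "@read_1\nTTAATTGG\n+\nefcfffff\n"
    "@read_2\nACGTNNAC\n+read_2\nffddBBBB\n"
)
fastq_records = list(read_fastq(example))
print(fastq_records)
```

For a gzip-compressed file, the same function could be given a handle opened with `gzip.open(path, "rt")`.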


Phred Scores and Error Probability

    Phred Quality Score   Error Probability   Accuracy of the base call (%)
    10                    0.1                 90
    20                    0.01                99
    30                    0.001               99.9
    40                    0.0001              99.99
    50                    0.00001             99.999
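The table follows the standard Phred relationship Q = -10 * log10(P), where P is the probability that the base call is wrong. A short Python sketch (function names are illustrative) reproduces the rows:

```python
import math


def phred_to_error_probability(q):
    """Error probability P for a Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score Q for an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    accuracy = 100 * (1 - p)
    print(f"Q{q}: error probability {p:g}, accuracy {accuracy:g}%")
```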

FASTQ quality encoding
- Each base has an associated Phred quality score (0-40)
- Each score is represented by a single printable ASCII character
- Decimal values of the printable ASCII characters shown: 33 (!), 64 (@), 104 (h)

    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh

FASTQ Encoding/Offset
Decimal values: 33 (!), 64 (@), 104 (h)

    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh

- Phred+33: Sanger, Illumina 1.8+
- Phred+64: Solexa, Illumina 1.3+, Illumina 1.5+
- Problem area: the character range where the Phred+33 and Phred+64 encodings overlap

http://shop.alterlinks.com/ascii-table/ascii-table-us.php
http://en.wikipedia.org/wiki/fastq_format#encoding
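A hedged sketch of how the offset affects decoding: each quality character is converted to a score by subtracting the offset from its ASCII value. The `guess_offset` heuristic below is an assumption for illustration (characters below 59 can only be Phred+33; characters above 74 can only be Phred+64) and returns `None` when the characters fall entirely inside the overlap. The quality strings are made up.

```python
def guess_offset(quality_strings):
    """Guess the encoding offset (33 or 64) from observed quality characters.

    Returns None if every character lies in the range shared by both encodings.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if min(codes) < 59:   # below ';' is impossible under a +64 encoding
        return 33
    if max(codes) > 74:   # above 'J' is impossible for Phred+33 scores up to 41
        return 64
    return None


def decode_quality(quality, offset):
    """Convert a quality string to a list of integer scores."""
    return [ord(ch) - offset for ch in quality]


old_style = "efcfffffBB"   # hypothetical Phred+64 string
new_style = "IIII#5"       # hypothetical Phred+33 string

print(guess_offset([old_style]), decode_quality(old_style, 64))
print(guess_offset([new_style]), decode_quality(new_style, 33))
```

Decoding with the wrong offset shifts every score by 31, which is why downstream tools need to be told (or must detect) the encoding.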

Quality score visualisation (box plot)
- X axis: base position
- Y axis: quality score (0-40)
- Q20 = 99% accuracy
FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Check Quality Scores
FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Read Trimming and Adapter clipping
1) Fixed-length trimming
- Remove all bases before/after a given base
- Quick, computationally easy
- Throws away good data and shortens reads
2) Quality-based trimming
- Remove bases following a "poor" quality base
- Retains more good data
- Throws away good data if there is a single spurious "poor" quality base
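The two strategies can be compared with a minimal Python sketch. The function names, the Q20 threshold, and the Phred+33 offset are assumptions for illustration; the quality-based version cuts the read at the first base below the threshold, removing that base and everything after it.

```python
def fixed_length_trim(sequence, quality, keep):
    """Keep only the first `keep` bases, regardless of quality."""
    return sequence[:keep], quality[:keep]


def quality_trim(sequence, quality, threshold=20, offset=33):
    """Cut the read at the first base whose score is below `threshold`."""
    for i, ch in enumerate(quality):
        if ord(ch) - offset < threshold:
            return sequence[:i], quality[:i]
    return sequence, quality


# Hypothetical read: all bases are Q40 ('I') except one spurious Q2 ('#')
seq = "ACGTACGTAC"
qual = "IIIII#IIII"

print(fixed_length_trim(seq, qual, 8))   # keeps 8 bases, including the poor one
print(quality_trim(seq, qual))           # stops at the single poor base
```

The example read shows the drawback noted above: a single poor base in the middle of the read causes the four good bases after it to be discarded.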

Main Bioinformatics pipelines
- STACKS
  - Website: http://catchenlab.life.illinois.edu/stacks/
  - mbRAD, ddRAD, ezRAD & 2bRAD?
  - STACKS does not handle indels, so any locus near an indel is lost
  - STACKS does not call SNPs from paired-end reads natively, and does especially poorly with paired-end fragments that are not of a random length (e.g., ddRAD and ezRAD)
- dDocent
  - Website: https://ddocent.wordpress.com/ddocent-pipeline-user-guide/
  - ddRAD & ezRAD
- PyRAD
  - Website: http://dereneaton.com/software/pyrad/
  - mbRAD, ddRAD, PE-ddRAD, GBS, PE-GBS, EzRAD, PE-EzRAD, 2B-RAD
  - Use of an alignment-clustering method (vsearch)
- 2bRAD (Wang et al 2012)
  - De novo: https://github.com/z0on/2brad_denovo
  - With reference genome: https://github.com/z0on/2brad_gatk
  - 2bRAD

Stacks workflow
- Step 1: Assessing the overall quality of the data. Fixed-length trimming may be needed.
- Step 2: process_radtags. Demultiplexing: assigning reads to individual samples using the unique barcodes in the reads. Required clean-up is done as part of this process.

Stacks workflow, Step 2: process_radtags
1. Sample barcodes are not Illumina indexes. They are included in the sequence, which requires inline-barcode demultiplexing. (Note: PE reads only have the barcode in the 1st read.)
2. All reads are expected to have the restriction enzyme site signature after the barcode sequence.
3. Barcode length variation may need a separate instance for each barcode size.
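The idea of inline-barcode demultiplexing described in points 1 and 2 can be sketched in a few lines of Python. This illustrates the concept only and is not how process_radtags is implemented: the function `demultiplex`, the reads, and the enzyme site `TGCAG` are hypothetical, and only exact barcode matches are accepted.

```python
def demultiplex(reads, barcodes, enzyme_site):
    """Assign reads to samples by inline barcode.

    reads:       iterable of (read_id, sequence, quality)
    barcodes:    {barcode: sample_id}
    enzyme_site: bases expected immediately after the barcode (hypothetical)
    Returns ({sample_id: [trimmed reads]}, [discarded read IDs]).
    """
    by_sample = {sample: [] for sample in barcodes.values()}
    discarded = []
    for read_id, seq, qual in reads:
        for barcode, sample in barcodes.items():
            n = len(barcode)
            # The read must start with the barcode, followed by the enzyme site
            if seq.startswith(barcode) and seq.startswith(enzyme_site, n):
                # Strip the barcode from both sequence and quality strings
                by_sample[sample].append((read_id, seq[n:], qual[n:]))
                break
        else:
            discarded.append(read_id)
    return by_sample, discarded


barcodes = {"AACT": "20000_ddRAD_AGRF", "CCTA": "20001_ddRAD_AGRF"}
reads = [
    ("r1", "AACTTGCAGGATTACA", "I" * 16),  # barcode AACT + site
    ("r2", "CCTATGCAGCCGTA", "I" * 14),    # barcode CCTA + site
    ("r3", "GGGGTGCAGAAAA", "I" * 13),     # unknown barcode
    ("r4", "AACTAAAAAGATT", "I" * 13),     # barcode but no enzyme site
]
by_sample, discarded = demultiplex(reads, barcodes, enzyme_site="TGCAG")
print(by_sample)
print(discarded)
```

Because each barcode is tested with its own length, the sketch copes with mixed barcode lengths in one pass; point 3 above notes that the actual tool may instead need a separate run per barcode size.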

Example data from 131N
Raw Data: lib-1_abcdeffxx_gccaat_l1_r1.fastq.gz
SampleSheet:

    Barcode   Sample-ID          UsedForTagCatalogue
    AACT      20000_ddRAD_AGRF   Y
    CCTA      20001_ddRAD_AGRF   Y
    TTAC      28368_ddRAD_AGRF   Y
    AGGC      28369_ddRAD_AGRF   Y
    GCAT      28370_ddRAD_AGRF   Y
    TAGA      28371_ddRAD_AGRF   Y

Demultiplexed fastq: 20000_ddRAD_AGRF_AACT.fq.gz, 20001_ddRAD_AGRF_CCTA.fq.gz

Stacks workflow, Step 2: process_radtags
Demultiplexing statistics (QC)
- Total number of reads
- Number of tags (stacks) per sample
- Number of common tags (typed across all samples)
- Average depth (reads NN) per tag

    N    Sample   Tags in Catalogue   Tags     Average Tag Depth   SD of Depth   Total Reads
    10   21484    43,168              51,545   84.8                598.4         5196723
    11   21485    51,824              61,791   85.3                737.1         6336018
    12   21486    49,820              59,682   94.5                906.8         6658777
    13   21488    28,725              29,371   22                  400.2         773500
    14   21489    35,351              36,152   21.3                138.3         905596
    15   21490    37,250              38,368   27.4                206.9         1208992

QC: demultiplexed stats

QC: Yields by barcode length

Analysis of GBS reporting
1. Demultiplexing statistics (QC)
- Total number of reads
- Number of tags (stacks) per sample
- Number of common tags (typed across all samples)
- Average depth (reads NN) per tag
2. Results with variant calls
- Consensus sequence
- Number of samples genotyped
- Observed alleles
- Individual genotypes
- Allele coverage
- Other things in third-party tool formats: VCF, structure, phylip, etc.

Talk part 2


Acknowledgements
- AGRF NGS lab: Jafar Jabbari, Matthew Tinning and team
- AGRF Genotyping Lab: Melinda Zino and team
- EMBL-ABR: Vicky Schneider, Phillipa Griffin