Quality control for Sequencing Experiments

Size: px
Start display at page:

Download "Quality control for Sequencing Experiments"

Transcription

1 Quality control for Sequencing Experiments v Simon Andrews

2 Support service for bioinformatics Academic Babraham Institute Commercial Consultancy Support BI Sequencing Facility MiSeq/ HiSeq/ NextSeq-based sequencing service Data Management / Processing / Analysis

3 Interests in QC Developed QC for in-house sequencing Developed QC packages FastQC BamQC FastQ Screen Developed application specific QC Bismark (bisulphite methylation) HiCUP (Hi-C genome structure) Developed data visualisation QC SeqMonk (generic sequencing visualisation / analysis) RNA-Seq QC Small RNA QC Duplication QC

4 Course Structure How does Illumina Sequencing work What QC metrics can we work with How do sequencing experiments go wrong Technical problems in the sequencing process Conceptual Problems in design, analysis, interpretation

5 How Does Illumina Sequencing Work?

6

7 Sequencing by Synthesis (SBS) G A T T C A G

8 Solexa Sequencing C G A T T C A G

9 Solexa Sequencing C G A T T C A G

10 Solexa Sequencing C T G A T T C A G

11 Solexa Sequencing C T G A T T C A G

12 Solexa Sequencing C T A G A T T C A G

13 Solexa Sequencing C T A G A T T C A G

14 Solexa Sequencing C T A A G A T T C A G

15 Solexa Sequencing C T A A G A T T C A G

16 Solexa Sequencing C T A A G G A T T C A G

17 Solexa Sequencing C T A A G G A T T C A G

18 Solexa Sequencing C T A A G T G A T T C A G

19 Solexa Sequencing C T A A G T G A T T C A G

20 Solexa Sequencing C T A A G T C G A T T C A G

21 Microscope Fridge Pumps

22

23 Real Illumina Sequence Data

24 Creating Clusters Single molecule attaches randomly* Bridge amplification amplifies* Cluster of identical (ish) molecules created *Newer sequences (x10 and HiSeq4000 use bead guided positioning and isothermal expansion)

25 Good and Bad things about clusters Good Generates large signal Is robust to random mistakes Needs a small amount of starting material Bad Bridging limits length Molecules in a cluster get out of sync 2 bases added No bases added Reaction stalls Can get mixed signals if clusters overlap

26 Different sequencers, same chemistry Sequencer Number of lanes Reads per lane Max read length Dyes iseq 1 ~4 million 150bp 2 MiniSeq 1 ~7 million 150bp 2 MiSeq 1 ~20 million 300bp 4 NextSeq million 150bp 2 HiSeq 2xxx 16 ~200 million 150bp 4 HiSeq 4xxx 16 ~300 million 150bp 4 NovaSeq 8 ~2.5 billion 150bp 2

27 FastQ Format 1:N:0: TCGATAATACCGTTTTTTTCCGTTTGATGTTGATACCATT + 1:N:0: TATCTGTAGATTTCACAGACTCAAATGTAAATATGCAGAG + 1:N:0: GGAGGAAGTATCACTTCCTTGCCTGCCTCCTCTGGGGCCT + 1:N:0: GCCGGGCACAGGGGCTCATGCCTGTAATCCCAGCACTTTG + 1:N:0: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGT + 1:N:0: TGGAACCAAGAAAAAGAGGAAATGAGTGGAGACGTGGTGG + 1:N:0: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGT + GIHIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIII

28 A single FastQ 1:N:0: GGAGGAAGTATCACTTCCTTGCCTGCCTCCTCTGGGGCCT + :GBGGGGGGGGGDGGDEDGGDGGGGDHHDHGHHGBGG:GG 1. Header - starts 1. Must start with a unique identifier (up to first space) 2. Can often have substructure 2. Base calls (can include N or IUPAC codes) 3. Mid-line - starts with + usually empty 4. Quality scores 1. Per base signal:noise assessment 2. ASCII encoded Phred score

29 Illumina Header 1:N:0: Starts (required by fastq spec) Instrument ID (HWUSI-EAS611) Run number (34) Flowcell ID (6669YAAXX) Lane (5) Tile (1) X-position (5069) Y-position (1159) [space] Read number (1) Was filtered (Y/N) (N) - You wouldn't normally see the Ys Control number (0 = no control) Sample number (only if demultiplexed using Illumina's software)

30 Phred Scores Start from (p) - the probability that the reported call is incorrect Initial transformation to a Phred score - positive integer from floating point Phred = -10 * (int)log 10 (p) p=0.1 Phred = 10 p=0.01 Phred = 20 p=0.001 Phred = 30

31 Phred Score Encoding Translation of Phred score to single ASCII letter Based on standard ASCII table Can't translate directly as low values are nonprinting Two original standards Illumina = Phred+64 Sanger = Phred+33 All current data and all public repository data is (should be) Sanger encoded.

32 Phred score encoding :GBGGGGGGGGGDGGDEDGGDGGGGDHHDHGHHGBGG:GG : = ASCII 58 Phred33 encoding so Phred = 25 p = 10^(25/-10) p = 0.003

33 Phred score encoding :GBGGGGGGGGGDGGDEDGGDGGGGDHHDHGHHGBGG:GG G = ASCII 71 Phred33 encoding so Phred = 38 p = 10^(38/-10) p = 1.6e-4

34 Sequencing Library Structure Insert Barcode Adapter Adapter Barcode Primer Read 1 Adapter Insert Read 2 Primer Insert Adapter Barcode Read

35 Quality Control

36 What is the point of QC Your experiment is fragile Treat Cells Trim Adaptor Harvest Cells Map to Reference Extract RNA Quantitate Genes Make Library Run statistical test Send to Sequencing Service Find enriched functionality Run on Sequencer Report

37 Types of QC Visual inspections of cells / tissue Properties of DNA or Libraries (bioanalyser) Properties of raw sequence Properties of mapped sequence Properties of quantifications Behaviour of replicates Behaviour of statistical hits

38 Types of QC Visual inspections of cells / tissue Properties of DNA or Libraries (bioanalyser) Properties of raw sequence Properties of mapped sequence Properties of quantifications Behaviour of replicates Behaviour of statistical hits

39 What is the point of QC? Technical problems don't cause pipelines to fail Technical problems don't prevent hits being generated Technical hits often look biologically real Unexpected, interesting effects can easily be missed Finding problems through follow-on work is slow and expensive! QC saves you time and effort! (and money)

40 An example PC2 PC1

41 Genes for PC1 (85 total) Gene Description Arhgef4 Rho guanine nucleotide exchange factor (GEF) 4 Cflar CASP8 and FADD-like apoptosis regulator Als2 amyotrophic lateral sclerosis 2 (juvenile) homolog (human) Cxcr2 chemokine (C-X-C motif) receptor 2 Col4a3 collagen, type IV, alpha 3 Sag retinal S-antigen Gpr35 G protein-coupled receptor 35 Acmsd amino carboxymuconate semialdehyde decarboxylase Qsox1 quiescin Q6 sulfhydryl oxidase O13Rik RIKEN cdna O13 gene Mrps14 mitochondrial ribosomal protein S14 Scyl3 SCY1-like 3 (S. cerevisiae) Ildr2 immunoglobulin-like domain containing receptor 2 Atp1a2 ATPase, Na+/K+ transporting, alpha 2 polypeptide Slamf8 SLAM family member 8 Wdr38 WD repeat domain 38 Exd1 exonuclease 3'-5' domain containing 1 Serf2 small EDRK-rich factor 2

42 Coverage of Raw Data Normal Gene PCA Gene

43 Using a different read mapper

44 Using a different read mapper

45 What we're going to do Look at what software is available Look at the reports you can generate and how to interpret them Look at ways in which we have seen experiments fail