Read Quality Assessment & Improvement. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016


QA&I should be interactive

Error modes
Each technology has unique error modes, depending on the physico-chemical processes involved in the whole sequencing life cycle (not just the base-calling step). Improving reads will work better if the assumptions made by the remediation tools match the source(s) of error. How do you know? Trial and error? QA&I is experimental, just like bench science.

Illumina read problems
- Contaminating sequence within reads: adapters, adapter dimers
- Poor quality and/or wrong sequence: substitution, insertion / deletion ("indel") errors
- Sample contamination
- Chimerism in library
- Sampling bias

Illumina errors
Illumina errors are biased: they occur after some sequence motifs (not well addressed by any current tools, IMO) and predominantly at the 3'-ends of reads. Polymerase errors explain isolated errors, but the 3' bias is less intuitive.

Illumina - 3'-end errors
[Figure sequence: a cluster of molecules on the glass substrate, each extended by one base per cycle]
Add sequencing primers:  5'-CTCTTCCGATCT
Cycle 1:                 5'-CTCTTCCGATCTC
Cycle 2:                 5'-CTCTTCCGATCTCT
Cycle 3:                 5'-CTCTTCCGATCTCTC
Later in the run most molecules are still in phase, but some are not:
5'-CTCTTCCGATCTCTCTGCGCTTGAGAG    in phase
5'-CTCTTCCGATCTCTCTGCGCTTGAGAGA   pre-phasing (+1)
5'-CTCTTCCGATCTCTCTGCGCTTGAGA     post-phasing (-1)

Illumina - 3'-end errors
[Figure: number of molecules at each true cycle offset (pre- / post-phasing events, offsets -2 to +3) for bases A, T, C, G, shown by cycle. The stochastic variability in offset is labeled "Process Error".]
[Figure: signal intensity for bases A, T, C, G at each cycle offset, labeled "Measurement Error".]
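The build-up of pre- / post-phasing over a run can be sketched with a toy model. This is only an illustration: the function name and the per-cycle rates (`p_post`, `p_pre`) are assumptions chosen for the example, not measured Illumina values, and molecules are assumed to drift independently by at most one base per cycle.

```python
from collections import defaultdict

def offset_distribution(n_cycles, p_post=0.005, p_pre=0.005):
    """Fraction of molecules at each true cycle offset after n_cycles.

    Toy model (assumption): in every cycle each molecule independently
    falls one base behind with probability p_post (post-phasing), gets one
    base ahead with probability p_pre (pre-phasing), or stays in step.
    """
    dist = {0: 1.0}  # before cycle 1 every molecule is in phase
    for _ in range(n_cycles):
        nxt = defaultdict(float)
        for offset, frac in dist.items():
            nxt[offset - 1] += frac * p_post
            nxt[offset + 1] += frac * p_pre
            nxt[offset] += frac * (1.0 - p_post - p_pre)
        dist = dict(nxt)
    return dist

for cycle in (1, 25, 50, 75, 100):
    in_phase = offset_distribution(cycle).get(0, 0.0)
    print(f"cycle {cycle:3d}: {in_phase:.1%} of molecules in phase")
```

Under these assumptions the in-phase fraction only shrinks as cycles accumulate, so the signal from a cluster gets progressively more mixed toward the 3'-end of the read.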

Illumina - error rates
Overall Illumina error rate: ~0.1-1%. Of those errors, about 99% are substitutions and about 1% are insertions / deletions ("indels").

Adapter contamination

Adapter contamination
Older "in-line" or "homebrew" adapters can be added to one or both ends of DNA library fragments. Tools like Sabre (Nik Joshi) can recognize these, separate reads into different files, and remove barcode bases.

Adapter contamination
The problem is heterogeneous fragment sizes, resulting from any of the current library preparation techniques. All libraries will contain DNA fragments of variable size.

Adapter contamination
Contamination is the result of the sequencer reading through a short library fragment and into adapter sequence that didn't come from your sample!
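A minimal sketch of what read-through looks like in a read and how it could be trimmed. It uses exact matching only, which is an assumption made for brevity; real trimmers also tolerate mismatches. The function name, the `min_overlap` parameter, and the example fragment are hypothetical; the adapter string is the start of the TruSeq adapter sequence.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Return the read with 3'-end adapter sequence removed (exact matches only)."""
    # Case 1: the whole adapter occurs inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends part-way through the adapter, so only a prefix
    # of the adapter is present at the 3'-end. Try the longest prefix first.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read  # no adapter found

ADAPTER = "AGATCGGAAGAGC"    # start of the TruSeq adapter sequence
fragment = "ACGTTGCAAGGCTTAC"  # hypothetical short library fragment

print(trim_adapter(fragment + ADAPTER + "ACAC", ADAPTER))  # full adapter in read
print(trim_adapter(fragment + ADAPTER[:6], ADAPTER))       # partial adapter at 3'-end
```

A small `min_overlap` removes more partial adapters but will also clip genuine bases that match the adapter start by chance, so the threshold is a trade-off rather than a fixed rule.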

Adapter contamination
Where can you find adapter sequences?
- Google "github ucdavis-bioinformatics", look for Scythe, look for "*_adapters.fa"
- Check Seqanswers.com
- Contact Illumina, PacBio, etc. for "tech notes" specifying the library prep primer / adapter sequences (not always that clear to work out).
- Find them in your data.

Adapter contamination
>TruSeq_forward_contam
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC[8bp index]atctcgtatgccgtcttctgcttgaaaaa
>TruSeq_reverse_contam
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT[8bp index]gtggtcgccgtatcattaaaaa
>Nextera_forward_contam
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC[8bp index]atctcgtatgccgtcttctgcttg
>Nextera_reverse_contam
CTGTCTCTTATACACATCTGACGCTGCCGACGA[8bp index]gtgtagatctcggtggtcgccgtatcatt
>TruSeq_SmallRNA_forward_contam
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC[6bp adapter]atctcgtatgccgtcttctgcttg
>TruSeq_SmallRNA_reverse_contam
GATCGTCGGACTGTAGAACTCTGAACCTGTCG
Also note the small RNA trimming instructions here: http://dnatech.genomecenter.ucdavis.edu/faqs/ (find "miRNA" on the page)

Base quality in the FASTQ format

Base quality in the FASTQ format
Quality characters run from ASCII 33 ("!") to ASCII 126 ("~"):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Landmark ASCII codes: 33 = "!", 59 = ";", 64 = "@", 73 = "I", 104 = "h", 126 = "~"

Code  Format          Encoding    Raw reads typically   Characters
S     Sanger          Phred+33    (0, 40)               "!" to "I"
X     Solexa          Solexa+64   (-5, 40)              ";" to "h"
I     Illumina 1.3+   Phred+64    (0, 40)               "@" to "h"
J     Illumina 1.5+   Phred+64    (3, 40)               "C" to "h"
L     Illumina 1.8+   Phred+33    (0, 41)               "!" to "J"

For Illumina 1.5+ (J): 0 = unused, 1 = unused, 2 = Read Segment Quality Control Indicator.
Source: https://en.wikipedia.org/wiki/fastq_format
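A short sketch of how the Phred encodings translate into numbers, using the standard definition Q = -10 * log10(P). The function names are made up for this example, and Solexa+64 scores are left out because they use a different formula.

```python
def decode_quality(qual, offset=33):
    """Convert a FASTQ quality string to a list of integer quality scores."""
    return [ord(ch) - offset for ch in qual]

def phred_to_error_prob(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

qual = "#4BDDFFFHHHHHJJJJ"  # start of a quality line from a Phred+33 read
scores = decode_quality(qual, offset=33)
print(scores)
print([round(phred_to_error_prob(q), 4) for q in scores])
```

Passing `offset=64` gives the scores for the older Phred+64 files. Decoding with the wrong offset shifts every score by 31, which is why it pays to check which encoding a file uses before trimming.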

Base qualities

FASTQ - Pop Quiz!
1. What does a quality character of ";" mean?
2. In Sanger (standard) FASTQ, which ASCII character would I use to indicate that I'm absolutely sure that I'm wrong about a particular base?
3. If a particular 40 bp read from a run analyzed with Illumina Pipeline 1.6 (Phred+64) had consistent quality characters of "J", how many errors should you expect in the read?
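The quiz arithmetic can be checked with a few lines of Python. This is a sketch that assumes the Phred relation P = 10 ** (-Q / 10); `expected_errors` is a hypothetical helper name.

```python
def expected_errors(qual, offset):
    """Expected number of wrong bases: the sum of per-base error probabilities."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in qual)

# Question 1: the meaning of ";" (ASCII 59) depends on the encoding.
print(ord(";") - 33)  # score if the file uses a +33 offset
print(ord(";") - 64)  # score if the file uses a +64 offset

# Question 2: Phred 0 means an error probability of 1; with a +33 offset
# that score is written as chr(0 + 33).
print(chr(0 + 33))

# Question 3: 40 bases, all "J", Phred+64.
print(expected_errors("J" * 40, offset=64))
```

For question 3, "J" is Phred 10 under a +64 offset, so each base has a 1-in-10 chance of being wrong and a 40 bp read carries about 4 expected errors. The negative score in question 1 only makes sense in the Solexa+64 scheme, whose scores are not Phred scores.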

FASTQ - Base order / read orientation
An "F/R" pair, or "innies"

Back to contamination / quality issues


Illumina Read IDs
[Figure: example read IDs from older pipelines and from newer pipelines]
Do your FASTQ files begin and end with the same IDs? Incomplete downloads, accidental sorting, different trimming, etc. can get your forward and reverse read files out of sync with each other.
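One way to test whether two files are still in sync is to walk them in parallel and compare IDs. A minimal sketch, assuming strict 4-line FASTQ records (no wrapped sequence lines); the function names are hypothetical, and IDs are compared after dropping the read-number part (a "/1" or "/2" suffix in older pipelines, the text after the space in newer ones).

```python
import gzip
from itertools import zip_longest

def read_ids(path):
    """Yield the read ID of each record in a 4-line-per-record FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 0 and line.strip():  # header line of a record
                # Drop the leading "@", then anything after a space or "/".
                yield line[1:].split()[0].split("/")[0]

def first_mismatch(forward_path, reverse_path):
    """Return (record_index, fwd_id, rev_id) for the first out-of-sync pair, else None."""
    pairs = zip_longest(read_ids(forward_path), read_ids(reverse_path))
    for index, (fwd, rev) in enumerate(pairs):
        if fwd != rev:
            return index, fwd, rev
    return None
```

Calling `first_mismatch("sample_R1.fastq.gz", "sample_R2.fastq.gz")` (file names made up here) returns `None` when every record pairs up. If one file is shorter, the missing side shows up as `None` in the returned tuple.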

Illumina Read IDs
An F/R pair:
@DJB77P1:497:H76H3ADXX:1:1101:1417:2075 1:N:0:GCGCTA
NTTGCGATAAGGCTCCGGATCATTGCGATTGGTCAGCATCACCACCGTCA
+
#4BDDFFFHHHHHJJJJJJJJJJJJIJIJJIJJJJJJJJJJJJJJJJJJJ
@DJB77P1:497:H76H3ADXX:1:1101:1417:2075 2:N:0:GCGCTA
ATGGCGGTATCTATTCTTCGATCGACGATCTGGCGAAGTGGGACGCGGCT
+
C@CFFFFDHHHHGGJGIJGIIIIIGGIGHGIIIEHEGH;CHGAEF<BB/;

Illumina Read IDs
@DJB77P1:497:H76H3ADXX:1:1101:1417:2075 1:N:0:GCGCTA
Filter flag (the "N" in 1:N:0): N = Not a bad read. Seriously. Y = Yes, it did fail the chastity filter. Reads flagged Y are usually removed, but some providers leave them in, and these could be good reads. Or maybe not.
Barcode / index (GCGCTA): may contain mismatches to the real barcode if the pipeline was run allowing mismatches.

Illumina Read IDs
@DJB77P1:497:H76H3ADXX:1:1101:1417:2075 1:N:0:GCGCTA
Control field (the "0" in 1:N:0): most providers now spike a PhiX174 library into every lane. If a read aligns to the PhiX174 reference, this field will contain a number: the coordinate where the read aligns. It may be important to filter these reads out, depending on downstream processing.
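The fields of a newer-style read ID can be pulled apart by splitting on the space and the colons. A minimal sketch; the class and field names are chosen for this example, and it assumes the header has exactly the layout of the example ID above.

```python
from typing import NamedTuple

class ReadID(NamedTuple):
    instrument: str
    run: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int       # 1 = forward, 2 = reverse
    filtered: bool  # True when the flag is "Y" (failed the chastity filter)
    control: str    # control field, kept as text
    index: str      # barcode / index sequence as reported

def parse_read_id(header):
    """Split a newer-style Illumina FASTQ header into named fields."""
    location, description = header.lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, flag, control, index = description.split(":")
    return ReadID(instrument, run, flowcell, int(lane), int(tile),
                  int(x), int(y), int(read), flag == "Y", control, index)

rid = parse_read_id("@DJB77P1:497:H76H3ADXX:1:1101:1417:2075 1:N:0:GCGCTA")
print(rid)
```

With the fields named, filtering becomes a one-line test, for example keeping only records where `rid.filtered` is false.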

Tools!

Scythe

Sickle

Error Correction
Paired-read overlap ("read merging", "paired read assemblers"): FLASH, PEAR, PANDAseq
Three possible outcomes for a read pair:
- Correct bases in overlapping region; output a single read
- No merging / correction possible; output pair of reads
- Correct in overlapping region; trim overhangs (adapter); output single read
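The first two outcomes can be sketched in a few lines. This is a deliberately simplified illustration and not the algorithm used by FLASH, PEAR or PANDAseq: it takes the longest overlap under a mismatch threshold, keeps the higher-quality base at each overlapping position, and does not handle the third case (read-through with adapter overhangs). All names, thresholds and the example fragment are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(fwd, fwd_qual, rev, rev_qual, min_overlap=10, max_mismatch_frac=0.1):
    """Merge an F/R pair whose 3'-ends overlap; return (seq, qual) or None."""
    rev = reverse_complement(rev)  # put read 2 on the same strand as read 1
    rev_qual = rev_qual[::-1]
    # Longest overlap first: a suffix of fwd against a prefix of rev.
    for olap in range(min(len(fwd), len(rev)), min_overlap - 1, -1):
        a, b = fwd[-olap:], rev[:olap]
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches <= max_mismatch_frac * olap:
            break
    else:
        return None  # no acceptable overlap: keep the pair as two reads
    qa, qb = fwd_qual[-olap:], rev_qual[:olap]
    # In the overlap keep whichever call has the higher quality character
    # (valid because both reads use the same quality offset).
    merged = [(x, p) if p >= q else (y, q) for x, y, p, q in zip(a, b, qa, qb)]
    seq = fwd[:-olap] + "".join(base for base, _ in merged) + rev[olap:]
    qual = fwd_qual[:-olap] + "".join(q for _, q in merged) + rev_qual[olap:]
    return seq, qual

# Hypothetical 40 bp fragment sequenced with 30 bp reads from each end.
fragment = "ACGTTGCAAGGCTTACGATCCGATTGACCTGAAGTCCATG"
read1 = fragment[:30]
read2 = reverse_complement(fragment[-30:])
print(merge_pair(read1, "I" * 30, read2, "5" * 30, max_mismatch_frac=0.0))
```

Accepting the longest overlap is the crudest possible choice; with a permissive mismatch threshold it can produce false merges in repetitive sequence, which is one reason dedicated tools score candidate overlaps instead.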

Questions?