Read Quality Assessment & Improvement UCD Genome Center Bioinformatics Core Tuesday 14 June 2016
QA&I should be interactive
Error modes Each technology has unique error modes, depending on the physico-chemical processes involved in the whole sequencing life cycle (not just the base-calling step). Improving reads will work better if the assumptions made by the remediation tools match the source(s) of error. How do you know whether they match? Trial and error: QA&I is experimental, just like bench science.
Illumina read problems: contaminating sequence within reads (adapters, adapter dimers); poor quality and/or wrong sequence (substitution, insertion / deletion (indel) errors); sample contamination; chimerism in library; sampling bias.
Illumina errors Illumina errors are biased: they occur after some sequence motifs (not well addressed by any tools currently, IMO), and predominantly at the 3'-ends of reads. Polymerase errors explain isolated errors, but the 3' bias is less intuitive.
Illumina - 3'-end errors (glass substrate)
Illumina - 3'-end errors (all eight strands shown on the glass substrate are identical at each step)
  add sequencing primers:  5'-CTCTTCCGATCT
  cycle 1:                 5'-CTCTTCCGATCTC
  cycle 2:                 5'-CTCTTCCGATCTCT
  cycle 3:                 5'-CTCTTCCGATCTCTC
Illumina - 3'-end errors
  5'-CTCTTCCGATCTCTCTGCGCTTGAGAG    in phase
  5'-CTCTTCCGATCTCTCTGCGCTTGAGAG    in phase
  5'-CTCTTCCGATCTCTCTGCGCTTGAGAG    in phase
  5'-CTCTTCCGATCTCTCTGCGCTTGAGAG    in phase
  (glass substrate)
  5'-CTCTTCCGATCTCTCTGCGCTTGAGAGA   pre-phasing (+1)
  5'-CTCTTCCGATCTCTCTGCGCTTGAGAG    in phase
  5'-CTCTTCCGATCTCTCTGCGCTTGAGA     post-phasing (-1)
  5'-CTCTTCCGATCTCTCTGCGCTTGAGAG    in phase
Illumina - 3'-end errors [Figure, cycle 1: # of molecules (A, T, C, G) vs. true cycle offset (pre- / post-phasing events), offsets -2 to +2]
Illumina - 3'-end errors [Figure, cycle 15: # of molecules (A, T, C, G) vs. true cycle offset, offsets -2 to +3; stochastic variability; process error]
Illumina - 3'-end errors [Figure: intensity (A, T, C, G) vs. cycle offset, offsets -2 to +3; measurement error]
Illumina - 3'-end errors [Figure: measurement error]
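The phasing slides can be made concrete with a small calculation. The sketch below is an illustration only, not Illumina's base-calling model: it assumes every strand in a cluster independently falls one base behind (post-phasing) or jumps one base ahead (pre-phasing) with a fixed probability per cycle. The 0.5% rates and the function name are made-up values chosen to show the trend.

```python
def offset_distribution(cycles, p_lag=0.005, p_lead=0.005):
    """Fraction of strands in a cluster at each offset from the true cycle.

    p_lag  : assumed per-cycle chance a strand fails to extend (post-phasing, -1)
    p_lead : assumed per-cycle chance a strand extends twice (pre-phasing, +1)
    """
    dist = {0: 1.0}  # before cycle 1 every strand is in phase
    steps = ((-1, p_lag), (0, 1.0 - p_lag - p_lead), (+1, p_lead))
    for _ in range(cycles):
        nxt = {}
        for offset, frac in dist.items():
            for step, prob in steps:
                nxt[offset + step] = nxt.get(offset + step, 0.0) + frac * prob
        dist = nxt
    return dist


for cycle in (1, 15, 100, 250):
    in_phase = offset_distribution(cycle)[0]
    print(f"cycle {cycle:3d}: {in_phase:.3f} of strands in phase")
```

Under these assumptions the in-phase fraction shrinks with every cycle, so the signal from the correct base is increasingly mixed with signal from neighbouring bases toward the 3'-end of the read.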
Illumina - error rates Overall Illumina error rate: ~0.1-1%. Of those errors, about 99% are substitutions and about 1% are insertions / deletions (indels).
Adapter contamination
Adapter contamination Older "in-line" or "homebrew" adapters can be added to one or both ends of DNA library fragments. Tools like Sabre (Nik Joshi) can recognize these, separate reads into different files, and remove barcode bases.
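To illustrate what separating reads by in-line barcode involves, here is a minimal sketch. It is not Sabre's implementation or command line: it assumes exact matches only, barcodes at the very start of the read, and that no barcode is a prefix of another. The sample names and barcodes are hypothetical.

```python
def demultiplex(reads, barcodes):
    """Split (name, seq, qual) reads by exact 5' barcode match.

    barcodes maps barcode sequence -> sample name (hypothetical names here).
    Returns {sample: [reads with barcode bases removed]};
    reads matching no barcode are kept untrimmed under "unknown".
    """
    bins = {sample: [] for sample in barcodes.values()}
    bins["unknown"] = []
    for name, seq, qual in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                n = len(barcode)
                # trim the barcode from both the bases and the qualities
                bins[sample].append((name, seq[n:], qual[n:]))
                break
        else:
            bins["unknown"].append((name, seq, qual))
    return bins


reads = [
    ("r1", "ACGTTTGACCA", "IIIIIIIIIII"),
    ("r2", "TGCAGGCATTA", "IIIIIIIIIII"),
    ("r3", "NNNNGGCATTA", "#####IIIIII"),
]
bins = demultiplex(reads, {"ACGT": "sampleA", "TGCA": "sampleB"})
for sample, members in bins.items():
    print(sample, [name for name, _, _ in members])
```

A real run would write each bin to its own FASTQ file and would usually tolerate a mismatch or two in the barcode.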
Adapter contamination The problem is heterogeneous fragment sizes, resulting from any of the current library preparation techniques. All libraries will contain DNA fragments of variable size.
Adapter contamination Contamination is the result of the sequencer reading through a short fragment and on into adapter sequence that didn't come from your sample!
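Read-through can be reproduced in a few lines. The sketch below simulates an error-free read from a fragment shorter than the read length, then removes the adapter by exact matching. It is a deliberately naive illustration and not how Scythe works; `ADAPTER` is the 13-base prefix shared by the two TruSeq contaminant sequences, and the fragment sequences are invented.

```python
ADAPTER = "AGATCGGAAGAGC"  # prefix shared by the TruSeq forward/reverse contaminants


def sequence_fragment(fragment, read_length, adapter=ADAPTER):
    """Simulate an error-free read: a short fragment runs on into the adapter."""
    template = fragment + adapter
    return template[:read_length]


def trim_adapter(read, adapter=ADAPTER, min_overlap=3):
    """Remove a complete adapter, or a partial adapter at the 3' end of the read."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # partial adapter: longest adapter prefix that is also a suffix of the read
    for n in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    return read


short_read = sequence_fragment("ACGTACGTAC", read_length=15)
long_read = sequence_fragment("ACGTACGTACGGTTCCAATT", read_length=15)
print(short_read, "->", trim_adapter(short_read))
print(long_read, "->", trim_adapter(long_read))
```

Exact matching breaks down as soon as the adapter bases contain sequencing errors, which is most likely at the low-quality 3'-end. That is why dedicated trimmers take base qualities and mismatches into account.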
Adapter contamination Where can you find adapter sequences? Google "github ucdavis-bioinformatics", look for Scythe, then look for "*_adapters.fa". Check Seqanswers.com. Contact Illumina, PacBio, etc. for "tech notes" specifying the library prep primer / adapter sequences (not always that clear to work out). Find them in your data.
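One way to "find them in your data" is to look for subsequences shared by far more reads than chance would allow. The sketch below counts, for each k-mer, how many reads contain it. It is a rough illustration under simple assumptions (exact k-mers, everything held in memory). The example reads are invented; three of them end in a TruSeq-like adapter.

```python
from collections import Counter


def overrepresented_kmers(seqs, k=12, top=5):
    """Return the k-mers present in the largest number of reads."""
    counts = Counter()
    for seq in seqs:
        # a set, so each k-mer is counted at most once per read
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts.most_common(top)


reads = [
    "ACGTTGCAAGATCGGAAGAGCAC",
    "TTGACCAGATCGGAAGAGCACAC",
    "GGCATTACGGAGATCGGAAGAGC",
    "CCATGGTTAACCGGTTAACCGGT",
]
for kmer, n_reads in overrepresented_kmers(reads, k=12, top=2):
    print(kmer, n_reads)
```

On real data you would sample a few hundred thousand reads, and genuinely abundant biological sequences (rRNA, repeats) will show up alongside adapters, so the hits still need to be checked against known adapter sequences.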
Adapter contamination
>TruSeq_forward_contam
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC[8bp index]atctcgtatgccgtcttctgcttgaaaaa
>TruSeq_reverse_contam
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT[8bp index]gtggtcgccgtatcattaaaaa
>Nextera_forward_contam
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC[8bp index]atctcgtatgccgtcttctgcttg
>Nextera_reverse_contam
CTGTCTCTTATACACATCTGACGCTGCCGACGA[8bp index]gtgtagatctcggtggtcgccgtatcatt
>TruSeq_SmallRNA_forward_contam
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC[6bp adapter]atctcgtatgccgtcttctgcttg
>TruSeq_SmallRNA_reverse_contam
GATCGTCGGACTGTAGAACTCTGAACCTGTCG
Also note the small RNA trimming instructions here: http://dnatech.genomecenter.ucdavis.edu/faqs/ (search the page for "mirna").
Base quality in the FASTQ format
Base quality in the FASTQ format (source: https://en.wikipedia.org/wiki/fastq_format)
Quality characters, ASCII 33 to 126:
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Landmark ASCII codes: 33 = !, 59 = ;, 64 = @, 73 = I, 104 = h, 126 = ~
  S - Sanger          Phred+33,   raw reads typically (0, 40)
  X - Solexa          Solexa+64,  raw reads typically (-5, 40)
  I - Illumina 1.3+   Phred+64,   raw reads typically (0, 40)
  J - Illumina 1.5+   Phred+64,   raw reads typically (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
  L - Illumina 1.8+   Phred+33,   raw reads typically (0, 41)
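The encodings listed above all reduce to "subtract an offset from the ASCII code, then convert the score to a probability". A minimal sketch follows; the function names are my own. Phred scores are defined by Q = -10 log10(p), and the old Solexa scale uses odds, Q = -10 log10(p / (1 - p)).

```python
def quality_score(char, offset=33):
    """Quality score encoded by one FASTQ quality character (offset 33 or 64)."""
    return ord(char) - offset


def phred_error_probability(q):
    """Error probability for a Phred score: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def solexa_error_probability(q):
    """Error probability for a Solexa score: Q = -10 * log10(p / (1 - p))."""
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)


for char, offset in (("I", 33), ("h", 64), ("!", 33)):
    q = quality_score(char, offset)
    print(f"{char!r} with offset {offset}: Q{q}, p(error) = {phred_error_probability(q):g}")
```

The same character means very different things under the two offsets, so the encoding has to be known (or guessed from the range of characters present) before any quality-based filtering.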
Base qualities
FASTQ - Pop Quiz! 1. What does a quality character of ";" mean? 2. In Sanger (standard) FASTQ, which ASCII character would I use to indicate that I'm absolutely sure that I'm wrong about a particular base? 3. If a particular 40 bp read from a run analyzed with Illumina Pipeline 1.6 (phred + 64) had consistent quality characters of "J", how many errors should you expect in the read?
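For checking answers to questions like the third one: the expected number of wrong bases in a read is the sum of the per-base error probabilities. The sketch below assumes Phred-scaled (not Solexa-scaled) qualities that are well calibrated. The function name and the example quality string are illustrative, not taken from the quiz.

```python
def expected_errors(quality_string, offset=33):
    """Sum of per-base error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10 ** (-(ord(char) - offset) / 10) for char in quality_string)


# 45 bases at Q40 followed by 5 bases at Q2, Sanger / Illumina 1.8+ encoding
quals = "I" * 45 + "#" * 5
print(f"{expected_errors(quals):.2f} expected errors in {len(quals)} bases")
```

A handful of low-quality bases dominates the total, which is the reasoning behind trimming read ends instead of discarding whole reads.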
FASTQ - Base order / read orientation An "F/R" pair, or "innies"
Back to contamination / quality issues
Illumina Read IDs [Figure: read ID layout in older pipelines vs. newer pipelines] Do your FASTQ files begin and end with the same IDs? Incomplete downloads, accidental sorting, different trimming, etc. can get your forward and reverse read files out of sync with each other.
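Checking only the first and last IDs can miss a problem in the middle of the file, so a full pass is cheap insurance. This is a minimal sketch, assuming plain 4-line FASTQ records and that mates share everything before the first space (newer pipelines) or before a trailing "/1" or "/2" (older pipelines). The helper names and the second read ID are hypothetical.

```python
import io
from itertools import zip_longest


def read_ids(handle):
    """Yield each record's ID without the mate-specific part."""
    for line_number, line in enumerate(handle):
        if line_number % 4 != 0:      # only header lines carry the ID
            continue
        fields = line[1:].split()     # drop '@' and the "1:N:0:..." field
        name = fields[0] if fields else ""
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def first_mismatch(fwd_handle, rev_handle):
    """Return (record number, fwd ID, rev ID) for the first out-of-sync pair, else None."""
    pairs = zip_longest(read_ids(fwd_handle), read_ids(rev_handle))
    for record_number, (fwd_id, rev_id) in enumerate(pairs, start=1):
        if fwd_id != rev_id:
            return record_number, fwd_id, rev_id
    return None


def fastq(*headers):
    """Build a tiny in-memory FASTQ file (sequence and qualities are dummies)."""
    return io.StringIO("".join(f"{h}\nACGT\n+\nIIII\n" for h in headers))


a = "@DJB77P1:497:H76H3ADXX:1:1101:1417:2075"
b = "@DJB77P1:497:H76H3ADXX:1:1101:1500:2100"  # hypothetical second read
in_sync = first_mismatch(fastq(a + " 1:N:0:GCGCTA", b + " 1:N:0:GCGCTA"),
                         fastq(a + " 2:N:0:GCGCTA", b + " 2:N:0:GCGCTA"))
truncated = first_mismatch(fastq(a + " 1:N:0:GCGCTA", b + " 1:N:0:GCGCTA"),
                           fastq(a + " 2:N:0:GCGCTA"))
print(in_sync)
print(truncated)
```

With real files, pass open file handles (or `gzip.open(path, "rt")` for compressed FASTQ) in place of the in-memory examples.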
Illumina Read IDs: an F/R pair
@DJB77P1:497:H76H3ADXX:1:1101:1417:2075 1:N:0:GCGCTA
NTTGCGATAAGGCTCCGGATCATTGCGATTGGTCAGCATCACCACCGTCA
+
#4BDDFFFHHHHHJJJJJJJJJJJJIJIJJIJJJJJJJJJJJJJJJJJJJ
@DJB77P1:497:H76H3ADXX:1:1101:1417:2075 2:N:0:GCGCTA
ATGGCGGTATCTATTCTTCGATCGACGATCTGGCGAAGTGGGACGCGGCT
+
C@CFFFFDHHHHGGJGIJGIIIIIGGIGHGIIIEHEGH;CHGAEF<BB/;
Illumina Read IDs: @DJB77P1:497:H76H3ADXX:1:1101:1417:2075 1:N:0:GCGCTA
Filter flag ("N" here): N = Not a bad read. Seriously. Y = Yes, it did violate the chastity filter. Usually these are removed, but some providers leave them in, and these could be good reads. Or maybe not.
Barcode / Index ("GCGCTA" here): may contain mismatches to the real barcode, if the pipeline was run allowing mismatches.
Illumina Read IDs: @DJB77P1:497:H76H3ADXX:1:1101:1417:2075 1:N:0:GCGCTA
Control field ("0" here): most providers now spike a PhiX174 library into every lane. If a read aligns to the PhiX174 reference, this field will contain a number: the coordinate where the read aligns. It may be important to filter these reads out, depending on downstream processing.
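The fields annotated on these slides can be pulled apart mechanically. The sketch below assumes the newer-pipeline header layout, instrument:run:flowcell:lane:tile:x:y followed by a space and read:filter:control:index. The class and field names are my own choices, not an official API.

```python
from typing import NamedTuple


class ReadID(NamedTuple):
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int        # 1 = forward, 2 = reverse
    filtered: bool   # True when the flag is "Y" (read failed the chastity filter)
    control: int     # 0 for ordinary reads
    index: str       # barcode / index sequence as reported by the pipeline


def parse_read_id(header):
    """Parse a newer-pipeline FASTQ header line into its fields."""
    location, _, description = header.strip().lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, flag, control, index = description.split(":")
    return ReadID(instrument, int(run), flowcell, int(lane), int(tile),
                  int(x), int(y), int(read), flag == "Y", int(control), index)


rid = parse_read_id("@DJB77P1:497:H76H3ADXX:1:1101:1417:2075 1:N:0:GCGCTA")
print(rid)
keep = not rid.filtered and rid.control == 0
print("keep read:", keep)
```

Headers from older pipelines, or files whose description field was stripped by another tool, will not unpack into these fields and will raise a `ValueError`; that is a useful signal that the file is not in the format you assumed.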
Tools!
Scythe
Sickle
Error Correction Paired-read overlap ("read merging", "paired read assemblers"): FLASH, PEAR, PANDAseq. Three outcomes: (1) reads overlap: correct bases in the overlapping region; output a single read. (2) reads do not overlap: no merging / correction possible; output the pair of reads. (3) reads overlap and extend past the fragment ends: correct in the overlapping region; trim overhangs (adapter); output a single read.
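The three outcomes can be sketched with a simple ungapped overlap search. This is only an illustration of the idea, not the algorithm used by FLASH, PEAR, or PANDAseq: it slides the reverse-complemented reverse read along the forward read, accepts the best-scoring placement under a mismatch limit, and keeps the higher-quality base wherever the reads disagree. The thresholds, the scoring rule, and the test sequences are all assumptions.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, fwd_qual, rev, rev_qual, min_overlap=10, max_mismatch_frac=0.1):
    """Return (sequence, qualities) of the merged read, or None if no overlap passes."""
    rc, rc_qual = reverse_complement(rev), rev_qual[::-1]
    best = None  # (score, shift, start, end); rc[j] is aligned with fwd[j + shift]
    for shift in range(min_overlap - len(rc), len(fwd) - min_overlap + 1):
        start, end = max(shift, 0), min(len(fwd), len(rc) + shift)
        overlap = end - start
        if overlap < min_overlap:
            continue
        mismatches = sum(fwd[i] != rc[i - shift] for i in range(start, end))
        if mismatches > max_mismatch_frac * overlap:
            continue
        score = overlap - 2 * mismatches  # matches minus mismatches
        if best is None or score > best[0]:
            best = (score, shift, start, end)
    if best is None:
        return None  # outcome 2: keep the reads as a pair
    _, shift, start, end = best
    seq, qual = [], []
    for i in range(start, end):  # consensus: trust the higher-quality base
        j = i - shift
        if fwd_qual[i] >= rc_qual[j]:
            seq.append(fwd[i])
            qual.append(fwd_qual[i])
        else:
            seq.append(rc[j])
            qual.append(rc_qual[j])
    # shift >= 0: outcome 1, keep the unique ends of both reads.
    # shift < 0: outcome 3, the overhangs are adapter and are dropped.
    tail, tail_qual = (rc[end - shift:], rc_qual[end - shift:]) if shift >= 0 else ("", "")
    return fwd[:start] + "".join(seq) + tail, fwd_qual[:start] + "".join(qual) + tail_qual


fragment = "ATGGCGTACCTGAAGTCTAGGCATTCGAACGTTAGCCATGGACTTGCAGATCCGTAAGCT"  # invented, 60 bp
fwd = fragment[:40]
rev = reverse_complement(fragment)[:40]
merged = merge_pair(fwd, "I" * 40, rev, "I" * 40)
print(merged[0] == fragment)
```

Real mergers use statistical scoring of candidate overlaps and handle quality-weighted consensus more carefully, which matters most for short overlaps where a chance match is plausible.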
Questions?