DATA FORMATS AND QUALITY CONTROL

Similar documents
Quality assessment and control of sequence data. Naiara Rodríguez-Ezpeleta

Quality assessment and control of sequence data

NGS sequence preprocessing. José Carbonell Caballero

Read Quality Assessment & Improvement. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

Next Generation Sequencing Lecture Saarbrücken, 19. March Sequencing Platforms

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech

Sanger vs Next-Gen Sequencing

De Novo Assembly of High-throughput Short Read Sequences

Genomic DNA ASSEMBLY BY REMAPPING. Course overview

Illumina Read QC. UCD Genome Center Bioinformatics Core Monday 29 August 2016

Genomics AGRY Michael Gribskov Hock 331

Next Gen Sequencing. Expansion of sequencing technology. Contents

Reference genomes and common file formats

Welcome to the NGS webinar series

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl

Introduction to RNA sequencing

Long and short/small RNA-seq data analysis

Incorporating Molecular ID Technology. Accel-NGS 2S MID Indexing Kits

Introduction to transcriptome analysis using High Throughput Sequencing technologies. D. Puthier 2012

Introduction to Next Generation Sequencing (NGS) Andrew Parrish Exeter, 2 nd November 2017

Introductie en Toepassingen van Next-Generation Sequencing in de Klinische Virologie. Sander van Boheemen Medical Microbiology

Course Presentation. Ignacio Medina Presentation

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data

CNV and variant detection for human genome resequencing data - for biomedical researchers (II)

axe Documentation Release g6d4d1b6-dirty Kevin Murray

RIPTIDE HIGH THROUGHPUT RAPID LIBRARY PREP (HT-RLP)

Differential gene expression analysis using RNA-seq

Next-Generation Sequencing. Technologies

Next-Generation Sequencing Services à la carte

Introductory Next Gen Workshop

Read Mapping and Variant Calling. Johannes Starlinger

A step-by-step guide to ChIP-seq data analysis

RNA-seq Data Analysis

Data Analysis with CASAVA v1.8 and the MiSeq Reporter

Post- sequencing quality evalua2on. or what to do when you get your reads from the sequencer

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang

Next Generation Sequencing: An Overview

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

RNAseq Differential Gene Expression Analysis Report

European Union Reference Laboratory for Genetically Modified Food and Feed (EURL GMFF)

RNA-Seq Software, Tools, and Workflows

Introduction to RNA-Seq

scgem Workflow Experimental Design Single cell DNA methylation primer design

From reads to results. Dr Torsten Seemann

Base Composition of Sequencing Reads of Chromium Single Cell 3 v2 Libraries

Basic Bioinformatics: Homology, Sequence Alignment,

ChIP-seq analysis. adapted from J. van Helden, M. Defrance, C. Herrmann, D. Puthier, N. Servant

L3: Short Read Alignment to a Reference Genome

NGS technologies approaches, applications and challenges!

RNA Seq: Methods and Applica6ons. Prat Thiru

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Implementation of Automated Sample Quality Control in Whole Exome Sequencing

Sequencing techniques and applications

Targeted Sequencing in the NBS Laboratory

RNA- seq data analysis tutorial. Andrea Sboner

Single Cell Genomics

HLA and Next Generation Sequencing it s all about the Data

Automated size selection of NEBNext Small RNA libraries with the Sage Pippin Prep

TECH NOTE SMARTer T-cell receptor profiling in single cells

Analysis of barcode sequencing

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Alignment. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

Next Generation Sequencing Technologies. Some slides are modified from Robi Mitra s lecture notes

Lecture 2: Biology Basics Continued

Sequencing Theory. Brett E. Pickett, Ph.D. J. Craig Venter Institute

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Fundamentals of Next-Generation Sequencing: Technologies and Applications

Next Generation Sequencing. Dylan Young Biomedical Engineering

Green Center Computational Core ChIP- Seq Pipeline, Just a Click Away

with drmid Dx for Illumina NGS systems

Workflow of de novo assembly

Introduction to taxonomic analysis of metagenomic amplicon and shotgun data with QIIME. Peter Sterk EBI Metagenomics Course 2014

MEEPTOOLS: A maximum expected error based FASTQ read filtering and trimming toolkit

RNA-seq data analysis with Chipster. Eija Korpelainen CSC IT Center for Science, Finland

How much sequencing do I need? Emily Crisovan Genomics Core

A Roadmap to the De-novo Assembly of the Banana Slug Genome

SUPPLEMENTARY MATERIAL AND METHODS

SOLiD Total RNA-Seq Kit SOLiD RNA Barcoding Kit

Performance characteristics of the High Sensitivity DNA kit for the Agilent 2100 Bioanalyzer

Genome Reagent Kits v2 Quick Reference Cards

SNP calling and VCF format

Digital DNA/RNA sequencing enables highly accurate and sensitive biomarker detection and quantification

Mate-pair library data improves genome assembly

DNA vs. RNA DNA: deoxyribonucleic acid (double stranded) RNA: ribonucleic acid (single stranded) Both found in most bacterial and eukaryotic cells RNA

Targeted Sequencing Using Droplet-Based Microfluidics. Keith Brown Director, Sales

RNA-Sequencing analysis

DNA Technology. Asilomar Singer, Zinder, Brenner, Berg

MHC Region. MHC expression: Class I: All nucleated cells and platelets Class II: Antigen presenting cells

Index. E Electrophoretic Mobility Shift Assay (EMSA), 262 ENCODE project, 223, 224 European Nucleotide Archive (ENA), 34

Axygen AxyPrep Magnetic Bead Purification Kits. A Corning Brand

Introduction to NGS Analysis Tools

therascreen BRCA1/2 NGS FFPE gdna Kit Handbook Part 2: Analysis

De novo metatranscriptome assembly and coral gene expression profile of Montipora capitata with growth anomaly

Genome Sequence Assembly

Alignment methods. Martijn Vermaat Department of Human Genetics Center for Human and Clinical Genetics

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

Genomics and Transcriptomics of Spirodela polyrhiza

BENG 183 Trey Ideker. Genome Assembly and Physical Mapping

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Transcription:

HTS Summer School 12-16th September 2016 DATA FORMATS AND QUALITY CONTROL Romina Petersen, University of Cambridge (rp520@medschl.cam.ac.uk) Luigi Grassi, University of Cambridge (lg490@medschl.cam.ac.uk)

HTS analysis workflow 2 Biological Sample Prepare library, sequence FASTQ Align short reads to genome VCF, BED, Downstream analysis SAM/BAM Call variants, differential expression,

From the sequencer to you 3 Sequencing is usually done by core facilities Each sequencing run will generate millions of short (~100 bp) reads + read quality score for each base They often perform initial processing Adaptor trimming Basic quality control Demultiplexing You will (usually) receive a FASTQ file

FASTQ Files 4 FASTQ = FASTA + Quality So what is FASTA?

FASTA Format (.fa,.fasta) 5 >CHROMOSOME_1 GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC AACTCACAGTTTGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATT >CHROMOSOME_2 TCACAGTTTGGTTCAAAGCAGTATCGATCATATCGATCAAATAGTAAA Format for raw DNA/protein sequences For each sequence: 1. > NAME 2. Nucleotides, with line breaks every ~60 bp

FASTQ format (.fq,.fastq) 6 Format for DNA sequencing reads For each read: 1. @ Read ID 2. Nucleotide sequence of the read 3. + 4. Quality score for each nucleotide of the read

7 Illumina sequence identifiers

Quality Scores 8 p = probability that the base call is wrong p = 0.1 à Q = 10 p = 0.01 à Q = 20 P = 0.001 à Q = 30 Encoding: The Sanger Phred format can encode a quality score from 0 to 93 using ASCII 33 to 126: 33 126

Quality Score Encoding 9 ASCII Code Quality Score!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI 33 59 64 73 0...26...31...40 Each character has an associated ASCII Code Quality Score = ASCII Code Offset Normal Sanger Encoding is Phred + 33 Lowest:! = ASCII 33 = Quality 0 Highest: I = ASCII 73 = Quality 40

Quality Encoding Example 10!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI 33 40 45 50 55 59 64 70 73 0...7...17...26...31...40

Different Quality Encodings 11 Beware of different versions! (especially for old data)

Paired-end reads 12 Sequence both the 5' and 3' end of a fragment Will result in two FASTQ files for a single run file1 n file2 n

single-end vs paired-end reads 13 my_sequence.fastq @HWI-BRUNOP16X_0001:1:1:1466:1018#0/1 AAGGAAGTGCTTGTCTGGCTAACACAGCNAGNCACGTGAC + avfbe`^^^_tttssdffffdfffabbzbbfebafbbbbb SE my_sequence_1.fastq @HWI-BRUNOP16X_0001:1:1:1278:989#0/1 NAAATTTCGAATTTCTGTGAAGTAAGCATCTTCTTTGTCAT + BJJGGKIINN^^^^^QQNTUQOOTTTRTOTY^^Y^\\^^^\ my_sequence_2.fastq PE (from the Illumina website) @HWI-BRUNOP16X_0001:1:1:1278:989#0/2 AACCCACACAGGAGAGCAGCCTTACAGATGCAAATACTGTG + ]K fffffggghgeggggggdgggggfgggggegggghh

Sample barcoding and de-multiplexing 14 Barcoding De-multiplexing Barcode 1 Barcode 2 Barcode 3 + Sample 1 Sample 2 Sample 3 # Read 1 ATTAGACCTAAGCA # Read 2 GAGCAACGACTAC # Read 3 ATTAGGCCATACAT # Read 4 CCATAGGCTGACTA...

Common sequence artefacts in HTS data 15 q Read errors q Base calling errors q Small insertions and deletions q Poor quality reads q Primer / adapter contamination

16 Features across available QC tools

17 FastqScreen: contamination

FastQC: Per base sequence quality 18 FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

19 Quality trimming

Removal of adapter sequences 20 q Necessary when the read length > molecule sequenced e.g. small RNAs. 1 2 3 q Different scenarios requiring adapter removal 1 2 3 q q q Trim the 3 end Trim/discard the reads based on the residual minimum read length. Trim the adapter region but retain reads only with a minimum read-length. q Tools for adapter trimming q fastx_clipper (FastX-Toolkit), PRINSEQ

How to filter? 21 Fastx: http://hannonlab.cshl.edu/fastx_toolkit/ PRINSEQ: http://prinseq.sourceforge.net/ Tally and Reaper: http://www.ebi.ac.uk/~stijn/reaper/tally.html http://www.ebi.ac.uk/~stijn/reaper/reaper.html#recipe http://www.ebi.ac.uk/~stijn/reaper/src/reaper-12-048/ ShortRead (R): http://www.bioconductor.org/packages/release/bioc/html/ ShortRead.html

FASTQ Processing FASTX Toolkit 22 http://hannonlab.cshl.edu/fastx_toolkit/ Many tools for common operations on FASTQ files: Conversion Trimming (remove barcodes) Clipping (remove adapters) Quality trimmer (trim off low-quality bases) Quality filter (remove low-quality reads)

Filtering example 23 Before After

24 Filtering comes at a price

Important: PE files 25 my_sequence_1.fastq @HWI-BRUNOP16X_0001:1:1:1278:989#0/1 NAAATTTCGAATTTCTGTGAAGTAAGCATCTTCTTTGTCAT + BJJGGKIINN^^^^^QQNTUQOOTTTRTOTY^^Y^\\^^^\ my_sequence_2.fastq @HWI-BRUNOP16X_0001:1:1:1278:989#0/2 AACCCACACAGGAGAGCAGCCTTACAGATGCAAATACTGTG + ]K fffffggghgeggggggdgggggfgggggegggghh filtering filtering mapping

Important: PE files 26 my_sequence_1.fastq @HWI-BRUNOP16X_0001:1:1:1278:989#0/1 NAAATTTCGAATTTCTGTGAAGTAAGCATCTTCTTTGTCAT + BJJGGKIINN^^^^^QQNTUQOOTTTRTOTY^^Y^\\^^^\ my_sequence_2.fastq @HWI-BRUNOP16X_0001:1:1:1278:989#0/2 AACCCACACAGGAGAGCAGCCTTACAGATGCAAATACTGTG + ]K fffffggghgeggggggdgggggfgggggegggghh filtering filtering mapping How? Trim Galore! (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)

Conclusions 27 Quality control of sequencing data is essential for downstream analysis. A range of QC tools are available to remove noise. Decide on which data can be corrected and discard the rest.