NGS sequence preprocessing. José Carbonell Caballero

NGS sequence preprocessing José Carbonell Caballero jcarbonell@cipf.es

Contents
Data format
  Sequence capture
  Fasta and fastq formats
  Sequence quality encoding
Quality control
  Evaluation of sequence quality
  Quality control tools
  Identification of typical artifacts
  Sequence filtering
Practical session

Sequence capture
Raw data (proprietary format) -> FastQ

Different technologies

Genome sequencing: reference genome

Where are we?
NGS pipeline: Sequence preprocessing -> Mapping -> Variant calling -> Downstream analysis

Fasta and Fastq formats
Standard formats for sequence storage
Text-based formats (easy to use!)
(Almost) every programming language has a parser

Fasta format
Nucleotide or peptide sequence
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
>BBTBSCRYR
tgcaccaaacatgtctaaagctggaaccaaaattactttctttgaagacaaaaactttca
aggccgccactatgacagcgattgcgactgtgcagatttccacatgtacctgagccgctg
caactccatcagagtggaaggaggcacctgggctgtgtatgaaaggcccaattttgctgg
gtacatgtacatcctaccccggggcgagtatcctgagtaccagcactggatgggcctcaa
Some typical file extensions (.fasta, .fa, .fna, .ffn, .faa, ...)

Fasta format
Allows multiple sequences per file (typically different chromosomes)
>scaffold_1
CAAGGCTATAGCCCACCCGTTTTTGTGGCCTTTTTCCGCTGGACGAACTTGGCGCCCCGGCCTTCGGGTGGTTATTTTTG
GGTGCAGCCTAGTGCGCGGCCTATTTTTGGCACACGGAGGCCCCTCGCAAAGTCTCGCGGCATTCCGAGTCCGGCGGACG
AAGTTGGCCAGCCCCACCCCCCAAGGCTATAGCCCACCCGTTTTTGTGGCTTTTTTCCGCTGGACGAACTTGGCGCCCCG
GCCTTCGGGCGGTTATTTTTGGCCGCGGCTTCGTCCGTGGCCTATTTTTGGCACACGGAGGCCCCCCGCAAACTCTCCCG
GCATTCCGAGTCCGGCGGACGAAGTCGGCCAGCTTCACCACCCAGGCTGTTGCCCACCCCTTTTTGTGGCCTTTTTCCGC
TGGACGAACTTGGCGCCCCAGCCTTTGGGCTGTTATTTATGGGTGCGGCTTTGTCTACGGCCTATTTTTGGCAGACGCAG
GCCCCTCGCAAAGTCTCGCGGCATTCCGAGTCCGGCGGACGAAGTCGGCCAGCCCCACCCCCCAAGGCTATAGCCCACCC
GTTTTTGTGGCCTTTTTCCGCTGGACGAACTTGGCGCCCCGGCCTTCGGGCGGTTTTTTTCTGGTGCGGTTTTGCCCGCG
GCCTATTTTTGGCACTTGGAGGCCCCTCGTAAAGTCTCGCGGCATTCCGAGTCCGGCGGAAGAAGTTGGCCAGCCCCACC
CTCCAAGGCTATAGCCCACCCGTTTTTGTGGCCTTTTTCCGCTGGACGAACTTGGGGCCCCGGCCTTCGGGCGGTTATTT
--
>scaffold_2
ACGGTCCGGGGGCATCGGGGTGGGGGGGTAGCCGCGCGGCAGTTTGAACGGCGAAAAATGGGCAAGATCGGGGGCCGCTT
AGTAGGCTTTGCCGGTTTCGGCCGTAACTCATGATCCGGGCCTCCGATTCCGTTTCCGTTTGGTCCCACGGGACCAGCGG
TCCGGGGGCATCGGCAAGGGGGGGTAGCCGTGCGGCGGTTTGATCGCCGAAAAATGGGTAAGATCGGGGCCGTTTGGCCC
GTTTTGCCGGTTTCGGCCCCCCGGGGGGCGGTTCATGCCCCGGGGGGGACAGCGGGGCGGGATCGGCACACGGCGGGTGA
GGTGATCGGGTTCAGGCGGGTGCGGTCGGCGGCGGGGGCGCCCGGGGGGCTATAGCGCACCGCCTCGGGGGCCGCAAATT
GGGGGCCGTAACTCATGATCGGGGCCCCCGATTCCGTTTCCGTTTCGTACCACGGGACCAGCGGTCCGGGGGCATCGGGA
TGGGGGGGTAGCCGCGCGGCAGTTTGAACGGCGAAAAATGGGAAAGATCGGGGGCCGCTTAGTAGGCTTTGCCGGTTTCG
GCCGTAACTCATGATCCGGGCCTCCGATTCCGTTTCCGTTTGGTCCCACGGGACCAGCGGTCCGGGGGCATCGGGATGGG
GGGGTAGCCGTGCGGCGGTTTGATCGCCGAAAAATGGGTAAGATCGGGGCCGTTTGGCCCGTTTTGCCGGTTTCGGCCCC
CCGGGGGGCGGTTCATGCCCCGGGGGGCGGTTCATGCCCCGGGGGGGACAGCGGGGCGGGATCGGTACACGGCGGGGGAG
--
>scaffold_21
ACATATATAAAGTATTGTACTAGAAAACATTGTAAATGTATGCCTATTTAAACTTCAAGTATATGTAACACTTTCAAAGT
CATAAGTGTAAAACTATTATATTTAAGATGTTTTGAGTTTATAAAAATAAACAACTTAACACTTCTACTCAGACGTTAAA
AAATAAAAAACAAACATATTATTCTAATATATACATTCATACCCACTTAGACTCATTCAAGTTTAAGATATAAAAAAAAA
AAAAGTGCTACCTAGAAATCCTTAATAGCAAGTCTTGTCTTTTATTTCATTAAGGGTAAAATCCTAAATGTGGCAAATGG
CGTGTTATAAGAATTTTTGAGGGGTGCCAAATGCCCAATTCATTCATATTAGGCTTCTGAAAGAAGGCCCATTCATTCAC
GACTTCGGGTCGCCCACCTGCTTGAATCCGTTTGTGTTTTTAACGTTGCAAGTATACCCATGTAAGTAAAAACAAACAAA
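
Since the format is plain text, a multi-sequence Fasta file can be read with a few lines of code. The following is a minimal sketch, not a replacement for an established parser: the function name `parse_fasta` and the toy sequences are assumptions made for illustration and do not come from the slides.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for Fasta text given as an iterable of lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line       # a sequence may span several lines
    return records


# Toy input (made-up, shortened sequences)
example = """>scaffold_1
CAAGGCTATAGCC
CACCCGTTTTTGT
>scaffold_2
ACGGTCCGGGGGC
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

Building each sequence by string concatenation is fine for small files; for whole genomes, collecting the lines in a list and joining them once would probably be faster.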

Fastq format
Let's say fastq is a fasta with qualities
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Fasta stores genomes... and fastq stores fragments (reads)
http://www.ncbi.nlm.nih.gov/pmc/articles/pmc2847217/
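
The four-line record layout (header, sequence, separator, qualities) can be sketched as a small reader. This is an illustrative example only: `FastqRecord` and `parse_fastq` are hypothetical names, and the code assumes exactly four lines per record with no line wrapping.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from Fastq text, assuming four lines per record."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue                                  # tolerate blank lines
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if qual is None:
            raise ValueError("truncated record: " + header)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield FastqRecord(header[1:], seq, qual)


# The example record shown on the slide
text = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
reads = list(parse_fastq(text))
print(reads[0].name, len(reads[0].sequence))
```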

Sequence quality encoding
Base quality must be encoded in just 1 byte!
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Each base has a corresponding quality value (the quality at position n corresponds to the base at position n)
How is it encoded?
Error probability -> Phred transformation (inverted integer value) -> ASCII encoding
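
Reading the chain backwards (ASCII character, then Phred score, then error probability) might look like the sketch below. The function names are hypothetical, and the default offset of 33 assumes Sanger / Illumina 1.8 encoding.

```python
from typing import List


def decode_qualities(quality: str, offset: int = 33) -> List[int]:
    """Convert each quality character to its Phred score (ASCII code minus offset)."""
    return [ord(char) - offset for char in quality]


def error_probability(phred: int) -> float:
    """Invert the Phred transformation: P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


scores = decode_qualities("!5I")
print(scores)                                   # [0, 20, 40]
print([error_probability(q) for q in scores])   # about 1, 0.01 and 0.0001
```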

Sequence quality encoding
Phred+33: Sanger [0,40], Illumina 1.8 [0,41]
Phred+64: Illumina 1.3 [0,40], Illumina 1.5 [3,40]
http://en.wikipedia.org/wiki/fastq_format
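
Because the two offsets use overlapping ASCII ranges, the encoding of a file can only be guessed from the characters it contains. The heuristic below is an assumption built from the ranges listed above (Phred+33 covers ASCII 33 to 74, Phred+64 covers ASCII 64 to 104); `guess_offset` is a hypothetical helper, and it returns `None` when the observed characters fit both encodings.

```python
from typing import Iterable, Optional


def guess_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Guess the Phred offset (33 or 64) from observed quality characters."""
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        return None
    if min(codes) < 64:      # below '@': impossible under Phred+64
        return 33
    if max(codes) > 74:      # above 'J': impossible under Phred+33 with Q <= 41
        return 64
    return None              # every character is valid in both encodings


print(guess_offset(["!''*5"]))    # 33
print(guess_offset(["hhhhdd"]))   # 64
print(guess_offset(["BCDEF"]))    # None
```

In practice a guess like this becomes more reliable when many reads are inspected, since low-quality characters are rare in short samples.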

Sequence quality encoding: Phred scores

Sequence quality encoding: Phred scores
Error probability -> Phred transformation (inverted integer value) -> ASCII encoding
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
P = 0.01: Q = -10*log10(0.01) = 20; Phred+33: ASCII 33+20 = 53 => '5'
P = 1: Q = -10*log10(1) = 0; Phred+33: ASCII 33+0 = 33 => '!'
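
The two worked examples can be checked with a short sketch of the forward transformation. The function names are hypothetical; rounding to the nearest integer is an assumption about how the "integer value" is obtained.

```python
import math


def phred_from_probability(p: float) -> int:
    """Phred transformation: Q = -10 * log10(P), rounded to an integer."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return round(-10 * math.log10(p))


def encode_quality(phred: int, offset: int = 33) -> str:
    """ASCII encoding: the character whose code is offset + Q."""
    return chr(offset + phred)


for p in (0.01, 1):
    q = phred_from_probability(p)
    print(p, q, encode_quality(q))
# 0.01 20 5
# 1 0 !
```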

Evaluation of sequence quality
Primary tool to assess sequencing
If we evaluate our sequence quality in depth...
... then we will know how reliable our results are
QC determines subsequent filtering
We must be consistent with any filtering decision...
... if not, downstream analysis will suffer the consequences
QC must be repeated after every critical step

Evaluation of sequence quality
How? Quality per base
Quality (or error probability) will also be a topic in the next pipeline steps
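
A per-base quality summary can be sketched as the mean Phred score at each read position. This is only a rough illustration of the idea (tools such as FastQC report fuller distributions per position); the function name and the Phred+33 default are assumptions.

```python
from typing import Iterable, List


def mean_quality_per_position(quality_strings: Iterable[str],
                              offset: int = 33) -> List[float]:
    """Mean Phred score at each position; reads may differ in length."""
    totals: List[int] = []
    counts: List[int] = []
    for quality in quality_strings:
        for i, char in enumerate(quality):
            if i == len(totals):          # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


print(mean_quality_per_position(["II5", "I5"]))   # [40.0, 30.0, 20.0]
```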

Quality control tools: FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Quality control tools: fastx-toolkit
http://hannonlab.cshl.edu/fastx_toolkit/download.html

Quality control tools: NGS QC Toolkit
http://www.nipgr.res.in/ngsqctoolkit.html

Quality control tools: example
GOOD quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc/fastqc_report.html
POOR quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc/fastqc_report.html

Typical artifacts: sequence adapters
http://rowley.mit.edu/bmc_illumina/120414_d0y3c_423f/120414_sn333_0250_ad0y3cacxx/120414_sn333_0250_ad0y3cacxx_pipeline/120414_d0
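
Adapter contamination can be spotted by searching reads for the adapter sequence, either complete or cut off at the 3' end. The sketch below uses exact matching only, whereas dedicated tools such as Cutadapt tolerate mismatches. The function name, the `min_overlap` parameter, the adapter string and the toy reads are illustrative assumptions.

```python
def adapter_start(read: str, adapter: str, min_overlap: int = 3) -> int:
    """Index where the adapter begins in the read, or len(read) if not found."""
    full = read.find(adapter)
    if full != -1:
        return full                                   # complete adapter inside the read
    longest = min(len(adapter), len(read))
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):                # adapter prefix at the 3' end
            return len(read) - k
    return len(read)


ADAPTER = "AGATCGGAAGAGC"          # assumed adapter sequence, for illustration only
toy_reads = [
    "ACGTACGTAGATCGGAAGAGCTTT",    # complete adapter followed by extra bases
    "ACGTACGTAGATC",               # adapter truncated at the end of the read
    "ACGTACGT",                    # no adapter
]
with_adapter = sum(adapter_start(r, ADAPTER) < len(r) for r in toy_reads)
print(with_adapter, "of", len(toy_reads), "reads contain adapter sequence")
print([r[:adapter_start(r, ADAPTER)] for r in toy_reads])
```

When trimming real reads, the quality string has to be cut at the same index so that bases and qualities stay aligned.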

Typical artifacts: poor quality data
http://cbio.mskcc.org/~lianos/files/scott/2011-11-21/qc/hesr2_cgatgt_l001_r1_001_fastqc/fastqc_report.html

Typical artifacts: platform dependent

Sequence filtering (and editing)
Remove bad quality data
Improve confidence of downstream analysis

Sequence filtering (and editing)
Tail quality trimming: minimum quality threshold
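
Tail quality trimming with a minimum quality threshold can be sketched as follows: bases are removed from the 3' end until one meets the threshold. The function name, the default threshold of Q20 and the Phred+33 offset are assumptions; real trimmers often use sliding windows or running sums instead of this simple rule.

```python
from typing import Tuple


def trim_tail(sequence: str, quality: str,
              min_quality: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop 3' bases whose Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]      # bases and qualities stay aligned


print(trim_tail("ACGTAC", "IIII#!"))   # ('ACGT', 'IIII')
```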

Sequence filtering (and editing)
Mean quality
Read length
Read length after trimming
Percentage of bases above Q
Adapter trimming
Adapter reads
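
Several of these criteria (read length, mean quality, percentage of bases above Q) can be combined into one pass/fail decision per read. The sketch below is one possible arrangement; the function name and every threshold value are assumptions chosen for illustration, not recommendations from the slides.

```python
def passes_filters(quality: str, min_length: int = 30, min_mean_q: float = 20.0,
                   min_q: int = 20, min_fraction: float = 0.8,
                   offset: int = 33) -> bool:
    """Return True if a read (already trimmed) meets all quality criteria."""
    scores = [ord(char) - offset for char in quality]
    if not scores or len(scores) < min_length:
        return False                                   # too short
    if sum(scores) / len(scores) < min_mean_q:
        return False                                   # mean quality too low
    fraction = sum(s >= min_q for s in scores) / len(scores)
    return fraction >= min_fraction                    # enough bases at or above Q


print(passes_filters("I" * 40))               # True
print(passes_filters("I" * 10))               # False: shorter than min_length
print(passes_filters("I" * 20 + "!" * 20))    # False: only half the bases reach Q20
```

Applying the same thresholds to every sample keeps filtering decisions consistent across the whole analysis.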

Sequence filtering tools
Fastx-toolkit
Galaxy (https://main.g2.bx.psu.edu/)
SeqTK (https://github.com/lh3/seqtk)
Cutadapt (http://code.google.com/p/cutadapt/)
And more...

Any questions?