ISO/IEC JTC 1/SC 29/WG 11 N15527, Warsaw, PL, June 2015: Introduction


INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC 1/SC 29/WG 11
CODING OF MOVING PICTURES AND AUDIO

ISO/IEC JTC 1/SC 29/WG 11 N15527
Warsaw, PL, June 2015

Source: Requirements
Authors: Claudio Alberti, Marco Mattavelli (EPFL)
Title: Genome Compression 101 - Tutorial on Genome Compression and Storage

1 Introduction

This slide set was presented during the 111th MPEG meeting held in Geneva in February 2015 and published at the 112th MPEG meeting in Warsaw (June 2015). It provides an introduction to the current status of genomic information representation and storage, focusing on the sequencing and alignment stages of the genome information life cycle.

Genome Information Compression

The Human Genome
4 possible symbols (bases): A, C, G, T
3.2 billion base pairs for the human genome
3.2 billion x 2 bits = 6.4 billion bits; 6.4 billion bits / 8 = 800 Mbytes
So why do we need compression?
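
The 800 Mbytes figure above can be reproduced with a few lines of Python. This is a minimal sketch: the `pack_2bit` helper and its A=0, C=1, G=2, T=3 code assignment are illustrative assumptions, not something the slides specify.

```python
# Back-of-the-envelope size of one human genome stored at 2 bits per base.
BASES = "ACGT"                    # 4 symbols, so 2 bits are enough per base
GENOME_LENGTH = 3_200_000_000     # 3.2 billion base pairs, as on the slide

bits_per_base = (len(BASES) - 1).bit_length()   # 2
total_bits = GENOME_LENGTH * bits_per_base      # 6.4 billion bits
total_bytes = total_bits // 8                   # 800 million bytes


def pack_2bit(seq: str) -> bytes:
    """Pack a string over A/C/G/T into 4 bases per byte (hypothetical coding)."""
    code = {base: i for i, base in enumerate(BASES)}
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | code[base]
        # Left-align a short final chunk; the true length must be stored separately.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


print(total_bits, total_bytes)
print(pack_2bit("ACGT").hex())
```

The packing only works for an error-free, fully known sequence; the following slides explain why real sequencing output is far larger.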

Genome sequencing
How do we obtain a genome representation from biological samples?
Current technology provides random fragments of genome data called reads.
This process is called genome sequencing.

Short and long reads
Different sequencing technologies provide:
Read lengths from about 100 to 20,000+ bases
Single or coupled (paired) reads
Different accuracy levels (from 60% to 99.9%)
Shorter reads are more accurate (up to 99.9%) and are produced in much larger volumes (currently up to 600 billion bases per single run)

From sequencing to assembly
[Figure: pipeline stages: sequencing, alignment / mapping]

[Figure labels: reference genome, mapped reads, read errors, genomic variants]

Coverage
Required for clinical usage: 150+ layers
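
To give a feel for what 150 layers means, the sketch below uses the usual definition of mean coverage (total sequenced bases divided by genome length). The read length of 100 bases is an illustrative assumption, and the function names are hypothetical.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering each genome position."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Smallest number of reads whose total length reaches the target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


GENOME_LENGTH = 3_200_000_000   # human genome size used in these slides
READ_LENGTH = 100               # assumed short-read length

n = reads_needed(150, READ_LENGTH, GENOME_LENGTH)
print(n, mean_coverage(n, READ_LENGTH, GENOME_LENGTH))
```

This is an average only: reads land at random positions, so some regions end up with fewer layers than the mean, which is one reason to sequence more than the strict minimum.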

All reads must be preserved
Is this reading noise or a mutation?

Data volumes
3.2 billion bases per genome x 200+ for clinical usage (to get at least 150 layers)
Additional information:
1 quality score per base (up to 96 possible levels)

Read quality scores
Each base call in a read has a level of confidence.
The level of confidence is expressed as a quality score, in a machine-dependent range, represented as an ASCII character.
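
As a concrete illustration, the sketch below decodes quality characters under the widely used Phred+33 convention, where the score is the ASCII code minus 33 and a score Q corresponds to an error probability of 10^(-Q/10). The slides only say the range is machine dependent, so the offset of 33 is an assumption, and the function names are hypothetical.

```python
def phred33_scores(quality: str) -> list[int]:
    """Decode a quality string assuming the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# '!' is the lowest score, 'I' a high one, '#' a very poor one.
for ch, q in zip("!I#", phred33_scores("!I#")):
    print(ch, q, error_probability(q))
```

Under this convention the long runs of `#` in the paired-read example later in the deck mean a score of 2, that is, base calls that are more likely wrong than right.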

Data volumes
3.2 billion bases per genome x 200+ for clinical usage (to get at least 150 layers)
Additional information:
1 quality score (ASCII char) per base
pairing information for coupled reads (labelling)

Paired reads
[Figure: read (200 bp), 500-1000 bp, read (200 bp)]

Data volumes
3.2 billion bases per genome x 200+ for clinical usage (to get at least 150 layers)
Additional information:
1 quality score per base (up to 96 possible levels)
pairing information for coupled reads (labelling)
Total = 3.2 GB x 200 x 2 x labelling ~= 1.5 TB
Labelling ~= 1.15
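
The total above can be checked numerically. In this sketch the factor 2 is read as one byte for the base character plus one byte for its quality character; that reading is an assumption, since the slide does not spell it out.

```python
GENOME_BASES = 3.2e9        # bases in one human genome
SEQUENCING_FACTOR = 200     # bases sequenced per genome position
BYTES_PER_BASE = 2          # assumed: 1 byte for the base + 1 byte for its quality
LABELLING = 1.15            # overhead for read headers / pairing information

total_bytes = GENOME_BASES * SEQUENCING_FACTOR * BYTES_PER_BASE * LABELLING
total_tb = total_bytes / 1e12

print(f"{total_tb:.3f} TB")
```

The result is about 1.47 TB, which the slide rounds to 1.5 TB.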

Raw data format (FASTQ vs FASTA)
Header (unique ID plus other information). Only the first character is standard.
  FASTQ: @HWUSI-EAS100R:6:73:941:1973#0/1
  FASTA: >HWUSI-EAS100R:6:73:941:1973#0/1
Nucleotide sequence
  FASTQ: GATTTGGGGT..
  FASTA: GATTTGGGGT
Optional description. Only the first character is standard. This is becoming obsolete.
  FASTQ: +SRR001666.1 071112_SLXA-EAS1_s_7
  FASTA: Not present
Quality scores
  FASTQ: !''*((((***+)
  FASTA: Not present
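
The four-line record structure can be made concrete with a small parser. This is a minimal sketch that assumes each record occupies exactly four lines (no wrapped sequences); `FastqRecord`, `parse_fastq` and `to_fasta` are hypothetical names, and the example record is the one from the slide, trimmed to ten bases.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str        # text after '@'
    sequence: str      # nucleotides
    description: str   # text after '+', usually empty
    quality: str       # one ASCII character per nucleotide


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        plus = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or plus is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, plus[1:], quality)


def to_fasta(record: FastqRecord) -> str:
    """Drop description and quality, keeping only header and sequence."""
    return f">{record.header}\n{record.sequence}\n"


example = "@HWUSI-EAS100R:6:73:941:1973#0/1\nGATTTGGGGT\n+\n!''*((((**\n"
records = list(parse_fastq(example.splitlines()))
print(to_fasta(records[0]), end="")
```

Converting to FASTA discards the quality line, which is why FASTA alone cannot serve as a raw-data format for sequencing output.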

Example
One read:
HS2000-1240_45:1:1234:6966:12500
AAATATTTTTTAAAATTAGCCAGGTGTGGTGGTGTGTGCCTATAGTTCCAACTAGTTGTAAAGCTGAAACATAAGGACCACTTGGGTACAGGAGTTCCAA
+
CCCFFFFFHHHHHIJIIJIIJIJJFHGGIFHHHIIJJJIJIHIJJFIJJIIJJJJHIIIIGIIIIIIFHGGGHFFFFF?BBEECDD?CCCECDD?AC;5@

Example
One read pair:
HS2000-1240_45:1:1234:6966:12500/1
AAATATTTTTTAAAATTAGCCAGGTGTGGTGGTGTGTGCCTATAGTTCCAACTAGTTGTAAAGCTGAAACATAAGGACCACTTGGGTACAGGAGTTCCAA
+
CCCFFFFFHHHHHIJIIJIIJIJJFHGGIFHHHIIJJJIJIHIJJFIJJIIJJJJHIIIIGIIIIIIFHGGGHFFFFF?BBEECDD?CCCECDD?AC;5@
HS2000-1240_45:1:1234:6966:12500/2
ATCAGATGTATAATTTGCAAATAGTTTCTCTCATTCTTTTTTTTTTTTTTTTAGACAGGGTCTCACTGTATTGCCCAGGCTGGAGTGCAGTGGTGCAATC
+
CCCFFFFFHHHHHJJJJJJJJIIJJJJIJJJJIJIJJJIJJJJJJJJHFDD@################################################

Example
A read aligned onto a section of chromosome 1 of the reference genome.
Chr1: 39999220 39999240 39999260 39999280 39999300 39999320 39999340
Reference: AATGGTGCAGTCACAGCTGTCTACAAAAATATTTTTTAAAATTAGCCAGGTGTGGTGGTGTGTGCCTATAGTTCCAACTAGTTGTAAAGCTGAAACATAAGGACCACTTGGGTACAGGAGTTCCAAGACTGTGGTGAG
Read:      ...AAATATTTTTTAAAATTAGCCAGGTGTGGTGGTGTGTGCCTATAGTTCCAACTAGTTGTAAAGCTGAAACATAAGGACCACTTGGGTACAGGAGTTCCAA...
A second read, known to be associated with the first read.
Chr1: 39999340 39999360 39999380 39999400 39999420 39999440 39999460
Reference: GTGGTGAGCTGTGATTGCACCACTGCACTCCAGCCTGGGCAATACAGTGAGACCCTGTCTAAAAAAAAAAAAAAAAGAATGAGAGAAACTATTTGCAAATTATACATCTGATAAGAGACCTGTATCTA
Read:      ...GTTTTGCCCCCCTCACCCCAGCCTGGGGAATAAAGGGGGGCCCTTTCTAAAAAAAAAAAAAAAAGAATGAGAGAAACTATTTGCAAATTATACATCTGAT....
Original sequence of the second read:
HS2000-1240_45:1:1234:6966:12500/2
ATCAGATGTATAATTTGCAAATAGTTTCTCTCATTCTTTTTTTTTTTTTTTTAGACAGGGTCTCACTGTATTGCCCAGGCTGGAGTGCAGTGGTGCAATC
Note that the original sequence of the second read had to be reverse complemented to match the genome.
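
The reverse-complement step can be checked directly. The sketch below takes only excerpts from the slide (the first 36 bases of the /2 read and the tail of the second reference row) to keep the strings short; `reverse_complement` is a hypothetical helper name.

```python
# Translation table for complementary bases; N (unknown) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


# First 36 bases of the /2 read as stored in the FASTQ file.
read2_start = "ATCAGATGTATAATTTGCAAATAGTTTCTCTCATTC"
# Tail of the reference row shown for the second read.
reference_tail = "AAGAATGAGAGAAACTATTTGCAAATTATACATCTGATAAGAGACCTGTAT"

rc = reverse_complement(read2_start)
print(rc)
print(rc in reference_tail)          # the reverse complement maps onto the reference
print(read2_start in reference_tail) # the read as stored does not
```

The start of the read as stored corresponds to the end of its aligned position, which is why the match shows up at the right-hand side of the reference row.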

Example
A read aligned onto a section of chromosome 1 of the reference genome.
Chr1: 39999220 39999240 39999260 39999280 39999300 39999320 39999340
Reference: AATGGTGCAGTCACAGCTGTCTACAAAAATATTTTTTAAAATTAGCCAGGTGTGGTGGTGTGTGCCTATAGTTCCAACTAGTTGTAAAGCTGAAACATAAGGACCACTTGGGTACAGGAGTTCCAAGACTGTGGTGAG
Read:      ...>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>...
A second read, known to be associated with the first read.
Chr1: 39999340 39999360 39999380 39999400 39999420 39999440 39999460
Reference: GTGGTGAGCTGTGATTGCACCACTGCACTCCAGCCTGGGCAATACAGTGAGACCCTGTCTAAAAAAAAAAAAAAAAGAATGAGAGAAACTATTTGCAAATTATACATCTGATAAGAGACCTGTATCTA
Read:      ...<T<<TGC<<C<CT<<<C<<<<<<<<<<G<<<<A<<G<G<G<<<<T<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<......
Note that the sequence obtained may not reflect exactly the genomic sequence.
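
The compact display used on this slide (a strand marker where the read agrees with the reference, the read's own base where it differs) can be sketched as follows. The function name is hypothetical, and the sketch assumes a gapless alignment of equal-length strings; the excerpt is the first ten aligned bases of the second read and the reference bases beneath them.

```python
def render_alignment(read: str, reference: str, forward: bool = True) -> str:
    """Show matches as '>' (forward) or '<' (reverse) and mismatches as the read base."""
    if len(read) != len(reference):
        raise ValueError("this sketch only handles gapless, equal-length alignments")
    marker = ">" if forward else "<"
    return "".join(marker if r == g else r for r, g in zip(read, reference))


aligned_read = "GTTTTGCCCC"   # start of the second read as aligned on the slide
reference = "GATTGCACCA"      # reference bases at the same positions

print(render_alignment(aligned_read, reference, forward=False))
```

Storing only the differences from the reference is also the basic idea behind reference-based compression of aligned reads.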

Example
A read aligned onto a section of chromosome 1 of the reference genome.
Chr1: 39999220 39999240 39999260 39999280 39999300 39999320 39999340
Reference: AATGGTGCAGTCACAGCTGTCTACAAAAATATTTTTTAAAATTAGCCAGGTGTGGTGGTGTGTGCCTATAGTTCCAACTAGTTGTAAAGCTGAAACATAAGGACCACTTGGGTACAGGAGTTCCAAGACTGTGGTGAG
Read:      ...>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>...
Quality:   CCCFFFFFHHHHHIJIIJIIJIJJFHGGIFHHHIIJJJIJIHIJJFIJJIIJJJJHIIIIGIIIIIIFHGGGHFFFFF?BBEECDD?CCCECDD?AC;5@
A second read, known to be associated with the first read.
Chr1: 39999340 39999360 39999380 39999400 39999420 39999440 39999460
Reference: GTGGTGAGCTGTGATTGCACCACTGCACTCCAGCCTGGGCAATACAGTGAGACCCTGTCTAAAAAAAAAAAAAAAAGAATGAGAGAAACTATTTGCAAATTATACATCTGATAAGAGACCTGTATCTA
Read:      ...<T<<TGC<<C<CT<<<C<<<<<<<<<<G<<<<A<<G<G<G<<<<T<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<......
Quality:   ################################################@DDFHJJJJJJJJIJJJIJIJJJJIJJJJIIJJJJJJJJHHHHHFFFFFCCC
Here, the quality associated with each nucleotide of the sequence obtained has been flagged as poor in the region where it does not match the reference genome well.

Demo on FastQ

Genome processing pipeline

FastQ compression today
Gzip of the entire text file (sometimes split into several files)
Compression ratio: 3 to 5
Depending on the coverage, 1 genome can take up to 2 TB
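
The whole-file gzip approach can be tried on a small scale with Python's standard `gzip` module. The sketch below generates synthetic FASTQ text with random bases and a narrow range of quality characters; the data are made up for illustration, so the measured ratio says nothing precise about real sequencing runs.

```python
import gzip
import random


def synthetic_fastq(num_reads: int, read_length: int, seed: int = 0) -> bytes:
    """Build fake FASTQ text: random bases, qualities drawn from a few high scores."""
    rng = random.Random(seed)
    records = []
    for i in range(num_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_length))
        qual = "".join(rng.choice("FGHIJ") for _ in range(read_length))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def gzip_ratio(data: bytes) -> float:
    """Uncompressed size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


data = synthetic_fastq(num_reads=200, read_length=100)
print(f"{len(data)} bytes, ratio {gzip_ratio(data):.2f}")
```

A general-purpose compressor treats headers, bases and qualities as one undifferentiated text stream, which is the limitation that specialised FastQ tools address.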

From sequencing to assembly
[Figure: sequencing produces FastQ; alignment / mapping produces SAM (Sequence Alignment/Map)]

Alignment / mapping
Raw data + alignment information

[Figure: alignment inputs and SAM contents. Labels: position index, reference genome, FastQ reads, FastQ headers, paired reads, positions, indels, base sequences, SAM, FastQ headers, quality scores if present]
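
To show how this information is laid out in practice, the sketch below splits one SAM alignment line into its eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The example line is hypothetical: it borrows the read name from the earlier slides, but the positions, flag and shortened sequence are invented for illustration.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name (the FastQ header)
    flag: int    # bitwise flags (paired, reverse strand, ...)
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # matches, insertions and deletions
    rnext: str   # reference name of the mate ('=' means same as rname)
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # base sequence
    qual: str    # quality scores


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory fields of one alignment line; optional tags are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    q, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(q, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual)


def is_paired(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x1)


def is_reverse_strand(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x10)


# Hypothetical line: values are illustrative, not taken from the MPEG dataset.
line = ("HS2000-1240_45:1:1234:6966:12500\t99\tchr1\t39999246\t60\t10M\t=\t"
        "39999362\t216\tAAATATTTTT\tCCCFFFFFHH")
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar, is_paired(rec), is_reverse_strand(rec))
```

Because every line repeats the header, sequence and quality string from the FastQ file and adds alignment fields, textual SAM is larger than the FastQ it was derived from.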

Compressed SAM = BAM
BAM = block-based zipped SAM
Indexable for random access
Compression ratio over textual SAM ~= 4 to 6
Example (coverage 200x):
Fastq = 1.5 TB
Fastq.gz = 370 GB
Fastq.bzip = 255 GB
quip = 205 GB (best compression tool for FastQ)
SAM = ~3 TB
BAM = 500 GB
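
Why block-based compression allows random access can be sketched with the standard `zlib` module: the data are cut into fixed-size blocks that are compressed independently, so reading a byte range only requires decompressing the blocks it touches. This is a simplified illustration of the principle, not the actual block layout or index format used by BAM; all names and the block size are assumptions.

```python
import zlib

BLOCK_SIZE = 64 * 1024   # uncompressed bytes per block (illustrative choice)


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Compress each fixed-size slice of the data independently."""
    return [zlib.compress(data[start:start + block_size])
            for start in range(0, len(data), block_size)]


def read_range(blocks: list[bytes], offset: int, length: int,
               block_size: int = BLOCK_SIZE) -> bytes:
    """Return data[offset:offset+length], decompressing only the blocks needed."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    chunk = b"".join(zlib.decompress(b) for b in blocks[first:last + 1])
    start = offset - first * block_size
    return chunk[start:start + length]


data = bytes(range(256)) * 1000            # 256,000 bytes of test data
blocks = compress_blocks(data, block_size=1000)
piece = read_range(blocks, offset=123_456, length=5_000, block_size=1000)
print(len(blocks), piece == data[123_456:128_456])
```

The price of random access is a somewhat lower compression ratio than compressing the whole file as one stream, since each block starts without any shared history.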

SAM/BAM view demo
Human sample from the MPEG dataset: /human/illumina/lowcoverage/na21144.chrom11
Chromosome 11
Read length: 100 bases
No. of reads: ~10.1 million
Unaligned data: 2.2 GB
Aligned SAM: 4.5 GB
Compressed BAM: 1 GB

Raw sequence data + aligned data
Current focus

Thank you