Alignment. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014


From reads to molecules

Why align? To measure variation.
Individual A / Individual B
ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGGTGTTATAATCGCCGCATGACATGATCAATGG
                         CAATAAAAGTGCCGTATCATGCTGGTGTTACAATCGCCGCA
                                     CGTATCATGCTGGTGTTACAATCGCCGCATGACATGATCAATGG
                 TGTCTGCTCAATAAAAGTGCCGTATCATGCTGGTGTTACAATC
        ATCGTCGGGTGTCTGCTCAATAAAAGTGCCGTATCATG--GGTGTTATAA
                       CTCAATAAGAGTGCCGTATCATG--GGTGTTATAATCGCCGCA
                                                   GTTATAATCGCCGCATGACATGATCAATGG
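To make "measuring variation" concrete, here is a minimal sketch that compares already-placed reads against the long sequence from the slide and lists every position where they disagree. The offsets (25 and 23) were worked out by hand from the sequences above and are not given on the slide. The function name `differences` is hypothetical, and a real aligner would of course have to find the offsets itself.

```python
# Toy illustration only: reads are assumed to be already placed on the
# reference at a known 0-based offset, with '-' marking a deleted base.
REFERENCE = (
    "ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGG"
    "TGTTATAATCGCCGCATGACATGATCAATGG"
)


def differences(reference, read, offset):
    """Return (position, reference_base, read_base) for every disagreement."""
    return [
        (offset + i, reference[offset + i], base)
        for i, base in enumerate(read)
        if base != reference[offset + i]
    ]


# Two of the reads from the slide, with hand-derived offsets (assumption).
read_a = "CAATAAAAGTGCCGTATCATGCTGGTGTTACAATCGCCGCA"
read_b = "CTCAATAAGAGTGCCGTATCATG--GGTGTTATAATCGCCGCA"

snps_a = differences(REFERENCE, read_a, 25)
snps_b = differences(REFERENCE, read_b, 23)
```

Positions reported by several independent reads are candidate variants; positions reported by a single read are more likely to be sequencing errors.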


Short Read Aligners: choices...
Fall '12 - Apr '13: ... now 150-180 Gbp / day!*
* http://www.illumina.com/systems/hiseq_2500_1500/performance_specifications.ilmn

Burrows-Wheeler Aligners
The Burrows-Wheeler Transform (BWT) is the transform used in the bzip2 file compression tool. The FM-index (Ferragina & Manzini) allows efficient finding of substring matches within the compressed text. The algorithm is sub-linear with respect to the time and storage space required for a given set of input data (the reference 'ome, essentially). The result is a reduced memory footprint and faster execution.
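As a rough illustration of how an FM-index finds exact substring matches, the sketch below builds the BWT of a tiny reference via a naively sorted suffix array and then runs the standard backward search. This is a teaching toy, not how BWA or Bowtie are implemented: real aligners build the index far more efficiently, sample the occurrence table, and allow mismatches and gaps. The function names are hypothetical.

```python
def build_fm_index(text):
    """Build a suffix array, C table and Occ table for text (toy, O(n^2 log n))."""
    text += "$"  # sentinel, sorts before A/C/G/T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)  # char preceding each suffix

    # C[c] = number of characters in the text that sort before c
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return suffix_array, c_table, occ


def find(pattern, suffix_array, c_table, occ):
    """Backward search: return sorted 0-based start positions of exact matches."""
    lo, hi = 0, len(suffix_array)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))


index = build_fm_index("GATTACAGATTACA")
hits = find("TTAC", *index)
```

Each step of the search narrows a range of suffix-array rows using only table lookups, so the work per query depends on the read length rather than on the size of the reference.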

BWA
BWA is fast, and can do gapped alignments. When run without seeding, it will find all hits within a given edit distance. The long read aligner is also fast, and can perform well for 454, Ion Torrent, Sanger, and PacBio reads. BWA is actively developed and has a strong user / developer community.
bio-bwa.sourceforge.net
Short reads (under 200 bp): Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]
Long reads (over 200 bp), chimeric alignments built in: Li H. and Durbin R. (2010) Fast and accurate long read alignment with Burrows-Wheeler Transform. Bioinformatics, 26:589-95. [PMID: 20080505]
Don't forget to join the mailing groups!

Bowtie
Bowtie (now Bowtie 2) is probably faster than BWA for some types of alignment, but it may not find the best alignments (see discussions of sensitivity and accuracy on SEQanswers.com). Bowtie is part of a suite of tools (Bowtie, TopHat, Cufflinks, CummeRbund) that address RNA-seq experiments.
http://bowtie-bio.sourceforge.net
Langmead B., Trapnell C., Pop M., and Salzberg S.L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25 [PMID: 19261174]
Don't forget to join the mailing groups!

Alignment concepts / parameters
Paired-End reads
Mate-Paired reads
454 "Paired-End" reads
Single End Construct


File Format: SAM / BAM / CRAM (new!)
http://samtools.sourceforge.net/ - deprecated!
http://www.htslib.org/ - SAMtools 1.0 and up
Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
See also: the SAM specification (currently v1; the renumbered v1 comes after the old v1.4), the samtools man page, example workflow(s), and the mailing list!


File Format: SAM
SAM Format Specification v1.4-r985
Fields 7 and 8 (RNEXT, PNEXT) were formerly MRNM, MPOS (mate reference name, mate position).
Field 9 (TLEN) was formerly ISIZE ("insert" size).
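To show where fields 7 to 9 sit among the eleven mandatory SAM columns, here is a small sketch that splits one tab-delimited alignment line into named fields. The example read is invented for illustration and the function name `parse_sam_line` is hypothetical; for real work a library such as pysam or the samtools command line is the usual choice.

```python
# The 11 mandatory SAM columns, in order (current field names).
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of mandatory fields plus tags."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    # Bits 0x1 and 0x2 of FLAG mean "paired" and "properly paired".
    record["is_paired"] = bool(record["FLAG"] & 0x1)
    record["is_proper_pair"] = bool(record["FLAG"] & 0x2)
    return record


# Invented example: a 10 bp read whose mate maps 150 bp downstream.
example = "read1\t99\tchr1\t100\t60\t10M\t=\t250\t160\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
```

An RNEXT of "=" means the mate maps to the same reference sequence as the read itself.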

File Format: SAM
Further reading: google "Heng Li slides" - Challenges and Solutions in the Analysis of Next Generation Sequencing Data (2010)

File Format: BAM
BAMs are compressed SAMs (so they are binary, not human-readable text; don't look directly at them!). They can be indexed to allow rapid extraction of information, so alignment viewers do not need to uncompress the whole BAM file in order to look at information for a particular read or coordinate range somewhere in the file. Indexing your BAM file, mycoolbamfile.bam, will create an index file, mycoolbamfile.bam.bai, which is needed (in addition to the BAM file) by viewers and other downstream tools. An occasional downstream tool will require an index called mycoolbamfile.bai (notice that the .bai replaces the .bam, instead of being appended after it).
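Because the two index naming conventions are easy to mix up, here is a small sketch that lists the index file names a tool might look for next to an alignment file. The helper name `index_candidates` is hypothetical; it only builds names and does not create or check any index.

```python
from pathlib import Path


def index_candidates(alignment_path):
    """Return the index file names a tool may expect for a BAM or CRAM file."""
    path = Path(alignment_path)
    if path.suffix == ".bam":
        return [
            str(path.with_name(path.name + ".bai")),  # appended: x.bam.bai
            str(path.with_suffix(".bai")),            # replaced: x.bai
        ]
    if path.suffix == ".cram":
        return [str(path.with_name(path.name + ".crai"))]  # x.cram.crai
    raise ValueError(f"not a BAM or CRAM file: {alignment_path}")


bam_indexes = index_candidates("mycoolbamfile.bam")
```

If a downstream tool insists on the second form, copying or symlinking the existing index under the other name is usually enough.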

File Format: CRAM
CRAM is available as of SAMtools 1.0, and is a binary format like BAM. It uses data-specific compression tools (i.e. compressing letters is different from compressing numbers), specifically reference-based compression (e.g. for aligned reads, only mismatching bases need to be stored). It can also employ lossy compression of base qualities, which appears to have a negligible effect on, say, variant calling (see the Illumina white paper). Indexing your CRAM file, mycoolbamfile.cram, will create an index file, mycoolbamfile.cram.crai, which is needed (in addition to the CRAM file) by viewers and other downstream tools. This is a very recent development, so it may be a while before tools are CRAM-capable.
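The idea behind reference-based compression can be sketched in a few lines: instead of storing every base of an aligned read, store where it aligns, how long it is, and only the bases that differ from the reference. This is a toy model under simplifying assumptions (no indels, no qualities, no entropy coding) and not the actual CRAM codec; the function names are hypothetical.

```python
def encode_read(reference, read, position):
    """Store a read as (position, length, [(offset, base), ...]) mismatches only."""
    mismatches = [
        (i, base)
        for i, base in enumerate(read)
        if reference[position + i] != base
    ]
    return (position, len(read), mismatches)


def decode_read(reference, record):
    """Rebuild the read from the reference plus the stored mismatches."""
    position, length, mismatches = record
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGT"
read = "GTACCTAC"  # aligns at 0-based position 2 with one mismatch
compressed = encode_read(reference, read, 2)
restored = decode_read(reference, compressed)
```

A read that matches the reference perfectly costs only a position and a length, which is why decoding requires access to exactly the same reference sequence.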

Alignment Viewers
IGV (Integrative Genomics Viewer) www.broadinstitute.org/igv/
BAMview, tview (in SAMtools), IGB, GenomeView, SAMscope...
UCSC Genome Browser, GBrowse

IGV
Red box: indicates the region of the reference in view below.
Coverage track: read coverage depth plot.
Read alignments (various view styles; squished shown here): read positions, orientations, pairing, sequence that disagrees with the reference highlighted, improper pairs highlighted, etc.
Annotation tracks (GTF, BED, etc.)

IGV
Colored bases: where reads disagree with the reference (substitution, indel, etc.).
Improper pairs: mate aligns far away, in the wrong orientation, or on another chromosome.
Reference sequence, reading frames, etc.

Variant Calling - VCF format
One main application of read alignment, a.k.a. "resequencing" or SNP / indel discovery. VCF (Variant Call Format) is now the standard format for variant reporting.
http://vcftools.sourceforge.net/specs.html ... VCF poster

Variant Call Format
##fileformat=VCFv4.1
##fileDate=20130825
##source=freebayes v9.9.2-9-gfbf46fc-dirty
##reference=../results/8/8.fa
##phasing=none
##commandline="../tools/freebayes/bin/freebayes -f ../results/8/8.fa --min-alternate-fraction 0.03 --min-mapping-quality 20 --min-base-quality 20 --ploidy 1 --pooled-continuous --use-best-n-alleles 4 --use-mapping-quality --min-alternate-fraction 0.04 --min-alternate-count 1 ../results/8/8.bam"
##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB1 26 . TGTTACGCG GCTTTTGC,TGTTTCTAC 27.2619 . AO=1,2;RO=0;TYPE=complex,complex GT:DP:RO:QR:AO:QA:GL 2:3:0:0:1,2:31,70:-4.46,-1.65,0
8_PB1 38 . TCA ACG,TA,AGA 0.0495692 . AO=1,1,1;RO=3;TYPE=complex,del,mnp GT:DP:RO:QR:AO:QA:GL 2:6:3:101:1,1,1:31,37,34:0,-4.556,-4.004,-4.28
8_PB1 42 . G A 3.94171e-14 . AO=8;RO=128;TYPE=snp GT:DP:RO:QR:AO:QA:GL


Variant Call Format
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB2 407 . A G 3935.83 . AO=149;RO=21;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0
CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL = 3935.83
FILTER = .
INFO = AO=149;RO=21;TYPE=snp
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0
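The field-by-field breakdown above can be reproduced with a short parser. The sketch below splits one VCF data line into the fixed columns, turns INFO into a dictionary, and pairs the FORMAT keys with the values in the sample column (the sample here is named "8"). The function name `parse_vcf_record` is hypothetical and the parser ignores many details of the specification; libraries such as pysam or cyvcf2 handle real files more robustly.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_record(line, sample_names):
    """Parse one VCF data line into a dict (simplified, whitespace-delimited)."""
    columns = line.split()
    record = dict(zip(VCF_COLUMNS, columns[:9]))
    record["POS"] = int(record["POS"])
    record["QUAL"] = float(record["QUAL"])
    record["ALT"] = record["ALT"].split(",")
    # INFO is a ';'-separated list of KEY=VALUE pairs (or bare flags).
    info = {}
    for item in record["INFO"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True
    record["INFO"] = info
    # FORMAT names the ':'-separated values in each sample column.
    keys = record["FORMAT"].split(":")
    record["samples"] = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, columns[9:])
    }
    return record


line = "8_PB2 407 . A G 3935.83 . AO=149;RO=21;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0"
record = parse_vcf_record(line, ["8"])
sample = record["samples"]["8"]
alt_fraction = int(sample["AO"]) / (int(sample["AO"]) + int(sample["RO"]))
```

Here 149 of the 170 reads support the alternate allele G, which is consistent with the high QUAL value for this site.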


Variant Call Format
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB2 407 . A G 3935.83 . AO=149;RO=21;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL = 3935.83
FILTER = .
INFO = AO=149;RO=21;TYPE=snp
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0

Variant Call Format
##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
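In these header lines, Number=1 means a single value per record, while Number=A means one value per alternate allele, in the same order as the ALT column. The sketch below uses the multi-allelic record at position 38 from the earlier example to pair each ALT allele with its own AO and TYPE; the function name `per_allele_info` is hypothetical.

```python
def per_allele_info(alt, info):
    """Pair each ALT allele with its Number=A values (AO, TYPE) and the shared RO."""
    alleles = alt.split(",")
    fields = dict(item.split("=", 1) for item in info.split(";"))
    counts = fields["AO"].split(",")
    types = fields["TYPE"].split(",")
    return [
        {
            "ALT": allele,
            "AO": int(counts[i]),       # Number=A: one value per ALT allele
            "TYPE": types[i],           # Number=A: one value per ALT allele
            "RO": int(fields["RO"]),    # Number=1: shared by the whole record
        }
        for i, allele in enumerate(alleles)
    ]


rows = per_allele_info("ACG,TA,AGA", "AO=1,1,1;RO=3;TYPE=complex,del,mnp")
```

Splitting records this way makes it straightforward to filter individual alleles, for example dropping those supported by a single read.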

Variant Call Format
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
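To give a feel for the log10 and Phred scales used by GL and GQ, the sketch below converts a list of GL values into normalized genotype probabilities and expresses the chance that the best genotype is wrong as a Phred score. It assumes a flat prior over genotypes, which is a simplification: the GQ that a caller such as freebayes reports also depends on its own priors, so the numbers will generally not match. The function names are hypothetical.

```python
import math


def genotype_probabilities(gl_values):
    """Normalize log10 genotype likelihoods into probabilities (flat prior assumed)."""
    best = max(gl_values)
    weights = [10 ** (gl - best) for gl in gl_values]  # rescale to avoid underflow
    total = sum(weights)
    return [w / total for w in weights]


def phred(error_probability):
    """Phred scale: Q = -10 * log10(P(error))."""
    return -10 * math.log10(error_probability)


# GL for the haploid SNP example: genotype 0 (REF) and genotype 1 (ALT).
probabilities = genotype_probabilities([-5.0, 0.0])
best_index = probabilities.index(max(probabilities))
error = sum(p for i, p in enumerate(probabilities) if i != best_index)
quality = phred(error)
```

On this scale a quality of 20 corresponds to a 1 in 100 chance of error and 30 to 1 in 1000.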

Variant Call Format
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">

Variant Effect Prediction
SnpEff
Variant Effect Predictor (EMBL)
SIFT

VCF after Effect Prediction
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB2 407 . A G 3935.83 . AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaa/gag|E123|759|PB2||CODING|Tr_PB2|1|1) GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0
CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL = 3935.83
FILTER = .
INFO = AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaa/gag|E123|759|PB2||CODING|Tr_PB2|1|1)
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0

VCF after Effect Prediction
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change | Amino_Acid_length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon | GenotypeNum [ | ERRORS | WARNINGS ] )' ">
INFO = AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaa/gag|E123|759|PB2||CODING|Tr_PB2|1|1)
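The EFF annotation packs several sub-fields into one INFO value, so it usually needs a second parsing step. The sketch below assumes the pipe-delimited layout that SnpEff normally writes, with sub-field names taken from the header description above. Because the example value has ten entries for eleven names, it further assumes that the Transcript_BioType entry is empty; treat both points as assumptions. The function name `parse_eff` is hypothetical.

```python
import re

# Sub-field names, in the order listed in the ##INFO=<ID=EFF,...> description.
EFF_SUBFIELDS = [
    "Effect_Impact", "Functional_Class", "Codon_Change", "Amino_Acid_change",
    "Amino_Acid_length", "Gene_Name", "Transcript_BioType", "Gene_Coding",
    "Transcript_ID", "Exon", "GenotypeNum",
]


def parse_eff(value):
    """Split 'Effect(a|b|...)' into a dict keyed by the header's sub-field names."""
    match = re.fullmatch(r"(\w+)\((.*)\)", value)
    if match is None:
        raise ValueError(f"unexpected EFF value: {value!r}")
    effect, body = match.groups()
    parsed = {"Effect": effect}
    parsed.update(zip(EFF_SUBFIELDS, body.split("|")))
    return parsed


# Assumed pipe-delimited form of the annotation shown on the slide.
eff = parse_eff("SYNONYMOUS_CODING(LOW|SILENT|gaa/gag|E123|759|PB2||CODING|Tr_PB2|1|1)")
```

For this variant the result says that the A to G change at position 407 leaves glutamate 123 of PB2 unchanged (gaa and gag both encode E), hence the LOW impact.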