Introduction to Variant Analysis with NGS data


1 Introduction to Variant Analysis with NGS data
Lecture by: Dr. Christian Rausch
Date: 3 November 2014
Lecture series: Tumor Biology and Clinical Behavior
Study program: VUmc Master of Oncology

About Christian Rausch
2001 M.Sc. in Biotechnology, ESBS, Université Louis Pasteur, Strasbourg
2007 Ph.D. in Bioinformatics, Universität Tübingen
Postdoc Protein Engineering, Technische Universität München
Associate Scientist, DSM, Delft
since Senior Bioinformatician and Coordinator CCA Drylab, Amsterdam

2 CCA Drylab: the bioinformatics hub at the VUmc Cancer Center Amsterdam
Wim van Criekinge, Mark van de Wiel, Daphne v. Beek, Henk Verheul, Jeroen Beliën, Erik Sistermans, Connie Jimenez, Frits Peters, Josephine Dorsman, Victor van Beusechem, Jacqueline Cloos, Elisa Giovannetti, Michiel Pegtel, Remond Fijneman, Danijela Koppers-Lalic, Bauke Ylstra, Daniëlle Heideman, Ruud Brakenhoff, Renske Steenbergen, Daphne de Jong

Focus of lecture and practical part
Lecture: from NGS data to variant analysis
Hands-on training: we will analyze NGS read data of a panel of cancer genes (Illumina TruSeq Amplicon - Cancer Panel) from the colon cancer cell line CaCo-2. The analysis software tools will be run interactively through Galaxy, a web-based platform for data-intensive biomedical research.
Image source: A survey of tools for variant analysis of next-generation genome sequencing data, Pabinger et al., Brief Bioinform (2013) doi: /bib/bbs086

3 Recap sequencing (Illumina technology) slide 1
Image source: openwetware.org/images/7/7a/doe_jgi_illumina_hiseq_handout.pdf

Recap sequencing (Illumina technology) slide 2
Image source: openwetware.org/images/7/7a/doe_jgi_illumina_hiseq_handout.pdf

4 Recap sequencing (Illumina technology) slide 3
Image source: openwetware.org/images/7/7a/doe_jgi_illumina_hiseq_handout.pdf

Recap sequencing (Illumina technology) slide 4
Image source: illumina.com
To be able to sequence paired reads, a second bridge amplification followed by a flip of the template is required.

5 Comparison of Sequencing Platforms
Platform | Amplification method | Sequencing method | Detection method | Average read length
Illumina (HiSeq, MiSeq etc.) | bridge PCR | sequencing by synthesis | light | bp
Life Tech Ion Torrent / Proton | emulsion PCR | ion semiconductor sequencing | pH | bp
Roche 454 | emulsion PCR | pyrosequencing, cleavage of released pyrophosphate | light | 700 bp
Life Tech SOLiD | emulsion PCR | sequencing by ligation of hybridizing labeled oligos | light | 100 bp
Pacific Biosciences PacBio | no amplification, single-molecule sequencing | polymerase incorporating colored NTPs | light | kb
Oxford Nanopore MinION | no amplification, single-molecule nanopore sequencing | DNA molecule traverses pore | current | > 5.4 kb
Further reading, great lecture: Sequencing technology - Past, Present and Future, http://

Nanopore Sequencing
In development since 1995
Company: Oxford Nanopore
First working development-stage devices (MinION) released to testing groups
Image source: John MacNeill, http://www2.technologyreview.com/article/427677/nanopore-sequencing

6 Selecting parts of the genome for sequencing
The Illumina TruSeq Amplicon - Cancer Panel uses multiplex PCR to amplify a selected part of the genome (a selection of the exons of 48 genes is targeted with 212 amplicons).

Properties of Reads (Illumina)
Typical read length: bp
Paired reads: insert size bp
Mate pairs: insert size several kbp
Depending on which Illumina platform is used, the read quality drops after 100, 150 or 200 bp
Errors in Illumina reads are typically substitution errors
Source: evomics.org/2014/01/alignment-methods/
Image source: Mate Pair v2 Sample Prep Guide For 2-5 kb Libraries

7 Quality Measure: Phred Score
Phred quality scores were originally developed by the base calling program Phred, used with Sanger sequencing data.
The Phred quality score Q is defined as a property that is logarithmically related to the base-calling error probability P:
$Q = -10 \log_{10} P$
Example: Phred score 30 = error rate $10^{-3}$ = 1 base in 1000 is expected to be wrong
Illumina's Q score = Phred score
The base calling programs that convert raw data to sequence data (the "base callers") need to be trained to give realistic quality values.

FASTQ format
Standard format to store sequencing reads with quality information ("Q" stands for quality). Each read takes four lines:
@ followed by identifiers and optional descriptions of the sequence
the actual DNA sequence
+ separator, optionally followed by a description
the quality values of the sequence (one character per nucleotide)
More info: see wikipedia FASTQ_format

FASTA format
Standard format to store sequence data (DNA and protein seq.)
>FASTA header, often contains unique identifiers and descriptions of the sequence
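The relation between Q and P, and the way a FASTQ quality line encodes one score per base, can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture material: the function names are made up for this example, and the quality offset of 33 (Sanger / Illumina 1.8+ encoding) is an assumption that must match the actual file.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P."""
    return -10 * math.log10(p)


def decode_qualities(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(ch) - offset for ch in quality_string]


def parse_fastq_record(lines, offset=33):
    """Parse the four lines of one FASTQ record into a dictionary."""
    header, sequence, separator, qualities = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality line differ in length")
    return {
        "id": header[1:],
        "sequence": sequence,
        "phred": decode_qualities(qualities, offset),
    }
```

For example, `decode_qualities("I5!")` gives the scores 40, 20 and 0, and `phred_to_error_prob(20)` gives an error probability of 1%.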

8 Quality control with FastQC
The quality of the reads needs to be checked before further analysis
The program FastQC is the de facto standard
Sequencing platform companies also provide their own tools for quality control

Quality control: FastQC
"Encoding" tells which set of characters stands for which Phred scores. There is also the encoding "Illumina 1.5" and others.
Other programs might not automatically recognize the encoding
In Galaxy, the encoding of a FASTQ file can be set via the pen symbol.
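Why the encoding matters can be illustrated with a small heuristic that guesses the quality offset from the characters seen in a file. This is only a sketch under stated assumptions: the function name is hypothetical, the cut-offs rely on the common conventions that Phred+33 encodings start at "!" (ASCII 33) and Phred+64 encodings (Illumina 1.3 to 1.5) start at "@" (ASCII 64), and the answer is a guess, not a proof.

```python
def guess_phred_offset(quality_strings):
    """Guess the quality offset (33 or 64) from observed quality characters.

    Heuristic only: characters below ASCII 59 cannot occur in the
    64-based encodings, so they point to offset 33. If every character
    is at or above ASCII 64, offset 64 is the more likely explanation.
    Returns None when the data do not allow a decision.
    Raises ValueError if no quality characters are given.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    if lowest < 59:
        return 33
    if lowest >= 64:
        return 64
    return None
```

A file made up only of very high quality Phred+33 reads would be misclassified by this rule, which is one reason tools let the user set the encoding explicitly.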

9 Examples of per base sequence quality
Not so good, might still be usable, depending on application
→ 50 bp ←
Historical example of the very first Solexa reads (Solexa was acquired by Illumina in 2007)

Examples of other quality measures in FastQC
Upper 4 graphs are from the data set of the practical course:
Many reads are repeated
Apparently not uniformly distributed over the whole genome
Overrepresented sequences: the sequenced fragment was too short and the sequencing reaction ran into the adapter/PCR primer

10 Mapping reads to the reference is finding where their sequence occurs in the genome
[Figure: a read pair with 100 bp identified at each end and unknown sequence in between]
Source: Wikimedia, File:Mapping Reads.png

Mapping reads to the reference: naïve text search algorithms are too slow
Naïve approach: compare each read with every position in the genome
This takes too long and will not find sequences with mismatches
Search programs typically create an index of the reference sequence (or text) and store the reference sequence (text) in an advanced data structure for fast searching.
An index is basically like a phone book (with addresses) → quickly find the address (location) of a person
Example of an algorithm using indexed seed tables to quickly find locations of exact parts of a read
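The idea of an indexed seed table can be sketched as follows: all k-mers of the reference are stored with their positions (the "phone book"), exact seeds taken from the read are looked up, and only the candidate locations found this way are checked base by base. This is a toy illustration, not the algorithm of any particular mapper; the function names, the seed length k and the mismatch limit are assumptions chosen for the example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (start, mismatches) for every location supported by a seed."""
    candidates = set()
    # Look up non-overlapping seeds of length k taken from the read.
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # where the whole read would begin
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    # Verify each candidate by counting mismatches over the full read.
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits
```

With the toy reference `"TTGACCAGTACGGATCCATG"` and k = 4, the read `"AGTACGCA"` is still placed at position 6 with one mismatch, because its first seed `"AGTA"` matches exactly even though the second seed does not.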

11 Mapping reads to the reference: frequently used programs
BLAST, the most famous bioinformatics program since 1990, is used to find similar sequences in DNA and protein databases
sec to find a result
Mapping 60 million reads would take ~2 months on one CPU → too slow for NGS
Popular tools for read mapping: Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsfast/mrfast: Hatem et al. BMC Bioinformatics 2013, 14:184 http://
CLCbio read mapper (commercial)
No tool is the best tool in all example conditions
Differences in speed
Differently optimized for mismatches / gap models / insertions & deletions / taking into account read base quality / local realignment of matches etc.

Read Mappers: BWA and Bowtie2
Are based on the Burrows-Wheeler Transformation (BWT)
BWT: special sorting of all letters in the text (sequence)
Similar suffixes (word ends) will be close to each other
Easier to compress
Good for approximate string matching (sequence alignment)
Index (FM index) for finding the locations of matched strings (sequences) in the genome
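How the BWT and an FM index locate a sequence can be shown on a toy scale. The sketch below builds the suffix array by plain sorting and keeps full occurrence tables, which real tools such as BWA and Bowtie2 avoid for memory reasons; it also searches exact matches only. All function names are made up for this illustration, and "$" is assumed not to occur in the text.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sa):
    """Burrows-Wheeler transform: the character preceding each sorted suffix."""
    return "".join(text[i - 1] for i in sa)  # i == 0 wraps around to "$"


def fm_index(bwt_string):
    """Return (first, occ) for backward search.

    first[c]  = number of characters in the text smaller than c
    occ[c][i] = number of occurrences of c in bwt_string[:i]
    """
    alphabet = sorted(set(bwt_string))
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt_string.count(c)
    occ = {c: [0] for c in alphabet}
    for ch in bwt_string:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return first, occ


def find_exact(pattern, text):
    """All start positions of pattern in text, found by backward search."""
    text = text + "$"  # end marker, sorts before all letters
    sa = suffix_array(text)
    first, occ = fm_index(bwt(text, sa))
    lo, hi = 0, len(text)  # current range of matching sorted suffixes
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))
```

For the classic example text `"banana"`, the transform of `"banana$"` is `"annb$aa"`, and `find_exact("ana", "banana")` returns positions 1 and 3. The runs of identical letters in the transform are what makes it easy to compress.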

12 Read Mapping: General problems
A read can match equally well at more than one location (e.g. repeats, pseudogenes)
A read may even fit less well to its actual position, e.g. if it carries a break point, insertions and/or deletions

Workflow: Mapping reads with BWA
Input reads (FASTQ files)
Quality check with FastQC
Not OK? Quality and adapter trimming
OK? Map reads to the reference genome using e.g. BWA or Bowtie2
Sort by coordinates using SAMtools sort
Output: sorted BAM file (binary SAM; SAM = Sequence Alignment/Map)
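Outside Galaxy, the same workflow could be scripted on the command line. The sketch below only assembles the commands, assuming paired-end FASTQ input, a reasonably recent SAMtools (1.3 or later, for `samtools sort -o`) and that `fastqc`, `bwa` and `samtools` are installed. The file names and the helper names are hypothetical, and the optional trimming step is left out.

```python
import subprocess


def mapping_commands(reference, reads_1, reads_2, prefix):
    """Return (argv, stdout_file) pairs for QC, mapping, sorting and indexing.

    stdout_file is None when the tool writes its own output files.
    """
    sam_path = f"{prefix}.sam"
    bam_path = f"{prefix}.sorted.bam"
    return [
        (["fastqc", reads_1, reads_2], None),                 # quality check
        (["bwa", "index", reference], None),                  # build the BWT index
        (["bwa", "mem", reference, reads_1, reads_2], sam_path),  # map, SAM on stdout
        (["samtools", "sort", "-o", bam_path, sam_path], None),   # sort by coordinate
        (["samtools", "index", bam_path], None),              # index the sorted BAM
    ]


def run_commands(commands):
    """Execute the commands in order, stopping at the first failure."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Keeping command construction separate from execution makes it possible to print and inspect the planned steps before anything is run.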

13 SAM and BAM files
SAM = Sequence Alignment/Map
BAM = binary SAM = compressed SAM
The Sequence Alignment/Map format contains information about how sequence reads map to a reference genome
Requires ~1 byte per input base to store sequences, qualities and meta information
Supports paired-end reads and color space from SOLiD
Is produced by Bowtie, BWA and other mapping tools
Partly from: genetics.stanford.edu/gene211/lectures/lecture3_resequencing_functional_genomics pdf
Example from: genetics.stanford.edu/gene211/lectures/lecture3_resequencing_functional_genomics pdf
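A SAM alignment line is tab-separated text with eleven mandatory fields followed by optional tags, so it can be read with very little code. The sketch below is an illustration only: the class and function names are made up here, and the example line in the test is invented data. For real work a library such as pysam is the usual choice.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM fields plus the optional tags."""
    qname: str   # query (read) name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description (matches, insertions, deletions)
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities, encoded as in FASTQ
    tags: Tuple[str, ...]


def parse_sam_line(line):
    """Parse one alignment line of a SAM file (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("@") or len(fields) < 11:
        raise ValueError("not a SAM alignment line")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )
```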

14 Harvesting Information from SAM
Query name: QNAME (SAM) / read_name (BAM)
FLAG provides the following information:
are there multiple fragments?
are all fragments properly aligned?
is this fragment unmapped?
is the next fragment unmapped?
is this query the reverse strand?
is the next fragment the reverse strand?
is this the last fragment?
is this a secondary alignment?
did this read fail quality controls?
is this read a PCR or optical duplicate?
Source: Variant Calling & Annotation
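Each of these questions corresponds to one bit of the FLAG number, so a FLAG can be decoded with bitwise tests. The sketch below uses the bit values from the SAM specification; the short names and the function name are chosen for this example. Two bits not listed above are included for completeness: 0x40 (first fragment) and 0x800 (supplementary alignment).

```python
SAM_FLAG_BITS = [
    (0x1, "multiple fragments (paired)"),
    (0x2, "all fragments properly aligned"),
    (0x4, "fragment unmapped"),
    (0x8, "next fragment unmapped"),
    (0x10, "reverse strand"),
    (0x20, "next fragment on reverse strand"),
    (0x40, "first fragment"),
    (0x80, "last fragment"),
    (0x100, "secondary alignment"),
    (0x200, "failed quality controls"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """List the properties whose bit is set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]
```

For example, FLAG 99 = 1 + 2 + 32 + 64 describes the first fragment of a properly aligned pair whose mate maps to the reverse strand.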

15 Possible reasons for a mismatch
True SNP
Error generated in library preparation
Base calling error: may be reduced by better base calling methods, but cannot be eliminated
Misalignment (mapping error): local re-alignment to improve mapping
Error in reference genome sequence
Partly from

Variant Calling: Principles
Naïve approach (used in early NGS studies):
Filter base calls according to quality
Filter by frequency
Typically, a quality filter of Phred Q20 was used (i.e., a probability of error of at most 1%). Then, frequency thresholds were applied to the frequency of the non-ref base, f(b).
The frequency heuristic works well if the sequencing depth is high, so that the probability of a heterozygous nucleotide falling outside of the 20% - 80% region is low.
Problems with the frequency heuristic:
For low sequencing depth, it leads to undercalling of heterozygous genotypes
Use of a quality threshold leads to loss of information on individual read/base qualities
It does not provide a measure of confidence in the call
In parts from: compbio.charite.de/contao/index.php/genomics.html
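The naïve approach can be written down in a few lines, which also makes its weaknesses visible. In the sketch below the exact cut-offs are an assumption derived from the 20% - 80% region mentioned above: below 20% non-reference bases the site is called homozygous reference, between 20% and 80% heterozygous, above 80% homozygous non-reference. The function name is made up for this illustration.

```python
from collections import Counter


def naive_genotype(bases, qualities, ref_base, min_quality=20, low=0.2, high=0.8):
    """Call a genotype at one position from quality-filtered base counts.

    Returns (call, f), where f is the frequency of the most common
    non-reference base among the bases that pass the quality filter.
    The thresholds low and high are assumptions for this sketch.
    """
    # Step 1: hard quality filter, discarding all bases below min_quality.
    kept = [b for b, q in zip(bases, qualities) if q >= min_quality]
    if not kept:
        return None, 0.0
    # Step 2: frequency of the most common non-reference base.
    non_ref = Counter(b for b in kept if b != ref_base)
    if not non_ref:
        return "homozygous reference", 0.0
    _, alt_count = non_ref.most_common(1)[0]
    f = alt_count / len(kept)
    # Step 3: fixed frequency thresholds.
    if f < low:
        return "homozygous reference", f
    if f <= high:
        return "heterozygous", f
    return "homozygous non-reference", f
```

At a truly heterozygous site covered by only four reads, all four can show the reference base by chance (probability 1 in 16), and the function then reports "homozygous reference" without any indication of how uncertain that call is.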

16 Variant Calling: Principles
Today's variant callers rely on probability calculations
Use of Bayes' theorem
E.g. MAQ: one of the first widely used read mappers and variant callers
Takes into account a quality score for the whole read alignment & the quality of the base at the individual position
Calls the most likely genotype given the observed substitutions
A reliability score can be calculated

Variant Calling & Annotation: Popular Tools
SAMtools (mpileup & BCFtools)
GATK
VarScan2
FreeBayes
MAQ
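A strongly simplified Bayesian genotype caller shows the principle. This is not MAQ's actual model: it ignores the alignment quality of the read, treats all reads as independent, considers only one alternative allele, and uses prior probabilities that are pure assumptions chosen for the example. The function name is hypothetical as well.

```python
import math


def genotype_posteriors(bases, qualities, ref, alt, priors=None):
    """Posterior probability of each diploid genotype at one position.

    Each observed base comes from the ref or the alt allele with a
    probability given by the genotype, and is read incorrectly with
    the error probability implied by its Phred score.
    """
    if priors is None:
        # Assumed priors for illustration only.
        priors = {"ref/ref": 0.9985, "ref/alt": 0.001, "alt/alt": 0.0005}
    alt_fraction = {"ref/ref": 0.0, "ref/alt": 0.5, "alt/alt": 1.0}
    log_posterior = {}
    for genotype, frac in alt_fraction.items():
        total = math.log(priors[genotype])
        for base, q in zip(bases, qualities):
            # Cap the error at 0.75, where a base carries no information.
            e = min(10 ** (-q / 10), 0.75)
            p_if_ref = 1 - e if base == ref else e / 3  # true allele is ref
            p_if_alt = 1 - e if base == alt else e / 3  # true allele is alt
            total += math.log((1 - frac) * p_if_ref + frac * p_if_alt)
        log_posterior[genotype] = total
    # Normalize in a numerically stable way (Bayes' theorem denominator).
    top = max(log_posterior.values())
    unnormalized = {g: math.exp(v - top) for g, v in log_posterior.items()}
    norm = sum(unnormalized.values())
    return {g: v / norm for g, v in unnormalized.items()}
```

The most likely genotype is `max(post, key=post.get)`, and a reliability score can be derived from its posterior, for example as the Phred-scaled probability that the call is wrong. Unlike the hard Q20 filter, low-quality bases still contribute here, only with less weight.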

17 VCF = Variant Call Format
VCF = Variant Call Format / BCF = binary version

dbSNP and snpEff
dbSNP = the Single Nucleotide Polymorphism database at NCBI
Different collections of SNPs are available: all humans, human subpopulations, different clinical significance ( docs/human_variation_vcf).
snpEff is a program that can annotate a collection of SNVs according to information available in dbSNP and information extracted from the location of the SNV (exon, intron, silent/nonsense mutation etc.)
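A VCF data line is tab-separated and starts with eight fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. The sketch below reads these columns into a dictionary; the function name is made up for this example, the record in the test is invented data, and per-sample genotype columns are ignored.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line."""
    if line.startswith("#"):
        raise ValueError("header line, not a variant record")
    chrom, pos, var_id, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    info_dict = {}
    if info != ".":
        for entry in info.split(";"):
            key, _, value = entry.partition("=")
            # Entries without "=" are flags, e.g. DB (variant is in dbSNP).
            info_dict[key] = value if value else True
    return {
        "chrom": chrom,
        "pos": int(pos),                                # 1-based position
        "id": None if var_id == "." else var_id,        # e.g. a dbSNP rs number
        "ref": ref,
        "alt": alt.split(","),                          # one or more alternative alleles
        "qual": None if qual == "." else float(qual),   # Phred-scaled call quality
        "filter": filt,
        "info": info_dict,
    }
```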

18 Variant Calling & Annotation pipeline (practical course)
Mapped input reads (as SAM or BAM files)
SAMtools mpileup: analyze mismatches & compute likelihoods of SNP etc.
BCFtools view: does the actual calling
Concatenate BCF files
Index BCFs for fast access
Convert BCF to VCF
SnpSift Filter: e.g. according to quality, coverage etc.
SnpSift Annotate: according to info from dbSNP
SnpEff: annotate the effect of the change (e.g. stop/frameshift mutation, silent mutation), location in exon/intron/splice site etc.
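What the filter step does can be illustrated in plain Python. The sketch below is not SnpSift; it only mimics a filter on call quality and coverage, assuming the read depth is stored in the INFO field under the key DP. The function name and the default thresholds are arbitrary choices for the example.

```python
def filter_vcf_lines(lines, min_qual=20.0, min_depth=10):
    """Yield header lines and those records that pass both thresholds.

    Records without a QUAL value or without a numeric DP entry in the
    INFO column are dropped.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep all header lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        # INFO is a ";"-separated list of KEY=VALUE entries and flags.
        info = dict(entry.partition("=")[::2] for entry in fields[7].split(";"))
        depth = info.get("DP", "")
        if qual == "." or not depth.isdigit():
            continue
        if float(qual) >= min_qual and int(depth) >= min_depth:
            yield line
```

Because the function is a generator, it can be applied to an open file handle and the surviving lines written straight to a new VCF file.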
