IDENTIFYING A DISEASE CAUSING MUTATION

Size: px
Start display at page:

Download "IDENTIFYING A DISEASE CAUSING MUTATION"

Transcription

1 IDENTIFYING A DISEASE CAUSING MUTATION Targeted resequencing MARCELA DAVILA 3/MZO/2016

2 Core Facilities at Sahlgrenska Academy

3 Statistics Software

4 Increasing statistical and bioinformatics knowledge Personalized training (software/programming) Courses Genomics and Advanced NGS data analysis Perl for life science researchers Unix with applications to NGS data Seminars and workshops

5 Supporting local bioinformaticians

6 Sanger sequencing Dye-labeled terminator Capillar electrophoresis C h r o m a t o DNA template Laser beam g r a m

7 Next Generation Sequencing Roche 454 Solexa Illumina SOLiD Life Technologies Ion Torrent Life Technologies Qiagen DNA library preparation Amplification (epcr, bridge PCR) Sequencing reaction Imaging Decoding

8 Third Generation Sequencing Single molecule- real time No optics Increased sequencing speed Nanopore Ion Torrent SMRT

9 Sequencing Costs 2nd NGS Sanger HGP 3rd NGS HiseqX

10 Illumina workflow Library prep Cluster generation Sequencing, imaging and base calling Fastq files

11 Fastq format :instrument:run:flowcell:lane:tile:x:y pair:fail:control:index 2) sequence 3) marker 4) quality 1:N:0:TACAGC 2) GCATTTTAGTAGAACCAGNCATTTCCCCCNACNTCNNTNCGNNANNNNTAA 3)

12 Phred quality score Probability that the base has been erroneously called Phred score P(called wrong) 10 1 in 10 90% 20 1 in % 30 1 in ,9% Accuracy base call 40 1 in ,99% 50 1 in ,999% Phred = 10 LIMS Phred = 50

13 Sequencing run Quality QScore Distribution A succesful run should have 80% >= Q30 Illumina Sequencing Analysis Viewer

14 Sequencing run Quality Data by Cycle Illumina Sequencing Analysis Viewer

15 Sequencing run Quality Demultiplexing Illumina Sequencing Analysis Viewer - CASAVA

16 Different recepies Single end (SE) Paired-end (PE) R1 R bp Mate-pair (MP) R1 R2 2-5 kb

17 Data handling workflow Exome-seq Quality Check Quality Filter Mapping to ref genome De novo assembly Methyl-seq SNV detection CpG patterns Metagenomics ChIP-Seq RNA-seq Taxa summary Peak detection Transcript estimation

18 Quality check with FastQC

19 Per base sequence content A Exome-seq BS-seq C B bad adapter clipping RNA-seq D

20 Per base sequence content Amplicon seq

21 Human Mouse Rat Ecoli Yeast PhiX Adapters Vectors Wasp No hits Human Mouse Rat Ecoli Yeast PhiX Adapters Vectors Wasp No hits %mapped Contamination check with FastQScreen Good Bad

22 Quality Filter with PRINSeq Ambiguous 1:N:0: GCATTTTAGTAGAACCAGNCATTTCCCCCNACNTCNNTNCGNNANNNNTAA X nts Low quality TRIM-galore

23 Data handling workflow Quality Check Quality Filter Exome-seq Mapping to ref genome De novo assembly Methyl-seq SNV detection CpG patterns Metagenomics ChIP-Seq RNA-seq Taxa summary Peak detection Transcript estimation

24 Mapping CTACTACATCGATCTACGCAGCTACTACACGTGCTGGGACGC REF TCGATCGACG CACGTGCTGG ATCGAGCGAC TGCTGGAACGC TACATCGATC CACGTGCTGGAAC CTACTACA TCGACGC CTACTACA GGAACGC CTACT CGACGCA CTACT TGGAACGC READS WHERE to place the reads? a) Unique reads b) Everywhere possible c) Choose one randomly d) Use pair-end data mean DNA fragment size: 40 Bfast, BioScope, Bowtie, BWA, CLC bio, CloudBurst, Eland/Eland2, GenomeMapper, GnuMap, Karma, MAQ, MOM, Mosaik, MrFAST/MrsFAST, NovoAlign, PASS, PerM, RazerS, RMAP, SSAHA2, Segemehl, BAM files

25 Mapping HOW to place the reads? Ungapped, Gapped RNA-seq Exome

26 IGV Integrative Genome Viewer coverage reads My BAM gene

27 UCSC browser gene variants My BAM My VCF Variation track

28 Data handling workflow Quality Check Quality Filter Exome-seq Mapping to ref genome De novo assembly Methyl-seq SNV detection CpG patterns Metagenomics ChIP-Seq RNA-seq Taxa summary Peak detection Transcript estimation

29 Monogenic diseases Modifications of a single gene over 10,000 of human diseases (½ have a gene associated) DISEASE GENE MUTATION Thalassaemia HBB frameshift Sickle cell anemia HBB G6V Cystic Fibrosis CFTR G542X Fragile X syndrome FMR1 CGG expansion Huntington s HTT CAG +36 repeats UTRs Coding regions Splice sites/branch site

30 Enrichment kits

31 Enrichment kits mirna UTR s

32 Realignment and recalibration Correct alignments due to the presence of indels Differenciate between polymorphisms and sequencing errors GATK

33 Variant calling CTACTACATCGATCTACGCAGCTACTACACGTGCTGGGACGC REF TCGATCGACG CACGTGCTGG CTACT CGACGCAGATACT ATCGAGCGAC TGCTGGAACGC TACATCGATC CACGTGCTGGAAC CTACTACA TCGACGC CTACTACA READS Is it a variant? What is the most likely genotype? SOAP2, samtools, GATK, Beagle, CRISP, Dindel, FreeBayes, SeqEM, VarScan, Mutect VCF files

34 Amplicon, quite noisy Variant calling

35 BODY HEADER VCF format Variant call format

36 Variant annotation CTACTACATCGATCTACGCAGCTACTACACGTGCTGGGACGC REF TCGATCGACG CACGTGCTGG CTACT CGACGCAGATACT ATCGAGCGAC TGCTGGAACGC TACATCGATC CACGTGCTGGAAC CTACTACA TCGACGC CTACTACA In which gene is it located? Name, Description, OMIM, Pathway, GO, Expression profiles... Is there any AA change? GAA -> GAG = E->E GTT -> CTT = V->L TGG -> TGA = W->X TGA -> CGA = X->R Where in the gene is it located? Intron, exon, UTR, intergenic region, splice site What impact does the AA change have? Damaging, benign Is it a known SNP? CSV files READS Annovar, SIFT, PP2, dbsnp, GO, KEGG, OMIM 1000G

37 Coverage

38 Variant Analysis Ingenutity Variant Analysis (IVA)

39 Identification of disease causing mutation

40 Alpers syndrome: progressive neurodegenerative dissorder POLG1 Alpers Huttenlocher FARS2 encoding enzyme to charge mt trna with Phe 19 patients: 6 had POLG1 mutations For this study:

41 Exome sequencing Mutations in PARS2 (Pro) and NARS2 (Asn)

42 NARS NARS2 P. horokoshii monomer-monomer interaction

43 PARS PARS2