UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009


2 Overview: introduction; examples; base calling method and parameters; read filtering; read classification; detailed alignment; alignment analysis; output generation

3 Introduction: the basic problem is to distinguish polymorphism from sequencing error; use quality measures; use redundancy; use knowledge about the data source

4 Examples: retinitis pigmentosa; hypertrophic cardiomyopathy; HSA 21q genotyping

5 Retinitis pigmentosa: inherited eye disease; linkage analysis; PRPF31 mutation; incomplete penetrance; attempt sequencing

6 PRPF31 example (figure; recoverable labels: c>g, 13, 14)

7 PRPF31 example

8 PRPF31 example, zoom

9 PRPF31 example, MFA

10 Examples: retinitis pigmentosa; hypertrophic cardiomyopathy; HSA 21q genotyping

11 Hypertrophic cardiomyopathy: small collection of known genes; PCR-amplify gene pieces; sequence

12 Small deletion

13 Examples: retinitis pigmentosa; hypertrophic cardiomyopathy; HSA 21q genotyping

14 Exome sequencing: extract selected genomic parts; sequence collected pieces

15 Coverage on HSA 21q

16 Coverage detail, HSA 21q

17 HSA 21q, HapMap NA12782

18 Base calling Rolexa FastQ...

19 Read filtering: entropy; quality values (position)
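
A minimal sketch of an entropy and quality filter. The slide does not say which entropy measure or thresholds are used, so this assumes Shannon entropy of the single-base composition and a mean quality cutoff. `min_entropy` and `min_mean_quality` are hypothetical values chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(read):
    """Shannon entropy (bits per base) of the base composition of a read."""
    total = len(read)
    if total == 0:
        return 0.0
    counts = Counter(read.upper())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def passes_filters(read, quals, min_entropy=1.0, min_mean_quality=20):
    """Keep a read if it is not low-complexity and its mean quality is high enough.

    Both thresholds are assumptions for this sketch, not values from the course.
    """
    if not read or not quals:
        return False
    mean_quality = sum(quals) / len(quals)
    return shannon_entropy(read) >= min_entropy and mean_quality >= min_mean_quality
```

A homopolymer such as `AAAAAAAA` has entropy 0 and is rejected, while a read with all four bases equally represented reaches the maximum of 2 bits.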

20 Filtering example: Rolexa base calling; filter reads for length and ambiguity; ambiguity per base code: ACGTU -> 1, KMRSWY -> 2, BDHV -> 3, N -> 4; minimum length 20; maximum ambiguity 81
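
The filter above can be sketched as follows. The per-code values match the number of bases each IUPAC code stands for. The slide does not say how they are combined over a read, so this sketch assumes the read ambiguity is their product, that is, the number of distinct sequences the read could represent. That assumption fits the limit of 81 = 3^4, but it is not confirmed by the source.

```python
# Per-base ambiguity values taken from the slide.
AMBIGUITY = {}
for codes, value in (("ACGTU", 1), ("KMRSWY", 2), ("BDHV", 3), ("N", 4)):
    for code in codes:
        AMBIGUITY[code] = value

def read_ambiguity(read):
    """Product of per-base ambiguity values (assumed combination rule)."""
    score = 1
    for base in read.upper():
        score *= AMBIGUITY[base]
    return score

def keep_read(read, min_length=20, max_ambiguity=81):
    """Apply the length and ambiguity limits given on the slide."""
    return len(read) >= min_length and read_ambiguity(read) <= max_ambiguity
```

Under this assumption a read with three `N` calls (4 x 4 x 4 = 64) is kept, while a read with four `N` calls (256) is dropped.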

21 Read classification: use fetchgwi against the whole genome; single exact match -> U (unique); multiple exact matches -> R (repeat); no exact match -> M (missed)
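
The U/R/M rule can be illustrated with a toy exact-match counter. This does not show how fetchgwi works or how it is invoked. It is a plain string search over a small in-memory sequence, and searching both strands is an assumption of the sketch.

```python
def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_occurrences(genome, pattern):
    """Count exact, possibly overlapping, occurrences of pattern in genome."""
    count = 0
    start = genome.find(pattern)
    while start != -1:
        count += 1
        start = genome.find(pattern, start + 1)
    return count

def classify_read(read, genome):
    """Return 'U' (unique), 'R' (repeat) or 'M' (missed) from exact match counts."""
    hits = count_occurrences(genome, read)
    rc = reverse_complement(read)
    if rc != read:  # avoid counting palindromic reads twice
        hits += count_occurrences(genome, rc)
    if hits == 1:
        return "U"
    if hits > 1:
        return "R"
    return "M"
```

With the toy genome `GATTACAGGGCCCGATTACATTTT`, the read `CAGGGCC` is classified U, `GATTACA` is classified R, and `ACGTACG` is classified M.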

22 Detailed alignment: use M reads; split the region of interest into chunks (e.g. 300 bp + 40 bp overlap); find reads sharing an identical 12-mer; globally align reads against chunks; filter alignments and retain the good set (e.g. maximum 3 mismatches)
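
A sketch of the chunking and 12-mer seeding steps. It assumes that "300 bp + 40 bp overlap" means chunks start every 300 bp and each extends 40 bp into the next. The function names are hypothetical. The global alignment and the mismatch filter are not shown.

```python
from collections import defaultdict

def split_into_chunks(region, chunk_len=300, overlap=40):
    """Return (start, sequence) pairs; each chunk runs `overlap` bases into the next."""
    chunks = []
    for start in range(0, len(region), chunk_len):
        chunks.append((start, region[start:start + chunk_len + overlap]))
    return chunks

def kmer_index(chunk, k=12):
    """Map every k-mer of the chunk to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(chunk) - k + 1):
        index[chunk[i:i + k]].append(i)
    return index

def candidate_reads(reads, chunk, k=12):
    """Select reads that share at least one identical k-mer with the chunk."""
    index = kmer_index(chunk, k)
    selected = []
    for read in reads:
        if any(read[i:i + k] in index for i in range(len(read) - k + 1)):
            selected.append(read)
    return selected
```

Only the reads returned by `candidate_reads` would then need the more expensive global alignment against that chunk.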

23 Alignment analysis: map retained reads to the full genome; remove the set with better maps outside the region of interest
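
One way to express this step, as a sketch only. It assumes each read comes with a list of genome-wide hits carrying a chromosome, a position, and a mismatch count, and that "better" means strictly fewer mismatches. The data layout and names are hypothetical.

```python
def inside(hit, region):
    """True if the hit lies in region = (chrom, start, end), end exclusive."""
    chrom, start, end = region
    return hit["chrom"] == chrom and start <= hit["pos"] < end

def keep_region_reads(alignments, region):
    """Keep reads whose best hit inside the region is not beaten by a hit outside it."""
    kept = []
    for read_id, hits in alignments.items():
        best_in = min((h["mismatches"] for h in hits if inside(h, region)), default=None)
        best_out = min((h["mismatches"] for h in hits if not inside(h, region)), default=None)
        if best_in is None:
            continue  # read does not map to the region at all
        if best_out is not None and best_out < best_in:
            continue  # a strictly better map exists elsewhere in the genome
        kept.append(read_id)
    return kept
```

Ties are kept here. A stricter variant could also drop reads that map equally well outside the region.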

24 Practical alignment analysis 1 (figure: 12-mers; U, R, and M reads)

25 Practical alignment analysis 2 (figure: 12-mers; U, R, and M reads)

26 Output generation: create multiple sequence alignment; prepare text output in column format; call SNPs (alleles, coverage, etc.)
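
A sketch of calling a SNP from one alignment column. This is not the rule used by the course scripts. The thresholds `min_coverage` and `min_fraction` are hypothetical, and only unambiguous bases are counted.

```python
from collections import Counter

def call_column(ref_base, read_bases, min_coverage=4, min_fraction=0.2):
    """Report alleles and coverage for one column, or None if it matches the reference.

    read_bases is the string of bases that the aligned reads show at this position.
    """
    counts = Counter(b for b in read_bases.upper() if b in "ACGT")
    coverage = sum(counts.values())
    if coverage < min_coverage:
        return None  # too few reads to say anything
    alleles = sorted(b for b, n in counts.items() if n / coverage >= min_fraction)
    if not alleles or alleles == [ref_base.upper()]:
        return None  # no supported allele, or reference allele only
    return {"ref": ref_base.upper(), "alleles": "".join(alleles), "coverage": coverage}
```

A column with four `C` and four `G` reads over a reference `C` is reported with alleles `CG` and coverage 8, which would correspond to a heterozygous site.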

27 Results in CSV files

28 Detailed view in UCSC

29 Results in MFA

30 Script srmap: needs fetch.conf, input chunk, and genomic coordinates; produces MFA and CSV output

31 Script preparejobs: needs genomic coordinates; prepares scripts to process each chunk using srmap

32 Script local2genomic: needs the CSV file produced by srmap; adds genomic coordinates
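
The coordinate conversion itself is simple arithmetic, sketched below. This is not the local2genomic script. The CSV layout, the `position` column name, and the 1-based coordinate convention are all assumptions made for illustration.

```python
import csv
import io

def add_genomic_coordinates(csv_text, chunk_start, pos_column="position"):
    """Append a genomic_position column computed from a chunk-local position.

    Assumes 1-based positions, so local position 1 maps to chunk_start.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    fields = list(reader.fieldnames) + ["genomic_position"]
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["genomic_position"] = chunk_start + int(row[pos_column]) - 1
        writer.writerow(row)
    return out.getvalue()
```

For a chunk starting at genomic position 1000, local position 41 maps to genomic position 1040.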

33 Script collatecsv: needs the CSV file produced by local2genomic; merges chunks back together

34 Script matchgenotype: needs a CSV file produced by srmap, local2genomic, or collatecsv; needs a genotype file, e.g. genotypes_chrmt_yri_r24_nr.b36_fwd.txt.gz; compares detected SNPs with the reference and produces CSV output

35 Exercise data source: ftp://ftp.ncbi.nih.gov:21/pub/tracedb/shortread/sra000271/fastq (locally: UHTS_SNP subdirectory of student accounts)

36 Exercise 1: analyze Illumina reads from NA18507; confirm the HapMap genotype for the mitochondrial genome; choose subsets of the reads and see how coverage and SNPs are affected; (confirm other genomic regions of interest)

37 Exercise 2: analyze paired Illumina reads from NA18507; look at the mitochondrial DNA and explain the apparent gap near coordinates 1-120

38 Exercise 3: analyze paired Illumina reads from NA18507; can you confirm a homozygous 1 kb deletion on chromosome 20 at 61 Mb?

39 Exercise 4: analyze paired Illumina reads from NA18507; can you confirm a complex rearrangement on chromosome 5? What do you expect to see in the pairs?