Approaches for the Assessment of Chromosomal Alterations using Copy Number and Genotype Estimates

1 Approaches for the Assessment of Chromosomal Alterations using Copy Number and Genotype Estimates
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
September 20, 2007

2 Karyotypes

3 Trisomy

4 Karyotypes General Cytogenetics Information

5 FISH Courtesy of the Pevsner Laboratory

6 SNP chip data

7 Deletion

8 Amplification

9 Uniparental Isodisomy

10 Cancer samples

11 Mosaicism

12 Mosaicism

13 Estimation
1. By SNP: Estimate genotype and copy number for each SNP.
2. Within a sample: Borrow strength between SNPs to infer regions of LOH and copy number changes.
3. Between samples: Compare normal and disease populations to find chromosomal alterations associated with disease.

14 Affymetrix genotype estimation
Relative allele signal (RAS): Liu et al. (2003), Algorithms for large scale..., Bioinformatics 19:
Dynamic model based algorithms (DM): Di et al. (2005), Dynamic model based algorithms..., Bioinformatics 21:
Robustly fitted linear models (RLMM): Rabbee and Speed (2006), A genotype calling algorithm..., Bioinformatics 22:
BRLMM: Affymetrix white paper
CRLMM: Carvalho et al. (2007), Exploration, normalization..., Biostatistics 8:
SNiPer-HD: Hua et al. (2007), Improved genotype calling..., Bioinformatics 23:

15 Affymetrix copy number estimation
Copy number analysis tool (CNAT)
Copy number analysis with regression and tree (CARAT): Huang et al. (2006), CARAT: A novel method..., BMC Bioinformatics 7: 83.
Probe-level allele specific quantitation (PLASQ): LaFramboise et al. (2007), PLASQ: A generalized..., Biostatistics 8:
CN-CRLMM: Wang et al. (2007), Estimating genome-wide copy..., RECOMB proceedings (to appear).

16 Statistical analysis of SNPchip data
For the analysis of SNP chip data, i.e. to infer LOH regions and to estimate copy number changes, the dChip software and methods are the ones most widely used in the literature. I think that this is because a) they work pretty well, and b) the authors have beaten others to the line.
Copy number analysis: Zhao et al. (2004), An integrated view of copy..., Cancer Research 64:
Inferring LOH: Lin et al. (2004), dChipSNP: significance curve..., Bioinformatics 20:
Beroukhim et al. (2006), Inferring loss-of-heterozygosity..., PLoS Comp Biol, May, 2(5).

17 SNPchip S4 classes and methods

18 The structure of the data we observe At each SNP, we observe a noisy measure of the true copy number and genotype (and possibly also measures of confidence in those estimates).

19 Another HMM...?
Novel (and, we believe, important) HMM features:
1. Model the observation sequence of genotype calls and copy number jointly (Vanilla).
2. Integrate confidence estimates of the genotype calls and copy number estimates (ICE).
QuantiSNP also models genotype calls and copy number jointly! Colella et al. (2007), QuantiSNP: an objective Bayes hidden-Markov model..., Nucleic Acids Res 35(6):

20 The Vanilla HMM components
Observations ĈN and ĜT
Hidden states
Initial state probability distribution
Transition probabilities
Emission probabilities
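To illustrate how the five components listed above fit together, here is a minimal Python sketch of Viterbi decoding, one standard way to recover the most likely hidden-state path. The slides do not say which decoding algorithm is used, so treat the choice as an assumption. The two-state setup, the genotype-only observations, and every probability below are hypothetical values chosen for illustration.

```python
import math


def _log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, init, trans, emit):
    """Return the most likely hidden-state path.

    init[s]        initial state probability
    trans(i, s, t) probability of moving from state s at SNP i-1 to t at SNP i
    emit(s, obs)   emission probability of obs in state s
    """
    score = {s: _log(init[s]) + _log(emit(s, observations[0])) for s in states}
    back = []
    for i in range(1, len(observations)):
        new_score, pointers = {}, {}
        for t in states:
            # Best predecessor of state t at SNP i.
            best = max(states, key=lambda s: score[s] + _log(trans(i, s, t)))
            new_score[t] = (score[best] + _log(trans(i, best, t))
                            + _log(emit(t, observations[i])))
            pointers[t] = best
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy model: genotype calls only, two hidden states.
STATES = ["Normal", "LOH"]
INIT = {"Normal": 0.9, "LOH": 0.1}
EMIT = {"Normal": {"HET": 0.3, "HOM": 0.7}, "LOH": {"HET": 0.001, "HOM": 0.999}}


def trans(i, s, t):
    # Fixed switching probability here; a distance-dependent rule could use i.
    return 0.99 if s == t else 0.01


def emit(s, obs):
    return EMIT[s][obs]


calls = ["HET"] * 3 + ["HOM"] * 40 + ["HET"] * 3
path = viterbi(calls, STATES, INIT, trans, emit)
```

Working in log space avoids numerical underflow on long SNP sequences. The `trans` callback takes the SNP index so that a distance-dependent transition rule can be plugged in without changing the decoder.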

21 Hidden states

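22 Transition probabilities
Following suggestions in the literature, we model the transition probabilities as a function of the distance $d$ between SNPs. Specifically, let $\theta(d) = 1 - e^{-2d}$ denote the probability that SNP $i$ is not informative ($I^c$) for the SNP at $i+1$. For example, for a transition from state $s$ at SNP $i$ to a different state $s'$ at SNP $i+1$:

$$\begin{aligned}
\tau_{s \to s'}(d) &= P\{s'_{i+1} \mid s_i, d\} \\
&= P\{s'_{i+1}, I \mid s_i, d\} + P\{s'_{i+1}, I^c \mid s_i, d\} \\
&= P\{s'_{i+1} \mid I, s_i, d\}\, P\{I \mid s_i, d\} + P\{s'_{i+1} \mid I^c, s_i, d\}\, P\{I^c \mid s_i, d\} \\
&= P\{s'\}\, \theta(d).
\end{aligned}$$

The first term vanishes because the state cannot change when SNP $i$ is informative for SNP $i+1$.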
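As a rough illustration of the distance-dependent transition rule above, the following Python sketch builds a full transition matrix from $\theta(d)$. The marginal state probabilities and the unit of $d$ are assumptions made for the example and do not come from the slides. The diagonal entry is set to whatever probability remains, which is what the decomposition implies for staying in the same state.

```python
import math


def theta(d):
    """Probability that SNP i is NOT informative for SNP i+1 at distance d."""
    return 1.0 - math.exp(-2.0 * d)


def transition_matrix(d, state_probs):
    """Return tau[s][t], the probability of moving from state s to state t.

    Off-diagonal entries follow tau(d) = P{t} * theta(d);
    the diagonal takes the remaining probability mass.
    """
    th = theta(d)
    tau = {}
    for s in state_probs:
        row = {t: p * th for t, p in state_probs.items() if t != s}
        row[s] = 1.0 - sum(row.values())
        tau[s] = row
    return tau


# Hypothetical marginal state probabilities (illustrative only).
STATE_PROBS = {"Deletion": 0.01, "Normal": 0.96, "LOH": 0.02, "Amplification": 0.01}

tau_near = transition_matrix(0.001, STATE_PROBS)  # closely spaced SNPs
tau_far = transition_matrix(1.0, STATE_PROBS)     # widely spaced SNPs
```

Closely spaced SNPs give a matrix close to the identity, so the state rarely changes. Widely spaced SNPs push each row toward the marginal state probabilities.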

23 Emission probabilities
We assume conditional independence between the copy number estimates and the genotype calls. For example, for a hidden state $s$:

$$f(\widehat{CN}, \widehat{GT} \mid s) = f(\widehat{CN} \mid s)\, f(\widehat{GT} \mid s) = \beta_s\{\widehat{CN}\}\, \beta_s\{\widehat{GT}\}.$$
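The factorization above can be sketched in a few lines of Python. The slides do not specify the form of either factor, so this example assumes a Gaussian density for the copy number estimate and a two-category distribution (HOM or HET) for the genotype call. All parameter values are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical per-state parameters: (mean, sd) of the copy number estimate.
CN_PARAMS = {
    "Deletion": (1.0, 0.3),
    "Normal": (2.0, 0.3),
    "LOH": (2.0, 0.3),
    "Amplification": (3.0, 0.3),
}

# Hypothetical probability of a homozygous call in each state.
P_HOM_CALL = {"Deletion": 0.999, "Normal": 0.7, "LOH": 0.999, "Amplification": 0.7}


def emission(state, cn_hat, gt_hat):
    """Joint emission under conditional independence: f(CN | s) * f(GT | s)."""
    mean, sd = CN_PARAMS[state]
    f_cn = normal_pdf(cn_hat, mean, sd)
    p_hom = P_HOM_CALL[state]
    f_gt = p_hom if gt_hat == "HOM" else 1.0 - p_hom
    return f_cn * f_gt
```

With these assumed parameters, Normal and LOH share the same copy number density, so only the genotype factor separates them. This is why the two observation types are modeled jointly.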

24 More information
The confidence in genotype calls can differ substantially between SNPs!
[Figure: Sense vs. Antisense; panels: Concordant (AA, AB, BB) and Discordant (called BB (AB)).]

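25 Integrating confidence estimates for genotype calls
Let $S_{\widehat{GT}}$ be the confidence score for the genotype estimate. We can estimate the following densities from HapMap:

$$f\{S_{\widehat{HOM}} \mid \widehat{HOM}, HOM\}, \quad f\{S_{\widehat{HOM}} \mid \widehat{HOM}, HET\}, \quad f\{S_{\widehat{HET}} \mid \widehat{HET}, HOM\}, \quad f\{S_{\widehat{HET}} \mid \widehat{HET}, HET\}.$$

Note, with $L$ denoting the Loss state:

$$f\{S_{\widehat{HOM}} \mid \widehat{HOM}, L\} = f\{S_{\widehat{HOM}} \mid \widehat{HOM}, HOM\}, \qquad f\{S_{\widehat{HET}} \mid \widehat{HET}, L\} = f\{S_{\widehat{HET}} \mid \widehat{HET}, HOM\}.$$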

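26 Emission Probabilities - Loss
Recall that, for a hidden state $s$,

$$f(\widehat{CN}, \widehat{GT} \mid s) = f(\widehat{CN} \mid s)\, f(\widehat{GT} \mid s) = \beta_s\{\widehat{CN}\}\, \beta_s\{\widehat{GT}\}.$$

If the state for a particular SNP is Loss ($L$), we have

$$\beta_L\{\widehat{GT}, S_{\widehat{GT}}\} = f\{\widehat{GT} \mid L\}\, f\{S_{\widehat{GT}} \mid \widehat{GT}, L\}.$$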

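27 Emission Probabilities - Retention
For retention ($R$), the true genotype can be HET or HOM:

$$\begin{aligned}
\beta_R\{\widehat{GT}, S_{\widehat{GT}}\} &= f\{\widehat{GT} \mid R\}\, f\{S_{\widehat{GT}} \mid \widehat{GT}, R\} \\
&= f\{\widehat{GT} \mid R\} \left[ f\{S_{\widehat{GT}}, HOM \mid \widehat{GT}, R\} + f\{S_{\widehat{GT}}, HET \mid \widehat{GT}, R\} \right] \\
&= f\{\widehat{GT} \mid R\} \left[ f\{S_{\widehat{GT}} \mid HOM, \widehat{GT}, R\}\, f\{HOM \mid \widehat{GT}, R\} + f\{S_{\widehat{GT}} \mid HET, \widehat{GT}, R\}\, f\{HET \mid \widehat{GT}, R\} \right] \\
&= f\{\widehat{GT} \mid R\} \left[ f\{S_{\widehat{GT}} \mid HOM, \widehat{GT}\}\, f\{HOM \mid \widehat{GT}, R\} + f\{S_{\widehat{GT}} \mid HET, \widehat{GT}\}\, f\{HET \mid \widehat{GT}, R\} \right].
\end{aligned}$$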
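The Loss and Retention emission probabilities above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation. The score densities are simple triangular stand-ins for the densities that would be estimated from HapMap, scores are assumed to lie in [0, 1], and every number is hypothetical. The Loss branch assumes that the true genotype under Loss is homozygous.

```python
def score_density(score, call, truth):
    """Hypothetical stand-in for f{S | call, truth} with scores in [0, 1].

    Correct calls favor high scores; incorrect calls favor low scores.
    """
    return 2.0 * score if call == truth else 2.0 * (1.0 - score)


# Hypothetical f{call | state}.
F_CALL = {
    "Loss": {"HOM": 0.999, "HET": 0.001},
    "Retention": {"HOM": 0.7, "HET": 0.3},
}

# Hypothetical f{true genotype is HOM | call, Retention}.
P_TRUE_HOM = {"HOM": 0.99, "HET": 0.02}


def ice_genotype_emission(state, call, score):
    """Emission probability of a genotype call together with its confidence score."""
    if state == "Loss":
        # Assumption: under Loss the true genotype is HOM.
        return F_CALL["Loss"][call] * score_density(score, call, "HOM")
    # Retention: mix over the unknown true genotype (HOM or HET).
    p_hom = P_TRUE_HOM[call]
    mix = (score_density(score, call, "HOM") * p_hom
           + score_density(score, call, "HET") * (1.0 - p_hom))
    return F_CALL["Retention"][call] * mix
```

Under these assumptions, a heterozygous call with a low confidence score counts as much weaker evidence against Loss than a heterozygous call with a high score. That is the practical effect of integrating confidence estimates.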

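28 Copy number for chromosome
[Figure: copy number estimates, with regions labeled A, B, C, D, E.]

29 Copy number for chromosome
[Figure: copy number estimates, with regions labeled A, B, C, D, E.]

30 Copy number and genotype calls for chromosome
[Figure: copy number estimates and genotype calls, with regions labeled A, B, C, D, E.]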

31 Vanilla and ICE HMMs for genotype and copy number
[Figure: predictions from the Vanilla (Van) and ICE HMMs along the chromosome (x-axis in Mb); legend: Deletion, Normal, LOH, Amplification; regions labeled A, B, C, D, E.]

32 A HapMap sample
[Figure: predictions from the Vanilla (Van) and ICE HMMs (x-axis in Mb); legend: Deletion, Normal, LOH, Amplification.]

33 Many HapMap samples

34 SNP Trio

35 De novo deletion

36 Homozygous and hemizygous deletions
[Figure: panels labeled GF, GM, M, A, B, S.]

37 Homozygous and hemizygous deletions
Family members: GF = Grandfather, GM = Grandmother, M = Mother, A = Aunt, B = Brother, S = Sister.
[Table: genotype calls (AA, AB, BB, NC) with numeric values for each family member; flattened values follow.]
AB 0.01 BB 0.05 AB AB AB 0.06 BB AA 0.27 NC AA AA AB 0.12 BB AB 0.15 NC AA AA AB BB AB NC AA AA AA 0.30 AA BB 0.20 NC NC BB BB 0.22 BB AB 0.03 NC BB BB AB 0.09 AA AB NC BB BB AB AA BB 0.01 NC BB 0.04 BB BB 0.14 NC AB 0.17 NC AA AA AA AA AB 0.01 NC AA AA AA 0.16 AA AB NC BB BB AB 0.13 AA AB 0.01 NC BB BB AB AA BB 0.16 NC BB BB BB 0.10 BB AB 0.06 NC AA AA AB 0.07 BB AA 0.17 NC AA AA AB BB BB 0.02 NC BB BB AB 0.13 AA AA 0.21 NC AA AA AB 0.20 BB BB 0.05 BB 0.11 BB 0.15 BB 0.29 AB 0.13 AB -0.01


40 Acknowledgments
Rob Scharpf, Giovanni Parmigiani, Rafael Irizarry, Benilton Carvalho, Wenyi Wang, Jonathan Pevsner, Nate Miller, Eli Roberson, Jason Ting

41 References
Carvalho B, Bengtsson H, Speed TP, Irizarry RA (2007). Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics, 8(2):
Scharpf RB, Ting JC, Pevsner J, Ruczinski I (2007). SNPchip: R classes and methods for SNP array data. Bioinformatics, 23(5):
Scharpf RB, Parmigiani G, Pevsner J, Ruczinski I (2007). Hidden Markov models for the assessment of chromosomal alterations using high-throughput SNP arrays (under revision).
Ting JC, Ye Y, Thomas GH, Ruczinski I, Pevsner J (2006). Analysis and visualization of chromosomal abnormalities in SNP data with SNPscan. BMC Bioinformatics, 7(1):25.
Ting JC, Roberson ED, Miller N, et al, Ruczinski I, Thomas GH, Pevsner J (2007). Visualization of uniparental inheritance, Mendelian inconsistencies, deletions and parent of origin effects in single nucleotide polymorphism trio data with SNPtrio. Human Mutation (in press).
Wang W, Carvalho B, Miller N, Pevsner J, Chakravarti A, Irizarry RA (2006). Estimating genome-wide copy number using allele specific mixture models. JHU Biostatistics Working Papers, #122.

42 iruczins/