High-throughput genotyping with microarrays

1 High-throughput genotyping with microarrays
Richard Bourgon
20 June 2008
EBI is an outstation of the European Molecular Biology Laboratory

2 Overview
• Haploid genotyping in S. cerevisiae
  • Early work: single feature polymorphisms
  • Meiotic recombination
  • High-density tiling arrays
  • Genotyping by multivariate semi-supervised clustering
• Human SNP genotyping
  • Assay and array design
  • Observed behavior of probe sets
  • Correcting for PCR artifacts
  • Genotyping by supervised empirical Bayes classification

3 Haploid genotyping in S. cerevisiae

4 Genotyping single feature polymorphisms
• Hybridization efficiency depends on the number and position of mismatches.
• Differential hybridization provides a means of detecting polymorphisms, even when only the reference genome sequence is known.
• Winzeler et al., Science 281(5380), 1998.
• Brem et al., Science 296(5568), 2002.
• Steinmetz et al., Nature 416(6878), 2002.
• Borevitz et al., Genome Research 13(3).

5 Single-probe methods: polymorphism detection
• Winzeler et al. (and others): ANOVA testing $\mu_1 = \mu_2$. Equivalent to a two-sample t-test assuming common variance.
• Borevitz et al.: a moderated t-test using the SAM adjustment.
• Brem et al.: a moderated t-test. Then, cluster all data (parental and segregant), ignoring genotype labels for parents. Discard SFPs for which the clusters don't separate the parental data.
[Figure: log intensity of variant and reference samples, with the t statistic, for a probe with no sequence variation and for a probe covering a polymorphism.]
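The common-variance two-sample t-test named on this slide can be sketched as follows. This is a minimal illustration, not the code used in any of the cited papers; the function name `pooled_t` and the log intensities are hypothetical, and the moderated (SAM-style) variants are not shown.

```python
import math
from statistics import mean, variance


def pooled_t(reference, variant):
    """Two-sample t statistic assuming a common variance in both groups."""
    n1, n2 = len(reference), len(variant)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(reference) + (n2 - 1) * variance(variant)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean(reference) - mean(variant)) / se


# Hypothetical log intensities for one probe in the two parental strains.
reference = [10.1, 10.3, 9.9, 10.2]
variant = [8.0, 8.2, 7.9, 8.1]
t = pooled_t(reference, variant)  # a large positive t suggests a polymorphism under the probe
```

A probe with no sequence variation should give a t statistic near zero, since both strains hybridize equally well.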

6 Single-probe methods: segregant genotyping
• In crosses, offspring genotype is unknown.
• ANOVA and t-test methods use the estimated posterior probability of class membership, with a uniform prior on the classes:
  $$\hat{P}(\text{genotype } g \mid x) = \frac{\hat{p}_g(x)}{\hat{p}_1(x) + \hat{p}_2(x)}.$$
• Brem et al. augment this: $\hat{p}_1$ and $\hat{p}_2$ are estimated from clustered data with offspring and parents combined.
[Figure: variant and reference log intensity densities for a polymorphic probe, and P(Variant | log intensity) plotted against offspring log intensity.]
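The posterior above can be sketched with Gaussian class densities estimated from the parental data. The Gaussian form, the function names, and the example values are assumptions made for illustration; the slide does not say how the densities are estimated.

```python
import math
from statistics import mean, stdev


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def posterior_variant(x, reference, variant):
    """P(variant | x) under a uniform prior on the two genotype classes."""
    p_ref = normal_pdf(x, mean(reference), stdev(reference))
    p_var = normal_pdf(x, mean(variant), stdev(variant))
    # With equal priors, the prior terms cancel and only the densities remain.
    return p_var / (p_ref + p_var)


# Hypothetical parental log intensities for one probe.
reference = [10.1, 10.3, 9.9, 10.2]
variant = [8.0, 8.2, 7.9, 8.1]
p = posterior_variant(8.1, reference, variant)  # close to 1 for this offspring value
```

An offspring intensity midway between two equally spread classes gives a posterior of 0.5, which is the kind of call a later filtering step would discard.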

7 Meiotic recombination

8 Meiosis
• Two rounds of cell division.
• Reduction from diploid to haploid gametes, required for sexual reproduction.
• Recombination
  • Provides physical contact between homologs.
  • Increases genetic diversity, and is the primary determinant of haplotype structure: within-species genetic similarity and difference.
[Figure: paternal and maternal homologs through prophase, metaphase, and anaphase: pairing of duplicated homologs, bivalents lining up on the spindle, meiotic division I, meiotic division II. From Molecular Biology of the Cell, Fourth Edition.]

9 Double-strand break repair
• Recombination initiates with a double-strand break in one DNA molecule.
• Only two DNA molecules (homologs) are shown here.
[Figure: repair pathway diagram with strands labeled 5′ and 3′. Step labels: DSB formation; 5′ to 3′ resection; strand invasion, D-loop formation, DNA synthesis (dotted line); second end capture, synthesis, ligation; Holliday junction resolution; strand displacement; strand annealing, synthesis, ligation; mismatch repair. Outcomes shown: CO and NCO, with heteroduplex, crossover point, and gene conversion tract marked.]

10 Saccharomyces cerevisiae microarray data
• Two strains: S96 and YJM789.
• Tiling microarrays:
  • 6.5 M 5 µm features, tiling non-repetitive S96 every 4 bases.
  • 4% of probes are specific to YJM789 sequence.
• Markers:
  Insertions  Deletions  SNPs    Multiple
  1,433       1,503      47,929  1,197
• Data
  • 25 parental genomic hybridizations.
  • 208 wildtype offspring hybridizations.
  • 20 msh4 offspring hybridizations.
  • 20 mms4 offspring hybridizations.

11 Polymorphism density

12 Meiotic recombination assay overview
[Figure: a diploid S288c/YJM789 hybrid undergoes meiosis, giving haploid spores that carry S288c, YJM789, and recombinant chromosomes; the spores are then genotyped at high resolution. Example sequences: S288c AGCACTGTAACCTATCGCTTCCTCA with S288c-specific probe TGACATTGGATAGCGAAGG; YJM789 AGCACTGTAACCGATCGCTTCCTCA with YJM789-specific probe TGACATTGGCTAGCGAAGG.]

13 Genotyping

14 Tiling arrays, probe sets, and markers
6:                      CTTCACTATTTGTACAGATCGCAAT
5:                  CTAACTTCACTATTTGTACAGATCG
4:              GGCCCTAACTTCACTATTTGTACAG
2:          GACTGGCCCTAACTTCACTATTTGT
1:      GGAGGACTGGCCCTAACTTCACTAT
S96:    CCTCCTGACCGGGATTGAAGTGATAAACATGTCTAGCGTTA
YJM789: CCTCCTGACCGGGATTGAACTGATAAACATGTCTAGCGTTA
3:          GACTGGCCCTAACTTGACTATTTGT
• Probe set: group of probes which each exactly map to a unique location, and which interrogate a common polymorphism.
• Marker: one or more polymorphisms interrogated by the same probe set.
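The probe layout on this slide can be reproduced with a short sketch: 25-mer probes tiled every 4 bases along the S96 sequence, each written as the base-wise complement of its target window (the orientation used on the slide). The function names are hypothetical; the sequences, probe length, and step are taken from the slides.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def tile_probes(target, length=25, step=4):
    """Return (offset, probe) pairs; each probe is the base-wise complement of one window."""
    return [(i, target[i:i + length].translate(COMPLEMENT))
            for i in range(0, len(target) - length + 1, step)]


def probe_set(target, position, length=25, step=4):
    """Probes whose window covers the 0-based polymorphic position."""
    return [(i, p) for i, p in tile_probes(target, length, step)
            if i <= position < i + length]


s96 = "CCTCCTGACCGGGATTGAAGTGATAAACATGTCTAGCGTTA"
yjm789 = "CCTCCTGACCGGGATTGAACTGATAAACATGTCTAGCGTTA"

# Locate the single-base difference between the two strains.
snp = next(i for i, (a, b) in enumerate(zip(s96, yjm789)) if a != b)

# The five S96 probes (1, 2, 4, 5, 6 on the slide) that interrogate this SNP.
s96_probes = probe_set(s96, snp)

# Probe 3 on the slide is the YJM789-specific counterpart of probe 2.
yjm_probe = yjm789[4:29].translate(COMPLEMENT)
```

All five tiled S96 probes overlap the polymorphic base, which is why they can be grouped into one probe set for a single marker.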

15 Marginal probe behavior

16 Semi-supervised genotyping (ssg)
• Multivariate Gaussian mixture model, with a latent class variable.
• Fit by EM algorithm.
[Figure: principal component scatter plots for chromosome I, poly IDs 42 & 43, d = 3, showing S96, YJM789, and segregant samples under the supervised and the semi-supervised fit.]
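A semi-supervised EM fit can be sketched in one dimension as below. This is a deliberate simplification of the multivariate model on the slide: it uses a single feature, two classes, and treats parental samples as labeled and segregants as unlabeled, which is an assumption about how the labels enter the fit. All names and data values are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def semi_supervised_em(labeled, unlabeled, n_iter=50):
    """labeled: list of (x, k) with k in {0, 1}; unlabeled: list of x.

    Returns (params, resp): params[k] = (mean, sd, weight) for class k, and
    resp[i] = estimated probability that unlabeled point i belongs to class 1.
    """
    xs = [x for x, _ in labeled] + list(unlabeled)
    n_lab = len(labeled)
    # Class-1 responsibilities: fixed at 0 or 1 for labeled points, 0.5 initially otherwise.
    r = [float(k) for _, k in labeled] + [0.5] * len(unlabeled)
    params = []
    for _ in range(n_iter):
        # M-step: weighted mean, standard deviation, and mixing weight per class.
        params = []
        for k in (0, 1):
            w = [ri if k == 1 else 1.0 - ri for ri in r]
            total = sum(w)
            mu = sum(wi * x for wi, x in zip(w, xs)) / total
            var = sum(wi * (x - mu) ** 2 for wi, x in zip(w, xs)) / total
            params.append((mu, math.sqrt(var), total / len(xs)))
        # E-step: update responsibilities of the unlabeled points only.
        for i in range(n_lab, len(xs)):
            p0 = params[0][2] * normal_pdf(xs[i], params[0][0], params[0][1])
            p1 = params[1][2] * normal_pdf(xs[i], params[1][0], params[1][1])
            r[i] = p1 / (p0 + p1)
    return params, r[n_lab:]


# Hypothetical data: parents labeled 0 (reference) or 1 (variant), segregants unlabeled.
parents = [(10.0, 0), (10.2, 0), (9.8, 0), (8.0, 1), (8.2, 1), (7.8, 1)]
segregants = [10.1, 9.9, 8.1, 7.9, 10.3, 7.7]
params, resp = semi_supervised_em(parents, segregants)
```

Compared with a purely supervised fit, the unlabeled segregants also contribute to the class means and variances, which matters when there are few parental hybridizations.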

17 Chromosome-level unfiltered ssg results

18 Filtering examples
[Figure: panels A, B, and C; PC1 vs. PC2 scatter plots of S288c, YJM789, and segregant samples.]

19 Possible cross-hybridization

20 Filtering summary
• Array level
  • Excess genotype switching.
  • Large RMS residual (Mahalanobis) to assigned class.
• Probe set level
  • High estimated misclassification rate.
  • Aberrant cluster behavior.
  • Very unusual genotype ratio.
• Call level
  • Intermediate posterior probability of class membership.
  • Large residual to assigned class.
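The call-level filters can be sketched as a pair of checks: one on the posterior probability of the assigned class and one on the Mahalanobis distance to that class. The two-dimensional setting, the function names, and both thresholds are assumptions for illustration; the slides do not give the cutoffs actually used.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from a class with the given mean and 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form with the inverse covariance [[d, -b], [-c, a]] / det.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


def keep_call(posterior, distance, min_posterior=0.95, max_distance=3.0):
    """Keep a call only if it is both confident and close to its assigned class.

    posterior is the posterior probability of the assigned class; the default
    thresholds are hypothetical.
    """
    return posterior >= min_posterior and distance <= max_distance


dist = mahalanobis_2d((3.0, 4.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
ok = keep_call(posterior=0.99, distance=dist)  # rejected: too far from the class center
```

The array-level RMS residual on the slide could be obtained by taking the root mean square of such distances over all calls on one array.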

21 Chromosome-level results

22 Chromosome-level results

23 Chromosome-level results

24 Recombination event inference for one tetrad

25 Chromosome length and events per meiosis
[Figure: recombination events per meiosis against chromosome size, with fitted lines for average CO/tetrad and average NCO/tetrad as functions of size.]
• An average of 91 COs and 46 observed NCOs per meiosis.
• 30% of COs fall between markers: correct 46 NCOs to ~66 NCOs per meiosis.
• Up to 1% of the genome of each meiotic product falls within a recombination interval and is thus subject to gene conversion in a single meiosis.
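One reading of the NCO correction is that NCO events escape detection at the same rate at which COs fall between markers, so the observed count is divided by the detectable fraction. That interpretation is an assumption; the slide gives only the numbers. The arithmetic is:

```python
observed_nco = 46                 # observed NCOs per meiosis (from the slide)
fraction_between_markers = 0.30   # share of COs falling between markers (from the slide)

# Assumption: the same share of NCO events covers no marker and goes unseen.
corrected_nco = observed_nco / (1 - fraction_between_markers)  # about 65.7
```

Rounded, this matches the figure of roughly 66 NCOs per meiosis quoted above.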

26 Event size and marker resolution
[Figure: panels for chromosome VI, tetrad wt_2, and chromosome II, tetrad wt_; horizontal axis: position/index.]

27 Event size and marker resolution
[Figure: panels for chromosome VI, tetrad wt_2, and chromosome II, tetrad wt_2 (horizontal axis: position/index), annotated with repair diagram labels: Holliday junction resolution; second end capture, synthesis, ligation; mismatch repair; crossover point; gene conversion tract.]

28 Recombination event rates
[Figure: marker counts within non-crossover and crossover events. Example marker alignments: CTGACCGGGAT / CTGACGGGGAT and GAAGTGATAA---TGT / GAACTGATAAACATGT.]
• Traditional corrections (e.g., Haldane) use recombination fraction, and adjust for unseen crossovers which occur between widely-spaced markers.
• High-density marker data invert the traditional relationship, placing multiple markers within most recombination events, both crossover (CO) and non-crossover (NCO).
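For reference, the Haldane correction mentioned above maps between recombination fraction and map distance under the assumption of no crossover interference. The sketch below uses the standard form of the map function; the function names are hypothetical.

```python
import math


def haldane_rf(d_morgans):
    """Expected recombination fraction for a map distance d (in Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def haldane_distance(rf):
    """Map distance (in Morgans) implied by an observed recombination fraction."""
    if not 0.0 <= rf < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * rf)


d = haldane_distance(0.10)  # slightly more than 0.10 Morgans
```

The corrected distance exceeds the raw recombination fraction because double crossovers between two widely spaced markers leave no visible recombinant. With several markers inside most events, as on this slide, that correction is no longer the limiting issue.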

29 CO and NCO rates along the chromosome
• Centromeres are uniformly cold, as are the boundaries of large insertions and of a large transversion.
• Many regions show elevated CO or NCO rate, but total event counts are too low for locus-specific testing.
• Global CO/NCO bias testing: an excess of skewed ratios from IMIs corresponding to ~100 kb, or 1.4% of the recombinatorially active fraction (p < 0.0005).
[Figure: CO and NCO rates along chromosomes V, VI, VII, and VIII; position axis in kb.]

30 Correspondence between DSBs and event intervals
• After adjusting for marker spacing, event counts correlate well with recent ChIP-chip experiments for the Spo11 maps. (C Buhler et al., PLoS Biology 5:e324, 2007.)
[Figure: event count and DSB ratio against chromosome III position (kb), with loci HIS4, LEU2, CEN3, ARE1/IMG1, and THR marked.]

31 [Figure: CO and NCO counts and rf plotted against position (kb).]

32 [Figure: CO and NCO counts and rf plotted against position (kb), several panels.]

34 Human SNP genotyping

35 Single nucleotide polymorphisms
• SNPs with a minor allele frequency > 1% occur every few hundred bases, on average, along the human genome.
• SNPs (as well as CNVs, etc.) contribute to variability in human disease development, response to pathogens or drugs, and other important biological characteristics.

36 Affymetrix whole genome sampling assay
• Digest total genomic DNA with a restriction enzyme.
• Ligate generic adapters and amplify by PCR, retaining only fragments 250 bp to 2 kb in size (50-fold complexity reduction).
[Figure: assay workflow with Xba restriction sites: digestion, adapter ligation, single primer amplification, fragmentation and labeling, hyb & scan on standard hardware.]

37 Probe quartets
Offset 0, target TAGCCATCGGTA [A/G] GTACTCAATGAT:
  PM 0 Allele A: ATCGGTAGCCAT T CATGAGTTACTA
  MM 0 Allele A: ATCGGTAGCCAT A CATGAGTTACTA
  PM 0 Allele B: ATCGGTAGCCAT C CATGAGTTACTA
  MM 0 Allele B: ATCGGTAGCCAT G CATGAGTTACTA
Offset +4, target CATCGGTA [A/G] GTA C TCAATGATCAGC:
  PM +4 Allele A: GTAGCCAT T CAT G AGTTACTAGTCG
  MM +4 Allele A: GTAGCCAT T CAT C AGTTACTAGTCG
  PM +4 Allele B: GTAGCCAT C CAT G AGTTACTAGTCG
  MM +4 Allele B: GTAGCCAT C CAT C AGTTACTAGTCG
[Figure: quartets (PM A, MM A, PM B, MM B) are shown for both the sense and the anti-sense strand.]

38 Probe set behavior
• CRLMM, B Carvalho et al., Biostatistics 8(2), 2007.
• Preprocess, with quantile normalization relative to a HapMap reference array.
• Take a single SNP, s. For both strands and both alleles, use RMA to summarize probes: $\theta_{A+}$, $\theta_{A-}$, $\theta_{B+}$, $\theta_{B-}$.

39 Important for accurate results
• For each strand compute a log ratio: $M_+ = \log(\theta_{A+} / \theta_{B+})$, $M_- = \log(\theta_{A-} / \theta_{B-})$.
• Clear evidence of PCR bias, with respect to probe sequence and fragment length. Effects are large and change from lab to lab. SNP bases also matter.
• Statistically adjust log intensities and log-ratios to reduce such effects.
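The strand-wise log ratios can be sketched as below. This is not the CRLMM implementation: the dictionary layout, the function name, and the choice of base 2 are assumptions, and the summarized intensities are taken here to be on the raw (not yet logged) scale.

```python
import math


def strand_log_ratios(theta):
    """theta: summarized intensities keyed by 'A+', 'A-', 'B+', 'B-'.

    Returns (M_plus, M_minus), the log2 allele-A to allele-B ratio on each strand.
    """
    m_plus = math.log2(theta["A+"] / theta["B+"])
    m_minus = math.log2(theta["A-"] / theta["B-"])
    return m_plus, m_minus


# Hypothetical summaries for one SNP in one sample.
m_plus, m_minus = strand_log_ratios({"A+": 8.0, "B+": 2.0, "A-": 6.0, "B-": 1.5})
```

Strongly positive values on both strands point to an AA genotype, strongly negative values to BB, and values near zero to AB. The PCR bias adjustment described on the slide would be applied to these quantities before genotype calling.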

40 Genotyping
• Using the HapMap (CEPH) trio data set (30 trios for which (i) Affymetrix 100K data are available, and (ii) true genotype is known), empirical Bayes Gaussian models are fit to the three observed genotype clusters.
• SNPs are filtered on signal-to-noise ratio across classes.
• Genotype new subjects based on likelihood ratios.
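Likelihood-ratio genotyping against three Gaussian clusters can be sketched as follows. This is a plain plug-in classifier on a single log ratio, not the empirical Bayes procedure itself: the cluster parameters are made up rather than estimated from the trio data, there is no shrinkage across SNPs, and the no-call threshold is hypothetical.

```python
import math


def log_normal_pdf(x, mu, sigma):
    """Log density of a normal distribution."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))


def call_genotype(m, clusters, min_log_lr=2.0):
    """Call the most likely genotype for log ratio m.

    clusters maps genotype -> (mean, sd) of the log ratio for that genotype.
    Returns None (a no-call) when the log likelihood ratio between the best
    and second-best genotype is below min_log_lr.
    """
    scored = sorted(
        ((log_normal_pdf(m, mu, sd), g) for g, (mu, sd) in clusters.items()),
        reverse=True,
    )
    (best, genotype), (second, _) = scored[0], scored[1]
    return genotype if best - second >= min_log_lr else None


# Hypothetical cluster parameters for one SNP.
clusters = {"AA": (2.0, 0.4), "AB": (0.0, 0.4), "BB": (-2.0, 0.4)}
calls = [call_genotype(m, clusters) for m in (1.9, 0.1, 1.0, -2.3)]
```

Raising `min_log_lr` trades more no-calls for fewer wrong calls, which is the trade-off summarized by an accuracy versus no-call rate curve.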

41 Genotyping accuracy vs. no-call rate

42 Summary
• Differential hybridization efficiency provides a good basis for genotyping, but:
  • Single probes are often badly behaved.
  • Signal-to-noise level for groups of probes varies strongly. Not all probe sets are serviceable.
  • Cross-hybridization is a real issue, but can be detected.
  • Efficiency of PCR varies with fragment length and sequence composition.

43 Thanks!
• EMBL, Heidelberg
  • Eugenio Mancera
  • Lars Steinmetz
  • Julien Gagneur
  • Zhenyu Xu
• CRLMM
  • Benilton Carvalho
  • Henrick Bengtsson
  • Terry Speed
  • Rafael Irizarry
• DKFZ
  • David Zhang
• EBI
  • Wolfgang Huber
  • Alessandro Brozzi
• Funding
  • National Institutes of Health (NIH)
  • Deutsche Forschungsgemeinschaft
  • Human Frontier Science Program