Genotyping requirements for complex disease studies

Grant Montgomery
Molecular Epidemiology, Queensland Institute of Medical Research, Australia

Outline
- Background
- Genetic markers
- Genome-wide association studies
- Genotyping technologies
- High quality genotypes and QC
- Interpreting the signals

The Challenge of Complex Disease
Understanding the link between:
- DNA sequence (genotype)
- Biology/disease (phenotype)
- Environment
[Figure: DNA sequence ATTCGCATGGACC with a C/A variant]

Complex Trait Model
[Figure: diagram with the labels Marker, Linkage Disequilibrium, Association, Gene 1, Gene 2, Gene 3, Mode of inheritance, Polygenic background, Common environment, Individual environment, Disease Phenotype]

DNA polymorphisms
- Minisatellites
- Microsatellites: >100,000; many alleles, (CA)n, very informative, even, easily automated
- SNPs: 10,054,521 (25 Jan 05); most with 2 alleles (up to 4), not very informative, even, easily automated
- Detecting SNPs: RFLPs, mass spectrometry, bead arrays
[Figure: three double-stranded sequences (A, B, C) illustrating a base substitution and a (CA)n repeat]

Microsatellites or short tandem repeats (STRs)
- Detected by PCR
- Multiple alleles
- Widely used in linkage analysis and forensics

STR Profiles
[Figure: STR profiles] Jobling & Gill (2004)

The Positional Cloning Problem
- Linkage to a broad chromosome region
- Difficult to define a more precise location
- Many possible genes
- Nature of the mutations/variants
- Deciding whether any variant is causal

Genetic architecture of complex genetic disorders
[Figure: effect size (very very small to large) against allele frequency (very very rare to common), with regions labelled Mendelian disorders, Highly unusual, Not detectable/Not useful, and Possible and detectable spectrum of common complex genetic disorders]

"There have been few, if any, similar bursts of discovery in the history of medical research"
Hunter DJ and Kraft P, N Engl J Med 2007; 357:436-439.
Stephen Chanock

Single Nucleotide Polymorphisms (SNPs)
GGCTTCAGAATGGCC
GGCTTCAAAATGGCC
- Single base changes
- Human SNPs = 10,054,521
- Validated SNPs = 5,054,675
- Frequency ~1 every 300 bp
- Can cause functional changes

Association studies to 2006
- Candidate regions
- Some successes, but generally:
  - Poor replication
  - Small sample sizes
- Conclusions:
  - Effect sizes are smaller than expected
  - Selection of candidate regions

Candidate gene studies in endometriosis
- Reviewed >100 papers
- Results for >60 genes
- Candidate genes chosen based on biology
- Mostly tested a few variants
- Small numbers of cases and controls (<250 individuals)
- No associations widely replicated
Montgomery et al., 2008

GWAS in humans
- Better understanding of patterns of human sequence variation
- Advances in genotyping technology
- Sample collections of adequate size
Genome-wide association scans:
- 3,000,000,000 bases in the human genome
- ~10,000,000 positions commonly variant in Europeans
- 80% of these captured by typing ~500k SNPs
- Samples of interest: test for evidence of association
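The "test for evidence of association" step can be sketched, for a single SNP, as an allelic chi-square test on case and control genotype counts. This is a minimal sketch under stated assumptions, not the analysis used in the talk: the function name and the genotype counts are hypothetical, and the 1-df p-value uses the identity p = erfc(sqrt(chi2 / 2)).

```python
import math


def allelic_chi2(case_counts, control_counts):
    """Allelic 2x2 chi-square test for one SNP.

    Each argument is a tuple (n_AA, n_AB, n_BB) of genotype counts.
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    def allele_counts(genotypes):
        n_aa, n_ab, n_bb = genotypes
        # Each individual carries two alleles.
        return 2 * n_aa + n_ab, 2 * n_bb + n_ab

    a_case, b_case = allele_counts(case_counts)
    a_ctrl, b_ctrl = allele_counts(control_counts)
    n = a_case + b_case + a_ctrl + b_ctrl
    margins = ((a_case + b_case) * (a_ctrl + b_ctrl)
               * (a_case + a_ctrl) * (b_case + b_ctrl))
    if margins == 0:
        raise ValueError("SNP is monomorphic or a group is empty")
    chi2 = n * (a_case * b_ctrl - b_case * a_ctrl) ** 2 / margins
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 1000 cases and 1000 controls.
chi2, p = allelic_chi2((300, 500, 200), (250, 500, 250))
print(f"chi2 = {chi2:.3f}, p = {p:.2e}")
```

In a genome-wide scan the same test would be repeated for each of the ~500k SNPs, which is why very small p-values are needed before a signal is taken to replication.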

Development of genome-wide association studies (GWAS)
- Risch & Merikangas, Science 1996
- Human Genome Sequence, 2003
- ~10 million SNP polymorphisms (dbSNP)
- HapMap project
  - 270 samples from 4 populations
  - >3 million validated SNPs
  - Linkage disequilibrium (LD)
- SNP chips
  - Affymetrix (500k, 1M)
  - Illumina (370k, 550k, 1M)

Haplotype Map of the Human Genome
Goals:
- Define patterns of genetic variation across the human genome
- Guide selection of SNPs to tag common variants efficiently
- Public release of all data (assays, genotypes)
Phase I: 1.3 M markers in 269 people
Phase II: +2.8 M markers in 270 people

Pairwise tagging
SNP  Alleles  Haplotypes (1 to 4)
1    A/T      A A T T
2    G/A      G G A A
3    G/C      G C G C
4    T/C      T C C C
5    G/C      G C G C
6    A/C      A C C C
Tags: SNP 1, SNP 3, SNP 6 (3 in total)
Test for association: SNP 1, SNP 3 and SNP 6, each in high r^2 with the SNPs it tags
After Carlson et al. (2004) AJHG 74:106 (Mark Daly, HapMap Consortium)
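The pairwise tagging idea can be sketched with the six SNPs and four haplotypes shown on this slide. This is an illustrative sketch, not the Carlson et al. implementation: it assumes the four haplotypes are equally frequent, uses a hypothetical r^2 threshold of 0.8, and breaks ties by taking the lowest SNP number. The function names are made up for the example.

```python
# One string per SNP, one character per haplotype (from the slide).
SNPS = {1: "AATT", 2: "GGAA", 3: "GCGC", 4: "TCCC", 5: "GCGC", 6: "ACCC"}


def r_squared(a, b):
    """r^2 between two biallelic SNPs given phased haplotype alleles.

    Assumes both SNPs are polymorphic and all haplotypes are equally weighted.
    """
    n = len(a)
    p_a = sum(x == a[0] for x in a) / n          # frequency of a reference allele at SNP a
    p_b = sum(y == b[0] for y in b) / n          # same for SNP b
    p_ab = sum(x == a[0] and y == b[0] for x, y in zip(a, b)) / n
    d = p_ab - p_a * p_b                         # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def greedy_tags(snps, threshold=0.8):
    """Pick tag SNPs greedily: each tag captures all SNPs with r^2 >= threshold."""
    untagged = set(snps)
    tags = {}
    while untagged:
        # Choose the SNP that captures the most untagged SNPs (lowest number on ties).
        best = max(
            sorted(untagged),
            key=lambda s: sum(r_squared(snps[s], snps[t]) >= threshold for t in untagged),
        )
        captured = {t for t in untagged if r_squared(snps[best], snps[t]) >= threshold}
        tags[best] = sorted(captured)
        untagged -= captured
    return tags


print(greedy_tags(SNPS))
```

With these haplotypes the sketch selects three tags, as on the slide. It reports SNP 4 where the slide shows SNP 6; the two are interchangeable here because their r^2 is 1.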

Progress in genotyping technology
[Figure: cost per genotype in cents (USD), on a log scale from 10^2 down to 0.1, against number of SNPs (1 to 10^6), 2001 to 2007. Platforms shown: ABI TaqMan, ABI SNPlex, Sequenom, PyroSeq, Illumina Golden Gate, Affymetrix 10K, Perlegen, Affymetrix 100K/500K, Illumina Infinium/Sentrix]
Stephen Chanock

Genome-wide association
[Figure: staged design. Genome-wide association with >500k SNPs, then >1-30k SNPs, with three replication steps]
Hirschhorn & Daly, Nat. Rev. Genet. 6: 95, 2005
NCI-NHGRI Working Group on Replication, Nature 447: 655, 2007

SNP Genotyping Platforms
[Figure: platforms (TaqMan 7900, Sequenom MassARRAY, Illumina BeadStation) arranged by throughput in SNPs per assay (1, 35, >7500), cost per assay, and flexibility in project design]

Sequenom MassARRAY
- Medium throughput
- Primer extension
- Detection by mass spectrometry
- 30-35 assays per sample
- 384 to many 1000s of samples

Sequenom SNP Platform
Multiple or single base primer extension chemistry
[Figure: for allele 1 and allele 2 (templates CTA and GTA), an EXTEND primer (23-mer) is extended with enzyme, ddGTP/ddATP and dCTP/dTTP to give an extended primer of 24-mer or 25-mer; axis ticks 21 to 28]
4-level specificity:
1. PCR: two primers
2. Extension primer hybridization
3. Primer extension traps the event
4. Mass resolution: expected masses
Unambiguous, high-confidence results

Example of a 25-plex assay using iPLEX on the Compact MassARRAY
Lowers the cost per genotype to under USD 0.06

Illumina BeadStation
- Linkage mapping
- Custom genotyping: 1536 60,0000 SNPs
- Genome-wide association
- Gene expression

Whole Genome Genotyping: Infinium

Human610-QUAD BeadChip Coverage
- CEU: U.S. residents with ancestry from N and W Europe, collected in 1980 by the Centre d'Etude du Polymorphisme Humain (CEPH)
- CHB: Japan, China
- YRI: Nigeria (Yoruba)

Custom SNP Set
Custom genotyping, 96-well format
First custom SNP set: 1536 SNPs
- 1482 tag SNPs
- 225 coding SNPs
- 39 double tag SNPs in larger SNP bins

Illumina BeadStation 500
- 1 million markers across all chromosomes
- Comparison in four MZ twin pairs
- Mean error rate: 6 SNPs in 1.06 million calls
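An MZ twin comparison of this kind can be sketched as a count of discordant calls between co-twins. The sketch below is illustrative only: the function name, the two-letter genotype coding and the "--" missing code are assumptions, and the toy calls are hypothetical. The last lines convert the figure quoted on the slide into a per-call rate.

```python
def discordance(calls_a, calls_b, missing="--"):
    """Count discordant genotype calls between two samples expected to be identical.

    Genotypes are two-letter strings such as "AB"; allele order is ignored.
    Returns (discordant, compared), skipping SNPs missing in either sample.
    """
    compared = discordant = 0
    for a, b in zip(calls_a, calls_b):
        if a == missing or b == missing:
            continue
        compared += 1
        if sorted(a) != sorted(b):
            discordant += 1
    return discordant, compared


# Hypothetical calls for one twin pair at five SNPs.
twin_1 = ["AA", "AB", "BB", "--", "AB"]
twin_2 = ["AA", "BA", "AB", "BB", "AB"]
print(discordance(twin_1, twin_2))

# Rate implied by the slide: 6 discordant SNPs in 1.06 million calls.
slide_rate = 6 / 1.06e6
print(f"{slide_rate:.2e} errors per call")
```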

Producing High Quality Genotypes
- Minimum finished genotypes (>98.5%)
- Quality of DNA
  - Measure concentrations
  - Dispense in large volumes
- Assay design
  - Repeat sequences
  - SNPs in primer sequences
- Quality of assays
  - Check cluster plots
  - Test for Hardy-Weinberg equilibrium
- Analysis of SNP data is particularly sensitive to assay problems
  - Genotype failures are not random
  - Heterozygous individuals fail most often
  - All SNP typing platforms
- Include controls and check error rates
  - Check controls
  - Repeat assays
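The Hardy-Weinberg check listed above can be sketched as a 1-df chi-square goodness-of-fit test on genotype counts. This is a minimal sketch with a hypothetical function name and hypothetical counts; an exact test is often preferred when counts are small. The second example shows why the test is useful here: if heterozygotes fail preferentially, the observed counts show a heterozygote deficit.

```python
import math


def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square test for Hardy-Weinberg equilibrium at one biallelic SNP.

    Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("SNP is monomorphic")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


print(hwe_chi2(250, 500, 250))   # counts in exact HWE proportions
print(hwe_chi2(300, 400, 300))   # heterozygote deficit
```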

Producing High Quality Genotypes
- Sample collection
- Sample storage and tracking
- Laboratory technique
- Mixed samples
- Data interpretation
- True mixtures

Standard Blood Collection and Processing
Samples are collected in the following tubes:
- 2 x EDTA
- 1 x serum
- 1 x ACD
- 1 x PAX
- 1 x buccal
The 2 x EDTA and 1 x serum tubes are centrifuged at 3000 rpm for 10 min and then the fractions are collected.
All fractions and 1 x buffy coat are stored in the -80 °C freezers.
[Flow diagram labels: MNC processing; buccal extraction; 4 x red blood cells; 4 x plasma; 4 x serum; 2 x buffy coats; 1 x buffy coat extraction; stored in freezer for later RNA work]

DNA Quantitation
- Stock DNA
- 1:5 dilution: 100 ul stock + 400 ul 1 x TE
- 1:100 dilution: 5 ul of the 1:5 dilution + 495 ul 1 x TE, in a 96 deep-well plate
- 50 ul of the 1:100 dilution transferred to black OptiPlates in duplicate, alongside known DNA standards
- Picogreen added to plates and standards; fluorescence detected by Fluoroskan Ascent
- Based on the Fluoroskan results, the 1:5 dilution is modified to 50 ng/ul by addition of more buffer or stock
- Result: new DNA dilution at 50 ng/ul (500 ul+), plus remaining stock DNA (300 ul)
Expensive, but costs are offset by savings from better quality genotypes, less DNA used and reduced reaction volumes
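The adjustment of the working dilution to 50 ng/ul "by addition of more buffer or stock" is a mass-balance calculation, sketched below. The function name, the example concentrations and the example volume are hypothetical; the stock concentration is assumed to be known when stock has to be added.

```python
def adjust_to_target(conc, volume, target=50.0, stock_conc=None):
    """Work out what to add so a dilution reaches the target concentration.

    conc, target and stock_conc are in ng/ul; volume is in ul.
    Returns ("buffer", ul_to_add) or ("stock", ul_to_add).
    """
    if conc >= target:
        # Too concentrated: dilute with buffer. conc * volume = target * (volume + x)
        return "buffer", volume * (conc / target - 1)
    if stock_conc is None or stock_conc <= target:
        raise ValueError("stock must be more concentrated than the target")
    # Too dilute: add stock. conc * volume + stock_conc * x = target * (volume + x)
    return "stock", volume * (target - conc) / (stock_conc - target)


# Hypothetical examples with 400 ul of working dilution.
print(adjust_to_target(80.0, 400.0))                     # add buffer
print(adjust_to_target(30.0, 400.0, stock_conc=150.0))   # add stock
```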

Multiplex Assays Must Be Tested
Poor markers redesigned

Genotyping artifacts
[Figure: the primer extension scheme from the Sequenom SNP Platform slide (EXTEND primer 23-mer, extended to a 24-mer or 25-mer), here with a base change (SNP) under the primer site]

Producing High Quality Genotypes Null Allele?

Producing High Quality Genotypes Plate variation

Genotype Quality Control
[Figure: panels for Control Group 1, Control Group 2 and Case Group]

CNVs in MZ Twins
CNV analysis of one twin pair showing a 1.6 Mb deletion on chromosome 2
Bruder et al. (2008) AJHG 82, 1 9

DNA mixtures
- Mixed samples
- Blood transfusions
- Chimeras
  - Rare cases share cells with co-twin in utero
  - Blood chimeras
  - True chimeras

DNA Mixtures Science 308: 1864 24 June 2005

The 'semi-identical' twins are the result of two sperm cells fusing with a single egg, a previously unreported way for twins to come about. The twins are chimaeras, meaning that their cells are not genetically uniform. Each sperm has contributed genes to each child.
news@nature.com

Allele sharing in chimeric twins
Golden Gate, 6008 SNPs
                      Father   Mother
Heterozygous markers  779      675
Shared alleles        52.1%    100%

Possible Mechanisms
(A) The three-gamete model: immediate cleavage secondary to parthenogenetic activation of the egg, followed by fertilization of the identical cells formed by two different sperm containing different sex chromosomes.
(B) Dispermic fertilization of an ovum, followed by postzygotic diploidization of triploids, a concept postulated by Golubovsky (2003).

Wellcome Trust Sanger Institute SNP QC

Genotype Quality Control
All SNPs which exhibit phenotype association(s) should have their hybridization intensity cluster plots manually examined for potential biases or failures. This check can halve the false positive rate and reduce the cost of a replication experiment.
Each plot is inspected for:
1. Over-dispersion of the genotype clusters or overlap
2. Biased no-calling
3. Erroneous genotype assignment
A SNP failing any of the above QC criteria is excluded from further analyses.
WTSI QC Pipeline

Tag SNPs probably not the causal variants
[Figure: several marker SNPs around a disease allele within a gene; SNP association with the disease allele]
- Linkage and LD assume markers have an indirect association with the trait
- Large SNP collections and cheaper genotyping may allow testing for direct, physiologically relevant associations with the trait

Human OCA2 and blue/brown eye colour
A three-SNP haplotype in the first intron of OCA2 explains most human eye colour variation
Zhu et al., Twins Res 7:197-210 (2004)
Duffy et al., AJHG Feb, 2007

Single variant upstream of OCA2 determines eye colour
[Figure: eye colour frequencies (0 to 1) by rs12913832 genotype (C/C, T/C, T/T); map of OCA2 and HERC2 with a 21 kb interval marked]
Sturm et al., AJHG 82: 424-431, 2008

A single SNP within intron 86 of HERC2 determines blue-brown eye colour
- rs12913832 C = blue
- rs12913832 T = brown
[Figure label: HLTF]
Sturm et al., AJHG 82: 424-431, 2008
Sulem et al., Nat Genet 39: 1443, 2007
Kayser et al., AJHG 82: 411-423, 2008
Eiberg et al., Hum Genet 123: 177-187, 2008

Block III Log(p) values for Illumina SNPs located in a 2 Mb region centred around the 122 Kb block III (marked by solid vertical lines)

High throughput sequencing
- Whole genome and targeted resequencing
- Discovery of rare variants
- Additional SNP variation
- Copy number variations and chromosomal rearrangements

DNA Requirements
[Figure: amount of DNA required for 1, 30 and 600k SNPs and for genome sequence]
Some sequencing applications might require >20 µg

Conclusions
- Rapid advances in genome technologies
- Accurate high throughput SNP typing platforms
- Discovery of many genes/variants contributing to risk for common diseases
- Errors and artefacts still occur
- Careful QC from sample collection to data analyses
- Typical data set (4000 individuals, 600k SNPs): 2.4 x 10^9 genotypes
- A good quality data set comes thanks to good lab people
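The size of the typical data set, and why even very low error rates still matter at this scale, can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: it assumes, purely for illustration, that the MZ twin error rate quoted for the BeadStation (6 in 1.06 million calls) and the minimum finished-genotype rate (98.5%) apply uniformly to every call.

```python
individuals = 4000
snps = 600_000

total_genotypes = individuals * snps            # 2.4 x 10^9 genotypes

# Illustrative assumption: error rate from the MZ twin comparison applies everywhere.
error_rate = 6 / 1.06e6
expected_errors = round(total_genotypes * error_rate)

# Illustrative assumption: data set sits exactly at the 98.5% completion minimum.
min_call_rate = 0.985
max_missing = round(total_genotypes * (1 - min_call_rate))

print(f"total genotypes:        {total_genotypes:,}")
print(f"expected wrong calls:   {expected_errors:,}")
print(f"missing calls at 98.5%: {max_missing:,}")
```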