Supplementary Methods Illumina Genome-Wide Genotyping Single SNP and Microsatellite Genotyping. Supplementary Table 4a Supplementary Table 4b


Illumina Genome-Wide Genotyping. All Icelandic case and control samples were assayed with Infinium HumanHap300 SNP chips (Illumina, San Diego, CA, USA), which contain 317,503 haplotype-tagging SNPs derived from phase I of the International HapMap project. Of the SNPs assayed on the chip, 162 generated no genotypes and an additional 178 had a yield lower than 90%. Forty-eight SNPs were monomorphic and 107 others were nearly monomorphic (i.e., the minor allele frequency in the combined cohort of patients and controls was less than 0.001). An additional 475 SNPs showed very significant deviation from Hardy-Weinberg equilibrium in the controls ($p < 1 \times 10^{-10}$). Lastly, a few markers (n = 18) were determined to have genotyping problems after investigation of particular regions and possible signals in several ongoing in-house genome-wide association studies. After these exclusions, the final analyses presented in the text use 316,515 SNPs. Samples with a call rate below 98% were excluded from the analysis.

Single SNP and Microsatellite Genotyping. Single SNP genotyping for all samples was carried out at deCODE Genetics in Reykjavik, Iceland, applying the same platform to all populations studied. SNP genotyping was carried out on the Centaurus (Nanogen) platform [1] (see Supplementary Table 4a for the assays used for genotyping of rs and the 14 SNPs of HapC). The quality of each Centaurus SNP assay was evaluated by genotyping the assay in the CEU and/or YRI HapMap samples and comparing the results with the HapMap data. Assays with a mismatch rate above 1.5% were not used, and a linkage disequilibrium (LD) test was applied to markers known to be in LD. For key markers we re-genotyped more than 10% of samples and observed a mismatch rate lower than 0.5%.
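The SNP exclusion criteria above can be illustrated with a minimal sketch. The thresholds (yield below 90%, minor allele frequency below 0.001, Hardy-Weinberg p below 1 x 10^-10) are taken from the text; the function names are hypothetical, and the one-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium is an assumption, since the text does not state which test was used.

```python
import math

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium.

    Assumption: the source does not say which HWE test was applied;
    the chi-square goodness-of-fit test is used here for illustration.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele a
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    if min(expected) == 0:                   # monomorphic marker: no test possible
        return 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

def passes_qc(yield_rate, maf, hwe_p,
              min_yield=0.90, min_maf=0.001, min_hwe_p=1e-10):
    """Apply the three per-SNP thresholds quoted in the text."""
    return yield_rate >= min_yield and maf >= min_maf and hwe_p >= min_hwe_p

# Bookkeeping check of the counts reported in the text.
excluded = 162 + 178 + 48 + 107 + 475 + 18
remaining_snps = 317503 - excluded
print(remaining_snps)
```

The subtraction at the end reproduces the 316,515 SNPs reported as used in the final analyses.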
The following microsatellite (MS) markers were typed on the US African American study group in order to estimate genetic ancestry: D1S2630, D1S2847, D1S466, D1S493, D2S166, D3S1583, D3S4011, D3S4559, D4S2460, D4S3014, D5S1967, DG5S802 (see Supplementary Table 4b), D6S1037, D8S1719, D8S1746, D9S1777, D9S1839, D9S2168, D10S1698, D11S1321, D11S4206, D12S1723, D13S152, D14S588, D17S1799, D17S745, D18S464, D19S113, D20S878 and D22S1172 (see Amundadottir et al. [2] for a detailed description of the MS genotyping methods and quality control).

Association Analysis. For both single-marker and haplotype analyses, the main results were calculated assuming a multiplicative model for risk, i.e., that the risks of the two alleles/haplotypes a person carries multiply. For example, if RR is the risk of A relative to a, then the risk of a homozygote AA is RR times that of a heterozygote Aa and $\mathrm{RR}^2$ times that of a homozygote aa. The multiplicative model has a nice property that simplifies analysis and computations: haplotypes are independent, i.e., in Hardy-Weinberg equilibrium, within the affected population as well as within the control population. As a consequence, the allele/haplotype counts of the affecteds and of the controls each have a multinomial distribution, but with different haplotype frequencies under the alternative hypothesis. Specifically, for two haplotypes $h_i$ and $h_j$, $\mathrm{risk}(h_i)/\mathrm{risk}(h_j) = (f_i/p_i)/(f_j/p_j)$, where $f$ and $p$ denote frequencies in the affected population and in the control population, respectively. While there is some power loss if the true model is not multiplicative, the loss tends to be mild except in extreme cases. Most importantly, p-values are always valid since they are computed with respect to the null hypothesis. In general, haplotype frequencies are estimated by maximum likelihood, and tests of differences between cases and controls are performed using a generalized likelihood ratio test [3]. As demonstrated below, this method can also be used when single-marker association is of interest but some genotypes are missing for the marker of interest, and genotypes of another marker are used to provide partial information. Our haplotype analysis program NEMO, which stands for NEsted MOdels, was used to calculate all the haplotype results presented.
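For a single biallelic marker with complete genotype data, the relation risk(h_i)/risk(h_j) = (f_i/p_i)/(f_j/p_j) and the generalized likelihood ratio test reduce to a simple calculation on allele counts. The sketch below illustrates only that special case; it is not NEMO, which works on the observed data with phase uncertainty and missing genotypes. The function names and the example counts are hypothetical.

```python
import math

def allelic_rr(f, p):
    """Risk of allele A relative to allele a under the multiplicative model.

    f and p are the frequencies of A in affecteds and controls, so that
    RR = (f / p) / ((1 - f) / (1 - p)).
    """
    return (f / (1.0 - f)) / (p / (1.0 - p))

def binom_loglik(k, n, theta):
    """Binomial log-likelihood without the constant combinatorial term."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(theta)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - theta)
    return ll

def allele_lrt(case_a, case_n, ctrl_a, ctrl_n):
    """Likelihood ratio test of equal allele frequency in cases and controls.

    Counts are chromosomes (two per genotyped individual). Returns the
    estimated RR, the test statistic and its 1-df chi-square p-value.
    """
    f = case_a / case_n
    p = ctrl_a / ctrl_n
    pooled = (case_a + ctrl_a) / (case_n + ctrl_n)   # null: one shared frequency
    ll_alt = binom_loglik(case_a, case_n, f) + binom_loglik(ctrl_a, ctrl_n, p)
    ll_null = (binom_loglik(case_a, case_n, pooled)
               + binom_loglik(ctrl_a, ctrl_n, pooled))
    stat = max(2.0 * (ll_alt - ll_null), 0.0)
    pval = math.erfc(math.sqrt(stat / 2.0))           # chi-square, 1 df
    return allelic_rr(f, p), stat, pval

# Hypothetical counts: 300 of 2000 case chromosomes and 200 of 2000
# control chromosomes carry allele A.
rr, stat, pval = allele_lrt(300, 2000, 200, 2000)
print(rr, stat, pval)
```

Because the p-value is computed under the null hypothesis of equal frequencies, it remains valid even when the true risk model is not multiplicative, as noted in the text.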
To handle uncertainty in phase and missing genotypes, we emphasize that we do not use the common two-step approach to association testing, in which haplotype counts are first estimated, possibly with the EM algorithm, and tests are then performed treating the estimated counts as though they were true counts. That approach can be problematic and may require randomization to evaluate statistical significance properly. Instead, with NEMO, maximum likelihood estimates, likelihood ratios and p-values are computed directly for the observed data, and hence the loss of information due to uncertainty in phase and missing genotypes is automatically captured by the likelihood ratios.

Haplotype Groups and Nested Models. When investigating haplotypes constructed from multiple markers, apart from looking at each haplotype individually, meaningful summaries often require more complex risk models. A series of bipartitions, each of which assigns risk 1 to one part of the haplotype space and a risk different from 1 to the other part, forms a model by assigning to each haplotype a risk equal to the product of the risks associated with the part of each bipartition to which the haplotype belongs. More precisely, let $A_1, A_2, \ldots, A_n$ be proper non-empty subsets of the haplotype space and $r_1, r_2, \ldots, r_n$ be the corresponding risk values; then for each haplotype $h$ the model assigns risk $s_1 s_2 \cdots s_n$ to $h$, where, for $i = 1, 2, \ldots, n$, $s_i$ is $r_i$ if $h$ is in $A_i$ and 1 otherwise. Two models, each defined by a series of bipartitions and risk values, are nested when one series, the null model, is a sub-series of the other, the alternative model, i.e., the alternative model allows some haplotypes assumed to have the same risk in the null model to have different risks. The models are nested in the classical sense that the null model is a special case of the alternative model. Hence traditional generalized likelihood ratio tests can be used to test the null model against the alternative model. The degrees of freedom of the test equal the difference between the number of elements in the series under the alternative hypothesis and under the null hypothesis. Note that, with a multiplicative model, assuming that haplotypes $h_i$ and $h_j$ have the same risk corresponds to assuming that $f_i/p_i = f_j/p_j$, where $f$ and $p$ denote haplotype frequencies in the affected population and the control population, respectively. NEMO allows complete flexibility in the construction of such models.
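The definition of a model as a series of subsets with attached risks can be made concrete with a short sketch. It only shows how a haplotype's risk and the degrees of freedom follow from the definition; it does not fit anything to data. The representation of haplotypes as strings, the function names and the risk values are all hypothetical.

```python
from math import prod

def haplotype_risk(haplotype, model):
    """Risk assigned to a haplotype by a series of (subset, risk) pairs.

    Each pair contributes its risk r_i if the haplotype lies in the
    subset A_i and 1 otherwise; the contributions multiply.
    """
    return prod(r if haplotype in subset else 1.0 for subset, r in model)

def is_nested(null_model, alt_model):
    """True if every subset of the null series also appears in the alternative."""
    alt_subsets = [subset for subset, _ in alt_model]
    return all(subset in alt_subsets for subset, _ in null_model)

def lrt_degrees_of_freedom(null_model, alt_model):
    """Difference in the number of elements of the two series."""
    return len(alt_model) - len(null_model)

# Hypothetical two-marker haplotype space; the first character is the
# allele at marker 1 and the second the allele at marker 2.
carries_a_at_1 = frozenset({"AA", "AC"})
carries_a_at_2 = frozenset({"AA", "CA"})

null_model = [(carries_a_at_1, 1.5)]                        # illustrative risk
alt_model = [(carries_a_at_1, 1.5), (carries_a_at_2, 1.2)]  # overlapping subsets

for h in ("AA", "AC", "CA", "CC"):
    print(h, haplotype_risk(h, alt_model))
print(is_nested(null_model, alt_model),
      lrt_degrees_of_freedom(null_model, alt_model))
```

Because the two subsets overlap, the haplotype carrying both alleles receives the product of the two risks, which is the situation the text describes for multiplicative effects of multiple variants.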
For example, in Table 2, the results for allele A of rs , adjusting for the risk associated with allele A of rs , were generated through haplotype analyses that involved not only the SNPs rs and rs but also the 14 SNPs that form HapC. Specifically, under the null hypothesis any haplotype that contained the A allele of rs was assigned a common risk, r, while haplotypes that did not contain the A allele of rs were given risk 1. Thus, under the null hypothesis r was the risk of haplotypes carrying the A allele of rs , relative to the haplotypes that do not. Under the alternative hypothesis haplotypes carrying the A allele of rs but not the A allele of rs were again assigned risk r, but haplotypes that do not carry the A allele of rs but do carry the A allele of rs were assigned risk s, and haplotypes carrying the A allele of rs and the A allele of rs were assigned risk rs. Haplotypes carrying neither the A allele of rs nor the A allele of rs were assigned risk 1. We note that for this analysis the SNPs in HapC have no effect on risk, and that the main reason for including the HapC SNPs is to ensure that the same data/individuals were used to generate results for all tests, although these SNPs may also provide some information about phase and improve information when genotypes are missing. It is worth noting that when we first introduced NEMO [4], it could only incorporate models in which the subsets of haplotypes, as defined above, are non-overlapping. That restriction has now been eliminated, which, among other things, allowed us to fit models in which the risks of multiple variants are assumed to be multiplicative.

Two-variant Models. When the effects of rs A (or HapC) and rs A are evaluated jointly, there are four possible haplotypes: (A,A), (A,C), (C,A) and (C,C). Under the simplifying assumption that the risks of the two haplotypes an individual carries multiply, a full model on the haplotype level has 3 degrees of freedom, which can be parameterized as $\mathrm{Risk}(A,A)/\mathrm{Risk}(C,C)$, $\mathrm{Risk}(A,C)/\mathrm{Risk}(C,C)$ and $\mathrm{Risk}(C,A)/\mathrm{Risk}(C,C)$. Supplementary Figure 1 shows the estimates (RR) that resulted from fitting this model, combining the results from the four case-control groups of European descent. By contrast, a multiplicative model for the joint risk of the two variants has two degrees of freedom. The reduction by one degree of freedom comes from the constraint that $\mathrm{Risk}(A,A)/\mathrm{Risk}(C,C) = [\mathrm{Risk}(A,C)/\mathrm{Risk}(C,C)] \times [\mathrm{Risk}(C,A)/\mathrm{Risk}(C,C)]$. The estimates in Table 2 and of $\mathrm{RR}_m$ in Supplementary Figure 1 resulted from fitting this model. As can be seen from Supplementary Figure 1, RR and $\mathrm{RR}_m$ are not very different from each other, indicating that the multiplicative model fits the data adequately.
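The difference between the full model and the multiplicative two-variant model can be sketched numerically. The example below assumes known haplotype frequencies (no phase uncertainty) and uses the relation risk(h_i)/risk(h_j) = (f_i/p_i)/(f_j/p_j) from the Association Analysis section; the control frequencies and the risks r and s are made-up illustrative values, not estimates from the study, and the function names are hypothetical.

```python
def case_frequencies(ctrl_freq, risk):
    """Case haplotype frequencies implied by control frequencies and risks.

    Follows from f_i being proportional to p_i * risk(h_i).
    """
    weights = {h: ctrl_freq[h] * risk[h] for h in ctrl_freq}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

def relative_risks(case_freq, ctrl_freq, ref):
    """Full-model relative risks of each haplotype versus the reference."""
    base = case_freq[ref] / ctrl_freq[ref]
    return {h: (case_freq[h] / ctrl_freq[h]) / base
            for h in ctrl_freq if h != ref}

# Illustrative control frequencies for the four two-variant haplotypes.
ctrl = {("A", "A"): 0.02, ("A", "C"): 0.08, ("C", "A"): 0.10, ("C", "C"): 0.80}

# Risks under a multiplicative two-variant model with hypothetical r and s.
r, s = 1.5, 1.2
risk = {("A", "A"): r * s, ("A", "C"): r, ("C", "A"): s, ("C", "C"): 1.0}

cases = case_frequencies(ctrl, risk)
rr = relative_risks(cases, ctrl, ref=("C", "C"))   # three free parameters

# Multiplicative prediction for (A,A) from the two single-variant risks.
rr_m = rr[("A", "C")] * rr[("C", "A")]
print(rr[("A", "A")], rr_m)
```

When the data are generated under the multiplicative model, as here, the full-model estimate for (A,A) coincides with the product of the other two relative risks; with real data the two differ, and the size of that difference is what the comparison of RR and RR_m in Supplementary Figure 1 assesses.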

Evaluation of genetic ancestry. We used the program Structure [5] to estimate the genetic ancestry of individuals. Structure infers the allele frequencies of K ancestral populations on the basis of multilocus genotypes from a set of individuals and a user-specified value of K, and assigns a proportion of ancestry from each of the inferred K populations to each individual. The analysis of our data set was run with K = 2, with the aim of identifying the proportion of African and European ancestry in each individual. The statistical significance of the difference in mean European ancestry between African American patients and controls was evaluated by reference to a null distribution derived from 10,000 randomized datasets. To evaluate the genetically estimated ancestry of the case-control groups from the US, we selected 30 unlinked microsatellite markers from about 2000 microsatellites genotyped in a previously described [5] multi-ethnic cohort of 35 European Americans, 88 African Americans, 34 Chinese, and 29 Mexican Americans. Of the 2000 microsatellite markers, the selected set showed the most significant differences between European Americans, African Americans, and Asians, and also had good quality and yield.

PCR screening of cDNA libraries. To confirm the expression of the spliced ESTs (BU852210, BU and DB093012) within the 99 kb LD block, we screened commercially available cDNA libraries and libraries generated at deCODE. The commercial libraries screened were the Prostate Marathon-Ready cDNA library (Clontech Cat ) and the Bone marrow-ready cDNA library (Clontech Cat ). In addition, cDNA libraries were constructed from fresh-frozen normal and tumor prostate tissue, whole blood and EBV-transformed human lymphoblastoid cells. Total RNA was isolated from the lymphoblastoid cell lines and whole blood using the RNeasy RNA isolation kit (Qiagen Cat ) and the RNeasy RNA isolation from whole blood kit (Cat ), respectively.
RNA was isolated from fresh-frozen prostate tissue sections by homogenizing 30 mg of tissue in Tri Reagent from Molecular Research Centre (Cat. TR 118), followed by RNA isolation according to the manufacturer's instructions. RNA was subsequently analyzed and quantified using the Agilent 2100 Bioanalyzer. cDNA libraries were prepared at deCODE using a random hexamer protocol from the RevertAid™ H Minus First Strand cDNA Synthesis Kit (Fermentas Cat. K1631). The PCR reactions were done in a 10 µl volume at a final concentration of 3.5 µM of forward and reverse primers, 2 mM dNTP, 1x Advantage 2 PCR buffer and 0.5 µl of cDNA library. PCR screening was carried out using the Advantage 2 PCR Enzyme RT-PCR System (Clontech) according to the manufacturer's instructions and using PCR primers from Operon Biotechnologies (see Supplementary Table 4c). Expression of the spliced ESTs was not detected in any of the libraries.

REFERENCES
1. Kutyavin, I.V. et al. A novel endonuclease IV post-PCR genotyping system. Nucleic Acids Research 34, e128 (2006).
2. Amundadottir, L.T. et al. A common variant associated with prostate cancer in European and African populations. Nat Genet 38, (2006).
3. Rice, J. Mathematical Statistics and Data Analysis, (Wadsworth Inc., Belmont, CA, 1995).
4. Gretarsdottir, S. et al. The gene encoding phosphodiesterase 4D confers risk of ischemic stroke. Nat Genet 35, (2003).
5. Pritchard, J.K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, (2000).