A genome-wide association study of metabolic traits in human urine


Supplementary material for
A genome-wide association study of metabolic traits in human urine
Suhre et al.

CONTENTS

SUPPLEMENTARY FIGURES
Supplementary Figure 1: Regional association plots surrounding the loci reported in Table 1.
Supplementary Figure 2: QQ plots for the metabolic traits reported in Table 1.
Supplementary Figure 3: Box plots of the metabolic traits that associate most strongly with the primary associated SNP at the loci reported in Table 1.

SUPPLEMENTARY TABLES
Supplementary Table 1: MS Excel data file providing 15,475 associations that are significant at the 5% level after correcting for testing 1,720 metabolic traits at a single locus (p < 2.9×10⁻⁵) and that have a p-gain > 59 in the case of ratios.
Supplementary Table 2: MS Excel data file providing associations based on 1000 Genomes imputed genotypes at the five loci reported in Table 1.

SUPPLEMENTARY NOTE
Description of the quality control questions of the SHIP genotyping effort, additional tests of robustness, and further information for the reported associations.

(a) Regional association plot for SNP rs37369 with 3-Aminoisobutyrate.
(b) Regional association plot for SNP rs with Formate / Succinate.

(c) Regional association plot for SNP rs with 2-Hydroxyisobutyrate.
(d) Regional association plot for SNP rs with Lysine / Valine.

(e) Regional association plot for SNP rs with Alanine / N,N-Dimethylglycine.

Supplementary Figure 1: Regional association plots surrounding the loci reported in Table 1. The statistical significance of the associated SNPs at each locus is shown on the −log10(p) scale as a function of chromosomal position (NCBI build 36). The primary associated SNP at each locus is shown in red. The correlation of the primary SNP with other SNPs at the locus is shown on a scale from minimal (white) to maximal (bright red). Estimated recombination rates from HapMap and RefSeq annotations are shown. These plots were generated using the SNAP web server (based on HapMap release 22); accessed at on

Supplementary Figure 2: QQ plots, panels (a) to (e).


Supplementary Figure 3: Box plots of the metabolic traits that associate most strongly with the primary associated SNP at the loci reported in Table 1, plotted as a function of genotype and experiment (from light to dark grey: major allele homozygotes, heterozygotes, minor allele homozygotes). Boxes extend from the 1st quartile (Q1) to the 3rd quartile (Q3); the median is indicated as a horizontal line; whiskers are drawn to the observation that is closest to, but not more than, a distance of 1.5 × (Q3 − Q1) from the end of the box. Observations that are more distant than this are shown individually on the plots. 3-Aminoisobutyrate and 2-hydroxyisobutyrate are reported in mmol/mol creatinine; the metabolite ratios have no units. The number of individuals per group is indicated above the boxes.
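The whisker rule described in the caption of Supplementary Figure 3 can be sketched as follows. This is a minimal illustration and not the plotting code used for the figure: the function name `box_stats`, the choice of quartile interpolation, and the example values are all assumptions.

```python
from statistics import median, quantiles

def box_stats(values):
    """Return box, whisker ends and individually plotted points for one group.

    Whiskers end at the most extreme observations that lie no further than
    1.5 * (Q3 - Q1) from the box; anything beyond is returned as an outlier.
    The 'inclusive' quartile method is an assumption; plotting packages differ.
    """
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    reach = 1.5 * (q3 - q1)
    inside = [v for v in data if q1 - reach <= v <= q3 + reach]
    outliers = [v for v in data if v < q1 - reach or v > q3 + reach]
    return {
        "q1": q1,
        "median": median(data),
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up concentrations for one genotype group (illustration only).
example = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(example)
```

In this example the value 50 lies more than 1.5 × (Q3 − Q1) above the box, so it would be drawn as an individual point and the upper whisker would stop at 9.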

Supplementary note

Description of the quality control questions of the SHIP genotyping effort and additional tests of robustness concerning the reported associations.

Acknowledgements

SHIP is part of the Community Medicine Research net of the University of Greifswald, Germany, which is funded by the Federal Ministry of Education and Research (grants no. 01ZZ9603, 01ZZ0103, and 01ZZ0403), the Ministry of Cultural Affairs, and the Social Ministry of the Federal State of Mecklenburg-West Pomerania. Genome-wide data have been supported by the Federal Ministry of Education and Research (grant no. 03ZIK012) and a joint grant from Siemens Healthcare, Erlangen, Germany and the Federal State of Mecklenburg-West Pomerania. The University of Greifswald is a member of the Center of Knowledge Interchange program of the Siemens AG. This work is also part of the research project Greifswald Approach to Individualized Medicine (GANI_MED). The GANI_MED consortium is funded by the Federal Ministry of Education and Research and the Ministry of Cultural Affairs of the Federal State of Mecklenburg-West Pomerania (03IS2061A).

Study population

The Study of Health in Pomerania (SHIP) is a cross-sectional survey in West Pomerania, the north-east area of Germany (for references see main manuscript). A sample from the population aged 20 to 79 years was drawn from population registries. First, the three cities of the region (with 17,076 to 65,977 inhabitants) and the 12 towns (with 1,516 to 3,044 inhabitants) were selected, and then 17 out of 97 smaller towns (with fewer than 1,500 inhabitants) were drawn at random. Second, from each of the selected communities, subjects were drawn at random, proportional to the population size of each community and stratified by age and gender. Only individuals with German citizenship and main residency in the study area were included. Finally, 7,008 subjects were sampled, with 292 persons of each gender in each of the twelve five-year age strata.

To minimize dropouts due to migration or death, subjects were selected in two waves. The net sample (without migrated or deceased persons) comprised 6,267 eligible subjects. Selected persons received a maximum of three written invitations. In case of non-response, letters were followed by a phone call, or by home visits if contact by phone was not possible. The SHIP population finally comprised 4,308 participants (corresponding to a final response of 68.8%).

Genotyping

The SHIP samples were genotyped using the Affymetrix Human SNP Array 6.0. Hybridisation of genomic DNA was done in accordance with the manufacturer's standard recommendations. The genetic data analysis workflow was created using the software InforSense. Genetic data were stored using the database Caché (InterSystems). Genotypes were determined using the Birdseed2 clustering algorithm. For quality control purposes, several control samples were added. On the chip level, only subjects with a genotyping rate on QC probe sets (QC call rate) of at least 86% were included. Finally, all arrays had a sample call rate > 92%. The overall genotyping efficiency of the GWA was %.

Testing for nonrandom missingness

[Figure A plot "SNPs with NoCall on 4081 Arrays"; x axis: number of arrays with NoCall; y axis: number of SNPs; series: observed and expected]

Figure A: Distribution of the number of SNPs with observed and expected NoCalls under a Poisson distribution assuming random missingness, using all 4081 Affymetrix arrays that passed quality checks. The expected distribution of NoCall SNPs appears as the Gaussian-like curve. The x axis has been limited to 100 arrays.

The Poisson model does not fit the observed NoCall rate well, suggesting that the array call-rate error results from a mixture of different SNP-specific call rates. This was supported by the observation that some SNPs had NoCalls on almost all of the 4000 arrays, whereas many others had call rates of 100%.

Prediction of missing genotypes by neighboring haplotypes

Missing genotypes were estimated by imputation using the software IMPUTE v0.5.0, based on the HapMap II haplotype panel. The current analysis was performed using directly genotyped SNPs only, without imputed data.
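The expected curve in Figure A corresponds to a model in which every genotype call fails independently with the same probability, so that the number of arrays with a NoCall for a given SNP is approximately Poisson distributed. The sketch below illustrates that calculation under stated assumptions: only the array count of 4081 comes from the text, while the function name, the number of SNPs, and the NoCall rate are hypothetical placeholders.

```python
import math

def expected_snp_counts(n_snps, n_arrays, nocall_rate, max_k):
    """Expected number of SNPs that have exactly k NoCalls, for k = 0..max_k.

    Assumes random missingness: each call fails independently with probability
    nocall_rate, so the per-SNP NoCall count is approximately
    Poisson(lambda = n_arrays * nocall_rate). Requires nocall_rate > 0.
    """
    lam = n_arrays * nocall_rate
    counts = []
    for k in range(max_k + 1):
        # Poisson probability computed in log space to avoid overflow in k!
        log_p = -lam + k * math.log(lam) - math.lgamma(k + 1)
        counts.append(n_snps * math.exp(log_p))
    return counts

# 4081 arrays as in Figure A; n_snps and nocall_rate are hypothetical.
expected = expected_snp_counts(n_snps=1000, n_arrays=4081, nocall_rate=0.01, max_k=100)
peak = max(range(len(expected)), key=expected.__getitem__)
print("most likely number of NoCalls per SNP:", peak)
```

An observed histogram with many SNPs at zero NoCalls and a long tail of SNPs failing on most arrays would deviate from this single-rate curve, which is the pattern the text describes.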

Eliminating outliers by genetic ancestry

Figure B: A further step of quality control (QC) of the SHIP genotype dataset included checks for possible duplicates among the individuals. For this purpose an independent subset of high-quality genotyped SNPs was extracted. Only autosomal SNPs with a call rate greater than 97%, a minor allele frequency greater than 1%, and in Hardy-Weinberg equilibrium (p_HWE > 0.001) were included. SNPs were considered independent if their multiple correlation coefficient was R² < 0.5 within a window of 50 consecutive SNPs, with the window shifted by 5 SNPs at each step. Based on the remaining 141,804 SNPs, a pairwise identity by descent (IBD) estimation was calculated using PLINK version . The results are shown in the figure: seven pairs among all samples were identified as identical and 162 as first-degree relatives (pi_hat between 0.4 and 0.6).
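Two of the thresholds above can be made concrete with a short sketch. The multiple correlation criterion R² < 0.5 is equivalent to a variance inflation factor VIF = 1 / (1 − R²) below 2, which is the form in which pruning thresholds are often specified. The pi_hat range for first-degree relatives is taken from the text; the cut-off used here for "identical" pairs and all function names are assumptions made for illustration only.

```python
def vif_from_r2(r2):
    """Variance inflation factor corresponding to a multiple correlation R^2."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r2)

def classify_pair(pi_hat, identical_min=0.9, first_degree=(0.4, 0.6)):
    """Label a pair of samples from its estimated IBD proportion pi_hat.

    first_degree is the range quoted in the text; identical_min is a
    hypothetical cut-off, since the text does not state the one used.
    """
    low, high = first_degree
    if pi_hat >= identical_min:
        return "identical"
    if low <= pi_hat <= high:
        return "first-degree"
    return "other"

print(vif_from_r2(0.5))
for value in (1.0, 0.51, 0.12):
    print(value, classify_pair(value))
```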

Accounting for population stratification

Figure C: Plot of the first two principal components of the PCA of the genetic data from all genotyped SHIP participants. Blue, orange and red spots represent individuals exceeding 8 standard deviations in the first, first two or any of the first 10 principal components, respectively, after 5 iterations. Green spots correspond to individuals exceeding 8 standard deviations in the first principal component at the first iteration only. The first three principal components explain 0.38% of the variance. After removing all outliers identified over 5 iterations as described above, the variance explained by the first three principal components was reduced to 0.16%.

On the independent SNP dataset (described above), a principal component analysis (PCA) was conducted and the first 10 eigenvectors were calculated using the software SMARTPCA from the package EIGENSOFT version 3.0. This made it possible to adjust for genetic population substructure or to remove the corresponding outliers, if necessary. Since no detailed information on the genetic origin or relationship of the participants was available, no further analysis could be performed on this topic. No association between the principal components and the array call rates was observed.

The first 10 eigenvectors were included as covariates in the model, and the most extreme individuals (only six of whom were part of our study) were excluded as described. We then tested our five loci for association as before, adding the 10 PCA components as covariates. In no case did the association between a metabolic trait and any of the 10 PCA components reach the 5% significance level. Moreover, after exclusion of the six outliers, the strength of association did not change considerably compared to the values reported in Table 1. For the five loci reported here, this shows that population stratification does not play a role and that the signal is not driven by a handful of individuals.
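The iterative outlier rule (more than 8 standard deviations on any of the leading principal components, repeated for up to 5 iterations) can be sketched as below. This is a simplified illustration and not the SMARTPCA procedure itself: it keeps the component scores fixed and only recomputes their mean and standard deviation after each round, whereas a full implementation would recompute the PCA. The function name and the toy scores are assumptions.

```python
from statistics import mean, pstdev

def remove_outliers(scores, n_sd=8.0, max_iter=5):
    """Iteratively drop individuals whose score on any component is extreme.

    scores: dict mapping individual id -> list of principal component scores.
    Simplification (assumption): scores are not recomputed between rounds;
    only the per-component mean and standard deviation are updated.
    Returns (sorted kept ids, removed ids in order of removal).
    """
    kept = dict(scores)
    removed = []
    for _ in range(max_iter):
        if not kept:
            break
        n_comp = len(next(iter(kept.values())))
        flagged = set()
        for c in range(n_comp):
            column = [v[c] for v in kept.values()]
            mu, sd = mean(column), pstdev(column)
            if sd == 0:
                continue
            for ident, v in kept.items():
                if abs(v[c] - mu) > n_sd * sd:
                    flagged.add(ident)
        if not flagged:
            break
        for ident in flagged:
            del kept[ident]
        removed.extend(sorted(flagged))
    return sorted(kept), removed

# Toy data: 200 unremarkable individuals plus one extreme value on PC1.
toy = {f"id{i:03d}": [1.0 if i % 2 else -1.0, 0.0] for i in range(200)}
toy["outlier"] = [100.0, 0.0]
kept_ids, removed_ids = remove_outliers(toy)
print(len(kept_ids), removed_ids)
```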

Manual NMR metabolite annotation

Targeted Profiling of the urine samples was done in a blinded manner by three analysts using Chenomx NMR Suite 6.1. A full peer review was performed by a different analyst for each annotated NMR spectrum. Finally, a third overall review was done to catch any gross errors such as missing values and mixed identifications.

The reproducibility of the data can be broken down into two sources of error: the variability introduced by the NMR machine itself, and the variability introduced by Targeted Profiling. The error from the NMR machine is negligible. The signal-to-noise ratio for these spectra, measured on a 400 MHz spectrometer for 1D proton NMR using 32 scans, was calculated to be 0.5%. The coefficient of variation of the error introduced by using Chenomx NMR Suite to calculate absolute concentrations of metabolites is on average less than 5% for a trained user for each metabolite, as reported by Chenomx internal testing. The error varies for different metabolites and different types of mixtures. The error of identifying metabolites using the Chenomx library is more difficult to measure, since there is no practical way to test this using standard urine mixtures. The results reported for this project contain only positively identified metabolites.

Distribution of metabolic traits and scaling

Figure D: Distribution plots for the SHIP-0 data from the discovery set; unscaled (left) and log-scaled data (right) for the metabolic traits reported in Table 1. The log-scaled distribution of 3-Aminoisobutyrate shows a multimodal structure, which reveals the strong influence of the genotype on this metabolic trait. Most of the other metabolic traits show similar distributions. Based on this observation and prior experience, we decided to apply lognormal scaling to the data prior to testing for association. An additional argument for using log-scaled data together with ratios is the property log(a/b) = −log(b/a): a ratio and its inverse yield the same association up to the sign of the effect, so only one of the two needs to be tested. This halves the multiple testing burden that normally comes with testing all-against-all metabolite ratio pairs.
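The symmetry of log-scaled ratios and its effect on the number of tests can be checked with a few lines of code. The sketch below is illustrative only: the metabolite names are taken from the traits in Table 1, but the concentration values and the helper name `log_ratio` are made up.

```python
import math
from itertools import combinations, permutations

def log_ratio(a, b):
    """Log-scaled ratio of two concentrations."""
    return math.log(a / b)

# Made-up concentrations (illustration only).
metabolites = {"lysine": 120.0, "valine": 45.0, "alanine": 300.0, "formate": 80.0}

ordered_pairs = list(permutations(metabolites, 2))    # a/b and b/a counted separately
unordered_pairs = list(combinations(metabolites, 2))  # each pair counted once

for a, b in unordered_pairs:
    forward = log_ratio(metabolites[a], metabolites[b])
    backward = log_ratio(metabolites[b], metabolites[a])
    # log(a/b) = -log(b/a): the inverse ratio carries no additional information.
    assert math.isclose(forward, -backward)

print(len(ordered_pairs), "ordered ratios vs", len(unordered_pairs), "after using the symmetry")
```

For n metabolites this reduces the number of ratio traits from n(n − 1) to n(n − 1)/2.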

Metabolic traits measured on the SHIP-0 and SHIP-1 visits for the same individuals

Table columns: trait, N, R (unscaled), R (log-scaled)
Traits: 3-Aminoisobutyrate; Formate / Succinate; 2-Hydroxyisobutyrate; Lysine / Valine; Alanine / N,N-Dimethylglycine

Figure E: Scatter plots for metabolic traits measured in the same individuals during SHIP-0 and again during SHIP-1 five years later, colored by genotype of the associating SNP: major allele (blue), heterozygotes (green), minor allele (red); and the Pearson correlation coefficient (R) between the two measurements, computed on both unscaled and log-scaled data. We consider this an additional test of robustness, since it shows that the genetic contribution to the metabolic phenotype of the individuals remains stable over a period of five years, as one would expect. A phenotypic signal this stable argues against a spurious association.
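The two correlation coefficients reported for each trait can be computed as in the following sketch, which evaluates Pearson's R once on the raw values and once after taking logarithms. The paired values below are invented for illustration and are not SHIP data; the function name `pearson_r` is likewise an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented paired measurements of one trait at the two visits (not SHIP data).
ship0 = [12.0, 150.0, 35.0, 480.0, 20.0, 95.0]
ship1 = [15.0, 120.0, 40.0, 510.0, 18.0, 110.0]

r_unscaled = pearson_r(ship0, ship1)
r_logscaled = pearson_r([math.log(v) for v in ship0], [math.log(v) for v in ship1])
print(f"R (unscaled) = {r_unscaled:.3f}, R (log-scaled) = {r_logscaled:.3f}")
```

On skewed concentration data the unscaled coefficient tends to be dominated by a few large values, which is one reason to report the log-scaled coefficient alongside it.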

Regional association plots based on imputed genotype data

Figure F: Regional association plots based on 1000 Genomes Project imputed data (pilot 1 genotypes released March 2010; phased haplotypes released June 2010), using the PLINK option to analyze "dosage" SNP datasets from imputation packages. In all five cases the leading genotyped SNPs (blue) are comparable in their strength of association to the imputed SNPs (black) if one considers the additional multiple testing burden induced by the use of the additional imputed data (>10⁴ in all cases). The association data presented in these plots are provided as Supplementary Table 2.