H3A - Genome-Wide Association testing SOP

Similar documents
PLINK gplink Haploview

Genome-Wide Association Studies (GWAS): Computational Them

General aspects of genome-wide association studies

Derrek Paul Hibar

S SG. Metabolomics meets Genomics. Hemant K. Tiwari, Ph.D. Professor and Head. Metabolomics: Bench to Bedside. ection ON tatistical.

Axiom mydesign Custom Array design guide for human genotyping applications

Runs of Homozygosity Analysis Tutorial

Oral Cleft Targeted Sequencing Project

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

Gap hunting to characterize clustered probe signals in Illumina methylation array data

Author's response to reviews

Why can GBS be complicated? Tools for filtering, error correction and imputation.

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs

Association Mapping in Plants PLSC 731 Plant Molecular Genetics Phil McClean April, 2010

Personal Genomics Platform White Paper Last Updated November 15, Executive Summary

SNP calling and VCF format

Introduction to statistics for Genome- Wide Association Studies (GWAS) Day 2 Section 8


Using the Trio Workflow in Partek Genomics Suite v6.6

Supplementary Note: Detecting population structure in rare variant data

MMAP Genomic Matrix Calculations

High Cross-Platform Genotyping Concordance of Axiom High-Density Microarrays and Eureka Low-Density Targeted NGS Assays

Illumina s GWAS Roadmap: next-generation genotyping studies in the post-1kgp era

Linking Genetic Variation to Important Phenotypes

Mapping strategies for sequence reads

Genotype SNP Imputation Methods Manual, Version (May 14, 2010)

Concepts and relevance of genome-wide association studies

The Genetic Structure of the Swedish Population

Additional file 8 (Text S2): Detailed Methods

Supplementary Figures

KNN-MDR: a learning approach for improving interactions mapping performances in genome wide association studies

Supplementary Material for Extremely low-coverage whole genome sequencing in South Asians captures population genomics information

UK Biobank Axiom Array

University of Groningen. The value of haplotypes Vries, Anne René de

SNPassoc: an R package to perform whole genome association studies

Linkage Disequilibrium. Adele Crane & Angela Taravella

Haplotype phasing in large cohorts: Modeling, search, or both?

Custom TaqMan Assays DESIGN AND ORDERING GUIDE. For SNP Genotyping and Gene Expression Assays. Publication Number Revision G

Population Genetics (Learning Objectives)

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data

An introduction to genetics and molecular biology

Conifer Translational Genomics Network Coordinated Agricultural Project

Data Descriptor: Sequence data and association statistics from 12,940 type 2 diabetes cases and controls

Fine-scale mapping of meiotic recombination in Asians

Sequencing Millions of Animals for Genomic Selection 2.0

HHS Public Access Author manuscript Nat Biotechnol. Author manuscript; available in PMC 2012 May 07.

INTRODUCTION TO GENETIC EPIDEMIOLOGY (Antwerp Series) Prof. Dr. Dr. K. Van Steen

Mapping and Mapping Populations

Evaluation of Genome wide SNP Haplotype Blocks for Human Identification Applications

Introduction to the UCSC genome browser

Section 4 - Guidelines for DNA Technology. Version October, 2017

Methods for the analysis of nuclear DNA data

OBJECTIVES-ACTIVITIES 2-4

Accelerate High Throughput Analysis for Genome Sequencing with GPU

Basic Concepts of Human Genetics

>3 Sequencing coverage (x)

Whole genome sequencing in drug discovery research: a one fits all solution?

SNPchiMp v A database to disentangle the SNPchip jungle. Latest update: January 2014.

Amapofhumangenomevariationfrom population-scale sequencing

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

NordicDB: A Nordic pool and portal for genome-wide control data

Analyzing Affymetrix GeneChip SNP 6 Copy Number Data in Partek

What could the pig sector learn from the cattle sector. Nordic breeding evaluation in dairy cattle

An introductory overview of the current state of statistical genetics

Phasing of 2-SNP Genotypes based on Non-Random Mating Model

Characterization of Allele-Specific Copy Number in Tumor Genomes

ARTICLE Leveraging Multi-ethnic Evidence for Mapping Complex Traits in Minority Populations: An Empirical Bayes Approach

ISAG FAO Genetic Diversity

Genetics Culminating Project

Genotype instability during long-term subculture of lymphoblastoid cell lines

Question. In the last 100 years. What is Feed Efficiency? Genetics of Feed Efficiency and Applications for the Dairy Industry

Getting high-quality cytogenetic data is a SNP.

Targeted Sequencing Using Droplet-Based Microfluidics. Keith Brown Director, Sales

Specific tools for genomics in UNIX: bedtools, bedops, vc:ools, Course: Work with genomic data in the UNIX April 2015

Outline. Analysis of Microarray Data. Most important design question. General experimental issues

ARTICLE Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering

Chromosome Analysis Suite 3.0 (ChAS 3.0)

Overview One of the promises of studies of human genetic variation is to learn about human history and also to learn about natural selection.

BMC Bioinformatics. Open Access. Abstract

GENOME-WIDE ASSOCIATION STUDIES: THEORETICAL AND PRACTICAL CONCERNS

Beyond single genes or proteins

Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

SNP-array based Copy number analysis in R R course,

Statistical Methods for Genome Wide Association Studies

Basic Concepts of Human Genetics

Variant calling in NGS experiments

A GENOTYPE CALLING ALGORITHM FOR AFFYMETRIX SNP ARRAYS

Hardy-Weinberg Principle

Sequence variation in the short tandem repeat system SE33 discovered by next generation sequencing

Harnessing the power of RADseq for ecological and evolutionary genomics

"Genetics in geographically structured populations: defining, estimating and interpreting FST."

Lees J.A., Vehkala M. et al., 2016 In Review

POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping

Genotype AA Aa aa Total N ind We assume that the order of alleles in Aa does not play a role. The genotypic frequencies follow as

Genomic Selection Using Low-Density Marker Panels

File S1 Technical details of a SNP array optimized for population genetics

USER MANUAL for the use of the human Genome Clinical Annotation Tool (h-gcat) uthors: Klaas J. Wierenga, MD & Zhijie Jiang, P PhD

RADSeq Data Analysis. Through STACKS on Galaxy. Yvan Le Bras Anthony Bretaudeau Cyril Monjeaud Gildas Le Corguillé

Transcription:

H3A - Genome-Wide Association testing SOP Introduction File format Strand errors Sample quality control Marker quality control Batch effects Population stratification Association testing Replication Meta analyses

Phase 2 (quality control) 1. check for overall population structure a. between cases vs controls b. batch effects c. amongst reference populations (eg KGP) 2. relatedness Phase 3 (disease association testing) 1. LD calculation 2. Impute missing genotypes (if applicable) 3. Association testing 4. Local ancestry Introduction The following document describes some recommended steps for processing genotype data from a genome wide association study, as will be generated by many of the H3Africa projects. Although it is written with data from SNP genotyping arrays in mind, many of the steps also apply to full genome sequence and exome sequencing data. Many of the tests can be performed with the PLINK software suite, but where other software is required this will be indicated. File format This SOP assumes that incoming data is in the binary format used by the PLINK software suite. The PLINK binary format (hereafter referred to as bped) encodes a dataset as a set of three files, with the following suffixes to their names:.bed: this is the binary genotype data, stored as 2 bits per genotype per sample.bim: a text file containing marker information, one line per marker, in genomic order.fam: a text file containing sample pedigree information Strand errors As a byproduct of the genotyping technology, and the annotation data that accompanies the chips, SNP data from both Illumina and Affymetrix platforms may be reported as the allele on either the forward or reverse strand. Although the information to orient these calls properly is contained within the annotation data, conversion from the native format to PLINK format can be tricky, and errors will not be evident in ambiguous A T/G C SNPs. Therefore it is prudent to check that the alleles reported in the bim file match the known alleles in dbsnp. Another easy check is to identify the control samples that are often included in the dataset (eg samples from HapMap) and compare these to their known genotypes.

Sample quality control Before association testing can be done, some sample QC steps should be taken to minimise confounding factors. Some of these filters would have been applied at the genotype calling phase, but it is worth confirming them if the genotype calling was not done internally. Sample call rate Samples with poor call rates should have been removed at the genotype calling stage, since this is a strong indicator of poor sample quality (perhaps due to problems in collection, processing or hybridisation) and will adversely affect the intensity clusters used to call genotypes, affecting the calling of all samples. A typical filter is to remove samples with a call rate lower than 98%. This filter should be run after the marker call rate filter mentioned elsewhere in this document, to reduce the effect of poorly performing probes on a sample s call rate Samples with a low call rate can be removed with PLINK s mind option. Identity errors The sex of an individual can be determined from the genotype data, and should be compared to the declared sex recorded in the phenotype records. Samples with a mismatched sex call should be investigated to determine whether this points to errors in other samples. Checking sex of a sample can be done with PLINK s check sex option. Samples with chromosomal abnormalities Chromosomal abnormalities such as aneuploidy or long stretches of homozygosity should be tested for and the causes identified. These cases should be examined with a geneticist to determined possible causes and decide whether the sample should be removed. Control samples To assist in the estimation of genotyping accuracy, samples with well known genotypes (for example from the HapMap project) are often included in the genotyping pipeline. These genotypes should be compared to the known data, and can be used as a measure of accuracy. Another quality control strategy is to include replicates from the group under study. These should be compared with each other and markers that are not concordant should be filtered. Concordance between two samples can be measured with PLINK s bmerge and merge mode options Reported relationships between samples Another check of sample identity is to confirm that relatedness between individuals matches their reported relationships. This can serve to highlight: labelling issues average degree of relatedness between individuals across the whole population non paternity

duplicate samples not labelled as such Pairwise relatedness can be computed using PLINK s genome option Marker quality control Marker call rate On arrays with hundreds of thousands of probes, some markers can be expected to perform poorly. These should be removed from the dataset because it is likely that even the samples that were successfully called are inaccurate due to poor clustering of the hybridisation intensities. A typical filter is to remove markers with a call rate lower than 98% Samples with a low call rate can be removed with PLINK s geno option. Minor allele frequency Removal of SNPs with a low MAF (minor allele frequency) is often recommended because they have low statistical power, increase the multiple test correction required, and might correlate with poor calling. A typical threshold is 0.01, but this is highly dependent on the size of the study. Power calculation Hardy Weinberg equilibrium Hardy Weinberg equilibrium should be calculated for each SNP, and flagged since this can point to several factors such as population stratification, poor genotype calling or even true associations with the study traits. An in depth examination of HWE and GWAS can be found in 1 Calculation and use of the Hardy Weinberg model in association studies. HWE can be calculated in PLINK with the hardy option. Batch effects Batch effects refers to systematic differences in results that correlate with factors not under study, such as: sampling site 96 well plates date of sample processing For multi center studies (which many H3Africa studies are), phenotype standardisation should be confirmed. Systematic differences in phenotype collection can introduce batch effects which should be tested for. These should be tested for by measuring across each batch 1 http://www.ncbi.nlm.nih.gov/pubmed/18428419

Population stratification Population stratification refers to structure within the sample group that is the result of systematic genetic differences between individuals that correlate with the phenotypic data. This could result in allele frequency differences that are due to ancestral proportion differences between cases and controls being mistaken for an association with the phenotype. Two methods can be used to examine population substructure: STRUCTURE/Admixture Allele frequency based methods, such as STRUCTURE. Add known reference populations such as hapmap pops Determine most likely K Look for: differences between possible batch samples that cluster with the wrong group look for individuals that cluster with the wrong population, or have admixture pattern different from others other samples in the same group Eigensoft PCA based methods, such as Eigensoft. Eigensoft s smartpca, and examine the top 10 eigenvectors. If any of these eigenvectors display segmentation based on a potential confounder, these factors should be examine further, and possibly corrected for. Individuals that differ significantly from the rest in terms of cluster membership should be investigated for possible removal. A common threshold in Eigensoft is to remove individuals that are outliers by more than 6 standard deviations. Both these methods should be used on the data in an exploratory fashion, to try identify possible confounding factors such as: case/control status. One of the assumptions in GWAS is that overall, the case and control groups have the same genetic background. If stratification is different between the two groups, this could lead to false positives batch effects. Check for homogeneity across 96 well plates, collection centers etc. A common test is to perform PCA on the dataset, and label the samples according to these phenotypes other collected data. e.g. socioeconomic status structure in comparison to reference populations such as those from HapMap/1000 genomes Outliers identified in PCA analysis should also be removed from further analysis, since these could indicate poor samples, mislabelled individuals

Association testing Single locus tests The most basic test of association is the single locus test. 1. LD calculation 2. Impute missing genotypes 3. Association testing a. Use plink to do a simple association testing b. use EMMAX which adjusts for population stratification/relatedness Adjustment for covariates

Replication The relatively low power of most GWAS designs means that a substantial number of false positives can be expected to be generated, so validation via replication in an independent population is an important part of most studies. A common strategy is to genotype a subset of the samples on a high coverage array, and then follow up with targeted genotyping of markers that appear interesting, on a larger number of samples Meta analyses References: