MMAP Genomic Matrix Calculations

Similar documents
Goal: To use GCTA to estimate h 2 SNP from whole genome sequence data & understand how MAF/LD patterns influence biases

PLINK gplink Haploview

THE HEALTH AND RETIREMENT STUDY: GENETIC DATA UPDATE

H3A - Genome-Wide Association testing SOP

GCTA/GREML. Rebecca Johnson. March 30th, 2017

Package RVFam. March 10, 2015

Lecture 3: Introduction to the PLINK Software. Summer Institute in Statistical Genetics 2015

Lecture 3: Introduction to the PLINK Software. Summer Institute in Statistical Genetics 2017

Statistical Methods in Bioinformatics

Traditional Genetic Improvement. Genetic variation is due to differences in DNA sequence. Adding DNA sequence data to traditional breeding.

Lab 1: A review of linear models

Package snpready. April 11, 2018

GENOME WIDE ASSOCIATION STUDY OF INSECT BITE HYPERSENSITIVITY IN TWO POPULATION OF ICELANDIC HORSES

Quantitative Genetics

AN EVALUATION OF POWER TO DETECT LOW-FREQUENCY VARIANT ASSOCIATIONS USING ALLELE-MATCHING TESTS THAT ACCOUNT FOR UNCERTAINTY

Using the Trio Workflow in Partek Genomics Suite v6.6

Runs of Homozygosity Analysis Tutorial

Supplementary Note: Detecting population structure in rare variant data

RV-TDT: Rare Variant Extensions of the Transmission Disequilibrium Test

Answers to additional linkage problems.

Department of Molecular Biology and Genetics, Aarhus University, DK-8830 Tjele, Denmark. Nordic Cattle Genetic Evaluation, DK-8200 Aarhus N, Denmark

Introduction to Quantitative Genomics / Genetics

Genomic Selection in Cereals. Just Jensen Center for Quantitative Genetics and Genomics

The potential of genotyping pooled DNA to leverage commercial phenotypes for genetic improvement of beef cattle

Computations with Markers

Genomic Selection with Linear Models and Rank Aggregation

Genotype Prediction with SVMs

Genome Wide Association Study for Binomially Distributed Traits: A Case Study for Stalk Lodging in Maize

Survival Outcome Prediction for Cancer Patients based on Gene Interaction Network Analysis and Expression Profile Classification

Imputation. Genetics of Human Complex Traits

arxiv: v1 [stat.ap] 31 Jul 2014

Should the Markers on X Chromosome be Used for Genomic Prediction?

CUMACH - A Fast GPU-based Genotype Imputation Tool. Agatha Hu

How large-scale genomic evaluations are possible. Daniela Lourenco

Genomic Selection in Dairy Cattle

Bulked Segregant Analysis For Fine Mapping Of Genes. Cheng Zou, Qi Sun Bioinformatics Facility Cornell University

Haplotype phasing in large cohorts: Modeling, search, or both?

Haplotype Based Association Tests. Biostatistics 666 Lecture 10

Statistical Tools for Predicting Ancestry from Genetic Data

General aspects of genome-wide association studies

Introduction to Genome Wide Association Studies 2014 Sydney Brenner Institute for Molecular Bioscience/Wits Bioinformatics Shaun Aron

High-density SNP Genotyping Analysis of Broiler Breeding Lines

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data

B) You can conclude that A 1 is identical by descent. Notice that A2 had to come from the father (and therefore, A1 is maternal in both cases).


Additional file 8 (Text S2): Detailed Methods

Package trigger. R topics documented: August 16, Type Package

Package FSTpackage. June 27, 2017

Genotype quality control with plinkqc Hannah Meyer

Population differentiation analysis of 54,734 European Americans reveals independent evolution of ADH1B gene in Europe and East Asia

Nature Genetics: doi: /ng.3143

Interbull s Activities in Genomic Evaluation

Basic Concepts of Human Genetics

Use of marker information in PIGBLUP v5.20

PoultryTechnical NEWS. GenomChicks Advanced layer genetics using genomic breeding values. For further information, please contact us:

Dan Geiger. Many slides were prepared by Ma ayan Fishelson, some are due to Nir Friedman, and some are mine. I have slightly edited many slides.

Genome-Wide Association Studies (GWAS): Computational Them

Genomic Selection in R

New Features in JMP Genomics 4.1 WHITE PAPER

Derrek Paul Hibar

Oral Cleft Targeted Sequencing Project

Split input data, map to hg19, split by chromosome, and variant call with GATK

Autozygosity by difference a method for locating autosomal recessive mutations. Geoff Pollott

CHAPTER 4 EXAMPLES: EXPLORATORY FACTOR ANALYSIS

Assignment 9: Genetic Variation

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Estimation problems in high throughput SNP platforms

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009

Lees J.A., Vehkala M. et al., 2016 In Review

Genomic prediction for numerically small breeds, using models with pre selected and differentially weighted markers

Lecture 6: GWAS in Samples with Structure. Summer Institute in Statistical Genetics 2015

Hierarchical Linear Modeling: A Primer 1 (Measures Within People) R. C. Gardner Department of Psychology

BLUPF90 suite of programs for animal breeding with focus on genomics

Basic Concepts of Human Genetics

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs

Single-cell sequencing

ARTICLE ROADTRIPS: Case-Control Association Testing with Partially or Completely Unknown Population and Pedigree Structure

User s Guide Version 1.10 (Last update: July 12, 2013)

Assemblytics: a web analytics tool for the detection of assembly-based variants Maria Nattestad and Michael C. Schatz

Personal Genomics Platform White Paper Last Updated November 15, Executive Summary

A genome wide association study of metabolic traits in human urine

Lecture 6: Association Mapping. Michael Gore lecture notes Tucson Winter Institute version 18 Jan 2013

Nature Genetics: doi: /ng Supplementary Figure 1. H3K27ac HiChIP enriches enhancer promoter-associated chromatin contacts.

Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Supplementary information

MPG NGS workshop I: SNP calling

Computational Genomics

Algorithms for Genetics: Introduction, and sources of variation

QTL Mapping, MAS, and Genomic Selection

, 2018 (1) AGIL

Whole Genome Sequencing. Biostatistics 666

Next Generation Genetics: Using deep sequencing to connect phenotype to genotype

HISTORICAL LINGUISTICS AND MOLECULAR ANTHROPOLOGY

Genome-wide association studies (GWAS) Part 1

Fitting. a classical twin model using SOLAR

GENETICS - CLUTCH CH.20 QUANTITATIVE GENETICS.

Axiom mydesign Custom Array design guide for human genotyping applications

Proceedings of the World Congress on Genetics Applied to Livestock Production 11.64

Chapter 1 What s New in SAS/Genetics 9.0, 9.1, and 9.1.3

POLYMORPHISM AND VARIANT ANALYSIS. Matt Hudson Crop Sciences NCSA HPCBio IGB University of Illinois

Transcription:

Last Update: 9/28/2014 MMAP Genomic Matrix Calculations MMAP has options to compute relationship matrices using genetic markers. The markers may be genotypes or dosages. Additive and dominant covariance matrices are available using SNP or pooled variances. Options are available for single or double precision matrices with or without adjusting for missing data. The memory footprint is determined by the size of the subject group requested and number of markers. Options to store the number of markers is available to combine matrices across genomic regions as weighted sums. Parallel calculations are controlled by the number of threads requested. PCs can be extracted from any genomic matrix. The default option in MMAP is double precision matrix multiplication of the Z Z where Z is a subject-by-marker matrix of normalized genotype values. The sums in Z Z are generally scaled by the number of observed genotypes to compute the average covariance between each pair. The matrix multiplication Z Z default is double precision (DP). The counts of the observed genotypes between each pair is single precision (SP). DP requires twice the memory and compute time as SP. Thus, the total computational footprint is T= DP + SP = 3SP. There are options available to trade speed/memory for accuracy. First, the double precision multiplication can be replaced by single precision to reduce T to 2SP. For the range of values the genomic matrices, single precision is generally sufficient. Second, if there is no missing data (full genotyping or imputed genotypes) or little missing genotype data, the exact count between subjects can be replaced by the number of markers to eliminate the SP count multiplication. Then T=SP, which is 3 times faster than the DP and exact count model. Also the file size is smaller. The impact on downstream analysis of DP vs SP and exact count vs average may depend the size of your data (number of subjects and markers). If genomic matrices are to be combined, the counts need to be added to the files, in which case, the two options to compute the average are available. Options for X chromosome matrices are being added. There is no pedigree information required for calculation of autosomal matrices. Additive matrices 1. Each SNP has own variance. Generally used in human genetics. Default setting. The average can be taken over the number of observed pairs or the number of markers. 2. SNPs have common variance. Generally used for genomic prediction. Requires pooled_variance option. The pooled variance uses all the markers.

Dominance matrices Genotype coding: AA=0, AB=BB=1. Similar to the additive matrix there are two forms 1. Each SNP has its own variance. Default setting. The average can be taken over the number of observed pairs or the number of markers 2. SNPs have common variance. Generally used for genomic prediction. Requires pooled_variance option. The pooled variance uses all the markers. KING genomic matrices Adding the optio king_homo option will generate the genomic covariance matrix based on Equation 5 in ref [1]. 1/ 2 X X 2 p (1 p ) ij m m m m KING robust covariance matrices are being implemented. Runtime options i j 2 --write_binary_gmatrix_file Additive matrix calculation --write_binary_dmatrix_file Dominance matrix calculation --binary_genotype_filename <SxM file> A subject-by-marker binary genotype file --binary_output_filename <file> Output filename --group_size <num> Controls the subject-by-subject group size that is computed to fill in binary. This option impacts that memory footprint at the subject level. Generally the group_size should be as large as possible for efficiency.

--pooled_variance Use 2 nd definition of above matrices. Note that pooled variance is less impacted by MAF. --single_pedigree Generate pairwise covariance values for all subjects, independent of the pedigree. --write_matrix_counts Add counts to the binary output file. Required if genomic matrices are to be combined. Default is to write pairwise counts base on observed data. --use_complete_data_count Uses the number of markers for n ij in the above formulas to compute the average. Can be combined with write_matrix_counts. --num_mkl_threads <num> Parallelization for MKL matrix multiplication --autosome Extract the autosomal SNPs --chromosome <numbers> Extract SNP on chromosomes in <numbers> --genomic_region <chr> <start bp> <stop bp> Extract SNPs in the genomic region(s) specified. --marker_set <file> Use markers in <file> --subject_set <file> Use subjects in <file> --min_minor_allele_frequency <value> Restrict analysis to markers with MAF greater than <value>. Not used for imputed data. PC calculations MMAP computes the PCs then outputs to csv delimited file --compute_pc_file MMAP option --binary_input_filename <file> Binary covariance file to extract PCs --max_pcs2print <num> Print the <num> PCs with largest eigenvalues --pc_output_filename <file> Output file for the PCs. Sorted by largest to smallest eigenvalue --subject_set <file> Restrict the PC calculation to subjects in the file. Note that in practice PCs are computed for all subjects in binary then used for all downstream analyses independent of the number of subjects in the phenotype file. The PCs restricted to subsets of the subjects are no longer orthogonal, but the impact is minimal. --print_scaled_pcs Also print eigenvectors scaled by the sqrt of the eigenvalue Extracting Values from Matrix

There are two options to extract the pairwise values from a binary covariance matrix and an option to extract the full matrix. Subject set options are being added to be able extract specified subjects. --variance_component_matrix_mmap2pairs Output each pair once. The order depends on the order of the subjects in the file. The output is the upper diagonal of the matrix --variance_component_matrix_mmap2pairs_all Output each pair of different subjects twice with opposite orders. The output is the full matrix --variance_component_matrix_mmap2matrix Rectangular output with the subject ids as the first row. Both options require: --binary_input_filename <file> --csv_output_filename <file> Combining Genomic Matrices --combine_binary_matrix_files <file list> List of files to be combined. They must all be of the same count type, that is, have been created with the same --write_matrix_counts option with or without the --use_complete_data_count option. --binary_output_filename <output file> Name of the combined genomic matrix file Example Commands 1. Double precision, pairwise adjustment to average, group size 4000 and autosome markers included. Running time 3xtime(SP). mmap --write_binary_gmatrix_file --binary_genotype_filename <SxM file> -- binary_output_filename study.0.5.bin --group_size 4000 --single_pedigree -- num_mkl_threads 4 --min_minor_allele_frequency 0.05 --autosome 2. bash shell commands to create dominance matrix by chromosome then combine into single file. Group size is 1000. Single precision and constant adjustment to average, so running time is time(sp) for chr in {1..22} do mmap --write_binary_dmatrix_file --binary_genotype_filename <SxM file> -- binary_output_filename dom.${chr}.0.5.bin --group_size 1000 --single_pedigree --

write_matrix_counts --num_mkl_threads 4 --min_minor_allele_frequency 0.05 chromosome ${chr} single_precision use_complete_data_count done filelist=`ls dom.*.0.5.bin` mmap --combine_binary_matrix_files $filelist --binary_output_filename dom.auto.0.5.bin 3. bash shell commands to create leave-one-chromosome out (LOO) matrices # get file list. Assume all files have name G.<chr>.bin ls G.*.bin > filelist list=`cat filelist` for loo in {1..22} do # delete loo chromosome grep -v $loo filelist >loolist looinput=`cat loolist` # combine the 21 gmatrices $prog --combine_binary_matrix_files $looinput --binary_output_filename G.LOO.${loo}.bin done # combine for autosome file $prog --combine_binary_matrix_files $filelist --binary_output_filename G.autosome.bin 1. Manichaikul, A., et al., Robust relationship inference in genome-wide association studies. Bioinformatics, 2010. 26(22): p. 2867-73.