Oral Cleft Targeted Sequencing Project

Oral Cleft Group
January, 2013

Contents

I Quality Control
  1 Summary of Multi-Family vcf File, Jan. 11, 2013
  2 Analysis Group Quality Control (Proposed Protocol)
    2.1 vcftools
  3 Targeted Region Capture & Read Generation
  4 Sequence Alignment and Processing
  5 Sample QC
  6 Relationship QC
  7 De novo and inherited variant calling
  8 Generation of Multi-family SNP genotypes

II Descriptive Statistics
  9 SNPs and SNVs
  10 MAFs and Heterozygosity

III Gen. Epi.
  11 Polymorphisms (maf ≥ 0.01)
    11.1 LD & Ethnicity
    11.2 Linkage & Association
      11.2.1 snpstats
      11.2.2 trio
  12 Rare Variants (maf < 0.01)
    12.1 Scan Statistic for Rare Variants in Trios
    12.2 de novo mutations

List of Tables
  1 Ethnicities
  2 genotype flags

List of Figures
  1 Missingness
  2 GQ histogram

Part I
Quality Control

1 Summary of Multi-Family vcf File, Jan. 11, 2013

In total, the vcf file contains 4,495 individuals and 175,189 markers. The target regions span 6.7 Mb, for a marker density of about 1 marker per 38 bp. Of the 4,495 subjects in the vcf file, only 4,139 are also contained in the pedigree file. The breakdown of subjects in the vcf and pedigree files by ethnicity is given in Table 1. Figure 1 displays the missingness per subject and per site, and Figure 2 the genotype quality (GQ) per marker, averaged across subjects.

2 Analysis Group Quality Control (Proposed Protocol)

We begin by defining the characteristics on which to filter and the criterion for exclusion on each. We divide the filters into three types (as does vcftools): genotype, subject and site filters. The following is an outline of our initial protocol.

1. Genotype Filters
   Remove all non-PASS flagged genotypes (see Table 2)
       vcftools --remove-filtered-geno FLAGNAME
   Require a genotype GQ of at least 40 and a depth of at least 10
       vcftools --mingq 40 --mindp 10

2. Subject Filters
   Remove subjects with missingness greater than 0.028
       vcftools --mind 0.972
   Remove subjects with average coverage below 20
       vcftools --min-indv-meandp 20

3. Site Filters
   Remove markers with missingness greater than 0.05
       vcftools --geno 0.95
   Remove markers with mean depth below 20
       vcftools --min-meandp 20
   Di-allelic variants only
       vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

Quality, Filter and Info are not available in our vcf file. Should we remove markers with mean GQ below 90? Perhaps we should use the median? We have not implemented a site-wide filter on GQ as of Jan. 11, 2013. We should investigate Mendelian inconsistencies and Hardy-Weinberg equilibrium.

2.1 vcftools

We implement the above filters with a single run of vcftools, using the following command.

    vcftools --gzvcf $vcf \
      --remove-filtered-geno NRC \
      --remove-filtered-geno SB1 \
      --remove-filtered-geno IRC \
      --remove-filtered-geno MMQSD50 \
      --remove-filtered-geno PB10 \
      --remove-filtered-geno MQD30 \
      --remove-filtered-geno DETP20 \
      --remove-filtered-geno MVC4 \
      --remove-filtered-geno HPMR5 \
      --remove-filtered-geno MVF5 \
      --remove-filtered-geno RLD25 \
      --mingq 40 \
      --mindp 10 \
      --mind 0.972 \
      --min-indv-meandp 20 \
      --geno 0.95 \
      --min-meandp 20 \
      --min-alleles 2 \
      --max-alleles 2 \
      --recode \
      --out $out \
      1> $logfile 2> $errfile
    bgzip $out.recode.vcf
    tabix $out.recode.vcf.gz

The run took about six hours and twenty minutes of wall-clock time, but very little memory.

    Job 515989 (qc) Complete
    User           = syounkin
    Queue          = gwas.q@compute-0-43.local
    Host           = compute-0-43.local
    Start Time     = 01/10/2013 18:59:15
    End Time       = 01/11/2013 01:17:38
    User Time      = 06:14:17
    System Time    = 00:00:55
    Wallclock Time = 06:18:23
    CPU            = 06:15:12
    Max vmem       = 191.148M
    Exit Status    = 0

I do not know the order in which vcftools applies the filters, and the order could make a difference. Running the output through the same filters a second time could alleviate some of those concerns. Although the results could still be order-dependent, the differences are likely to be small.
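
To make the order-dependence concern concrete, here is a minimal Python sketch. It is not a description of how vcftools works internally: the function names, the toy genotype data and the toy threshold of 0.5 are all hypothetical, and it assumes that a subject or site is removed when its missingness is strictly greater than the threshold. It applies only the subject and site missingness filters, in either order, to genotypes that the genotype-level filters have already set to missing.

```python
def missing_fraction(calls):
    """Fraction of entries that are missing (None); 1.0 if there are no entries."""
    calls = list(calls)
    if not calls:
        return 1.0
    return sum(call is None for call in calls) / len(calls)


def run_filters(geno, order, max_subject_missing, max_site_missing):
    """Apply subject and site missingness filters in the given order.

    geno maps subject -> {site -> genotype string, or None if missing}.
    order is a sequence of "subject" and "site" steps.
    Returns (kept_subjects, kept_sites) as sorted lists.
    """
    subjects = sorted(geno)
    sites = sorted({site for row in geno.values() for site in row})
    for step in order:
        if step == "subject":
            subjects = [
                s for s in subjects
                if missing_fraction(geno[s].get(m) for m in sites) <= max_subject_missing
            ]
        elif step == "site":
            sites = [
                m for m in sites
                if missing_fraction(geno[s].get(m) for s in subjects) <= max_site_missing
            ]
        else:
            raise ValueError(f"unknown filter step: {step!r}")
    return subjects, sites


# Hypothetical toy data; None marks a genotype already set to missing
# by the genotype-level filters (flags, GQ, depth).
TOY = {
    "S1": {"A": "0/1", "B": "0/0", "C": None},
    "S2": {"A": "0/0", "B": "0/1", "C": None},
    "S3": {"A": None, "B": "1/1", "C": None},
}

# Toy threshold of 0.5 for both filters (the protocol uses 0.028 and 0.05).
subject_first = run_filters(TOY, ["subject", "site"], 0.5, 0.5)
site_first = run_filters(TOY, ["site", "subject"], 0.5, 0.5)
print(subject_first)
print(site_first)
```

In this toy case, filtering subjects first drops S3, whereas filtering sites first removes the uninformative site C and lets S3 stay. Comparing the two orderings on the real data would show whether the difference matters in practice.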

Ethnicity              Count
European                 968
Chinese                1,371
Filipino               1,776
Guatemalan                24
In VCF and Pedigree    4,139
In VCF                 4,495
In Pedigree            4,998

Table 1: Ethnicities

Flag       Description
NRC        Unable to grab readcounts for variant allele
SB1        Reads supporting the variant have less than 0.01 fraction of the reads on one strand, but reference supporting reads are not similarly biased
IRC        Unable to grab any sort of readcount for either the reference or the variant allele
MMQSD50    Difference in average mismatch quality sum between variant and reference supporting reads is greater than 50
PB10       Average position on read less than 0.10 or greater than 0.9 fraction of the read length
MQD30      Difference in average mapping quality sum between variant and reference supporting reads is greater than 30
DETP20     Average distance of the variant base to the effective 3' end is less than 0.20
MVC4       Less than 4 high quality reads support the variant
HPMR5      Variant is flanked by a homopolymer of the same base and of length greater than or equal to 5
MVF5       Variant allele frequency is less than 0.05
RLD25      Difference in average clipped read length between variant and reference supporting reads is greater than 25

Table 2: Flags for genotypes found in vcf file. All genotypes with any of these flags were removed with vcftools. (Presumably, to remove a genotype the call is set to missing.)

[Figure: two histograms, "Histogram of missing.snp" and "Histogram of missing.subject"; y-axis: Frequency.]
Figure 1: Missingness

[Figure: histogram titled "Cleft Targeted Sequencing"; x-axis: Mean GQ per marker; y-axis: Frequency.]
Figure 2: GQ histogram

3 Targeted Region Capture & Read Generation

Do we need to discuss the methods behind the physical targeting of the regions? I'm curious to know how we created the fragments for sequencing.

4 Sequence Alignment and Processing

Data are aligned to the GRCh37-lite reference sequence with BWA v0.5.9, using quality trimming (-q 5) to remove low-quality bases at the ends of reads. Data from individual runs are merged, if necessary, with Picard v1.46 (http://picard.sourceforge.net). All reads are deduplicated using Picard MarkDuplicates.

5 Sample QC

Coverage across the target regions is evaluated using RefCov, and >70% of targets must reach an average coverage of 20X in order to pass QC. If genotypes from another platform are available, a genotyping concordance QC is performed by comparing genotypes called using Samtools to those from the outside platform. Any samples with an overall concordance below 90% are flagged. Columns from this QC report are listed below:

1. SNPs called: SNPs reported by Samtools.
2. With Genotype: The SNPs called are compared to the imported SNPs by position, so only the SNPs in common with the external data can be compared.
3. MetMinDepth: The SNP sites must have a minimum depth of coverage of 20X at that position. Anything with lower coverage is ignored in the concordance check.
4. Reference: How many SNP calls match the reference sequence (i.e., build 37).
5. RefMatch: How many of the SNP sites that match the reference sequence also match the external array data.
6. Variant: How many SNP calls differ from the reference sequence (i.e., build 37).
7. VarMatch: How many variant SNP sites match the external array data.

For both reference mismatches and variant calls, we also evaluate whether the discordant calls changed from heterozygous to homozygous or vice versa. Finally, the % concordance is calculated as:

    (RefMatch + VarMatch) / MetMinDepth
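
As a worked illustration of the concordance formula and the 90% flagging rule, here is a small Python sketch. The function names and the example counts are hypothetical and are not taken from any actual QC report.

```python
def concordance(ref_match, var_match, met_min_depth):
    """Concordance = (RefMatch + VarMatch) / MetMinDepth, as a fraction."""
    if met_min_depth <= 0:
        raise ValueError("MetMinDepth must be positive")
    return (ref_match + var_match) / met_min_depth


def is_flagged(ref_match, var_match, met_min_depth, threshold=0.90):
    """A sample is flagged when its overall concordance falls below the threshold."""
    return concordance(ref_match, var_match, met_min_depth) < threshold


# Hypothetical counts for two samples, each with 1,000 sites meeting 20X depth.
print(concordance(900, 80, 1000))  # sample A
print(is_flagged(900, 80, 1000))   # sample A passes
print(is_flagged(850, 40, 1000))   # sample B is flagged
```

Note that only sites counted in MetMinDepth enter the denominator, so a sample with few sites at 20X coverage yields a concordance estimate based on little data.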

6 Relationship QC

All offspring are required to have a significant relationship with their parents. To evaluate this, BEAGLE's fastibd command is used to calculate the identity by descent (IBD) between children and their expected parents. This is done at the family level using both common and private SNPs within the target region. Variant sites are included in the calculation if they meet the following criteria: the site is in the target region, is variant in at least one individual, and has 20X coverage in all individuals. After fastibd evaluation of these sites, the number of shared markers between each parent-child pair is calculated. If every marker is shared, then those two individuals share 50% of their genome (this is the maximum that fastibd can detect, since it doesn't consider both haplotypes together when comparing individuals). If less than 40% of the target region is shared between parent and child in this way, the family is flagged as failing QC.

If a family fails the initial, family-level QC evaluation (i.e., one parent is not highly related to the child), then that entire family is subsequently evaluated as part of a pool containing all families that failed the initial QC. For this cross-family IBD assessment, sites are selected as follows: the site is in the target region and is variant in any individual, and individual genotypes are set to missing if coverage in that individual is <20X. As with the initial QC, if less than 40% of the target region is shared between parent and child in this way, the family is flagged as failing the QC. If this cross-family QC identifies high IBD sharing between two ostensibly unrelated samples, manual checking is performed to confirm a sample swap.

7 De novo and inherited variant calling

De novo variants and inherited variants were called using polymutt 0.11 (https://github.com/ernfrid/polymutt, http://genome.sph.umich.edu/wiki/polymutt) with the calling restricted to chromosomes containing target regions and all other options set to their defaults. GLF files were generated for input to polymutt using samtools-0.1.7a-hybrid (https://github.com/statgen/samtools-0.1.7a-hybrid) with BAQ applied as in the following command:

    samtools-hybrid view -uh some.bam \
      | samtools-hybrid calmd -Aur - refseq.fa 2> /dev/null \
      | samtools-hybrid pileup - -g -f refseq.fa > output.glf

Polymutt has two modes of variant calling, one for standard calling and one for de novo mutation calling. The VCF files for both of these modes were merged into a single VCF for each family and filters were applied.

We used bam-readcount v0.4 (https://github.com/genome/bamreadcount) with a minimum base quality of 15 (-b 15) to generate metrics (for both de novo and germline variant calls) and marked sites as filtered based on the following requirements:

1. Minimum variant base frequency at the site of 5%
2. Percent of reads supporting the variant on the plus strand ≥ 1% and ≤ 99% (variants failing these criteria are filtered only if the reads supporting the reference do not show a similar bias)
3. Minimum variant base count of 4
4. Variant falls within the middle 90% of the aligned portion of the read
5. Maximum difference between the quality sum of mismatching bases in reads supporting the variant and reads supporting the reference of 100
6. Maximum mapping quality difference between reads supporting the variant and reads supporting the reference of 30
7. Maximum difference in aligned read length between reads supporting the variant base and reads supporting the reference base of 25
8. Minimum average distance to the effective 3' end of the read for variant supporting reads of 20% of the sequenced read length
9. Maximum length of a flanking homopolymer run of the variant base of 5

In addition to the above filters, a binomial test filter is applied to the de novo calls in order to remove likely false positive de novo calls. The input to the test is calculated by generating readcounts with base quality ≥ 15 and mapping quality ≥ 20 for all unaffected family members at a putative de novo mutation location. For each individual, reads are divided into those supporting the de novo allele and those supporting some other allele. Using an assumed error rate of 0.01, we calculate the probability that the reads came from a binomial distribution with p = 0.01, where p is the fraction of reads supporting the de novo allele. If the resulting p-value is less than 10^-4 for any one unaffected family member, the de novo prediction is marked as failing the filter.

8 Generation of Multi-family SNP genotypes

Once all variant sites in all samples were predicted, sites were limited to the precise target space of the capture product, buffered by 500 bp on either side, and aggregated into a list of segregating sites for the cohort. Each segregating site was (re)genotyped in all samples using polymuttvcf-0.01 (polymutt pos). The resulting genotypes were added to sites missing from the original VCF for each sample in order to distinguish between missing data and homozygous reference calls. The resulting single-sample VCF files, containing genotypes for all segregating sites, were subsequently merged using joinx 1.6 (http://gmt.genome.wustl.edu/joinx). All variant calls from this process are included in the final files.
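
The binomial filter on de novo calls described in Section 7 can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the text does not say which tail is used, so the sketch assumes a one-sided upper-tail p-value, P(X ≥ k) for X ~ Binomial(n, 0.01), where k of the n reads in an unaffected family member support the de novo allele. The function names and read counts are hypothetical.

```python
from math import comb


def binom_upper_tail(k, n, p=0.01):
    """P(X >= k) for X ~ Binomial(n, p), summed directly over the upper tail."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def fails_de_novo_filter(unaffected_counts, error_rate=0.01, alpha=1e-4):
    """unaffected_counts holds one (supporting_reads, total_reads) pair per
    unaffected family member. The call fails if any member's p-value is
    below alpha (assumed upper-tail test; see the lead-in)."""
    return any(
        binom_upper_tail(k, n, error_rate) < alpha for k, n in unaffected_counts
    )


# Hypothetical read counts at one putative de novo site.
print(fails_de_novo_filter([(0, 40), (1, 60)]))  # consistent with a 1% error rate
print(fails_de_novo_filter([(0, 40), (6, 50)]))  # 6 of 50 reads in one parent
```

Under this reading, a very small p-value means an unaffected relative carries more reads supporting the allele than a 1% error rate would plausibly produce, which argues against the variant being truly de novo.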

Part II
Descriptive Statistics

9 SNPs and SNVs

Here we can break down the distribution of polymorphic markers (SNPs) vs. non-polymorphic markers (SNVs).

10 MAFs and Heterozygosity

Here we can display the distribution of minor allele frequencies and heterozygosity across ethnicities.

Part III
Gen. Epi.

We begin with the common variants. Note that we refer to variants with an estimated population allele frequency greater than 0.01 as polymorphisms.

11 Polymorphisms (maf ≥ 0.01)

11.1 LD & Ethnicity

Here we can investigate LD structure and how it differs across ethnicities. Likewise with heterozygosity.

11.2 Linkage & Association

We begin with four different formulations of the classic Transmission Disequilibrium Test (TDT). The TDT tests for an increase in the transmission rate of an allele from parent to affected offspring. If that rate is significantly greater than the expected rate of 1/2 under Mendelian inheritance, then we conclude that transmission is preferred over non-transmission among the affected offspring, and that the genetic marker is near the causal locus.

11.2.1 Clayton's regression-based test

Allelic or genotypic. R Package: snpstats

Given large-scale SNP data for families comprising both parents and one or more affected offspring, this function computes 1 df tests (the TDT test) and a 2 df test based on observed and expected transmissions of genotypes. Tests based on imputation rules can also be carried out.

11.2.2 Holger's gtdt and standard TDT

Allelic or genotypic. R Package: trio

12 Rare Variants (maf < 0.01)

12.1 Scan Statistic for Rare Variants in Trios

We have received and compiled the C++ source code for scan-trios from Dr. Iuliana Ionita-Laza at Columbia University Dept. of Biostatistics. Now we need to make the input files. The required input files are: pedigree, map, regions and weights. The pedigree file is slightly atypical, as the rows must be ordered to indicate trio structure. The program does not use the information present in the pedigree file to order the data; the ordering must be done manually. I do not know what the pedigree files that we have made using vcftools and plink look like.

12.2 de novo mutations