Oral Cleft Targeted Sequencing Project

Oral Cleft Group
January, 2013

Contents

I Quality Control
  1 Summary of Multi-Family vcf File, Jan. 11, 2013
  2 Analysis Group Quality Control (Proposed Protocol)
    2.1 vcftools
  3 Targeted Region Capture & Read Generation
  4 Sequence Alignment and Processing
  5 Sample QC
  6 Relationship QC
  7 De novo and inherited variant calling
  8 Generation of Multi-family SNP genotypes
II Descriptive Statistics
  9 SNPs and SNVs
  10 MAFs and Heterozygosity
III Gen. Epi.
  11 Polymorphisms (maf ≥ 0.01)
    11.1 LD & Ethnicity
    11.2 Linkage & Association
      snpStats
      trio
  12 Rare Variants (maf < 0.01)
    12.1 Scan Statistic for Rare Variants in Trios
    12.2 de novo mutations

List of Tables
  1 Ethnicities
  2 Genotype flags

List of Figures
  1 Missingness
  2 GQ histogram

Part I Quality Control

1 Summary of Multi-Family vcf File, Jan. 11, 2013

In total, the vcf file contains 4,495 individuals and 175,189 markers. The target regions span 6.7 MB, for a marker density of 1 marker per 38 bp. Of the 4,495 subjects in the vcf file, only 4,139 are also contained in the pedigree file. The breakdown of subjects in the vcf and pedigree files by ethnicity is given in Table 1. Figure 1 displays the missingness per subject and per site, and Figure 2 the genotype quality (GQ) averaged across subjects.

2 Analysis Group Quality Control (Proposed Protocol)

We begin by defining the characteristics on which to filter and the criterion for exclusion on each. We divide the filters into three types (as does vcftools): genotype, subject and site filters. The following is an outline of our initial protocol.

1. Genotype Filters
     Remove all non-PASS flagged genotypes (see Table 2)
       vcftools --remove-filtered-geno FLAGNAME
     Filter on genotype quality (minimum GQ of 40) and depth (minimum of 10)
       vcftools --mingq 40 --mindp 10
2. Subject Filters
     Remove subjects with missingness
       vcftools --mind
     Remove subjects with average coverage below 20
       vcftools --min-indv-meandp 20
3. Site Filters
     Remove markers with missingness above 0.05
       vcftools --geno 0.95
     Remove markers with mean depth below 20
       vcftools --min-meandp 20
     Di-allelic variants only
       vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

The Quality, Filter and Info fields are not available in our vcf file. Should we also filter markers on mean GQ, with 90 as the cutoff? Perhaps we should use the median instead. As of Jan. 11 we have not implemented a site-wide filter on GQ. We should also investigate Mendelian inconsistencies and Hardy-Weinberg equilibrium.
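The genotype-level filters in the protocol above can be illustrated with a short Python sketch. This is not how vcftools is implemented; the record layout and the name `filter_genotype` are hypothetical, and the sketch assumes that a failing call is set to missing rather than dropped.

```python
# Hypothetical sketch of the genotype-level filters (not vcftools code).
MIN_GQ = 40  # protocol threshold (--mingq 40)
MIN_DP = 10  # protocol threshold (--mindp 10)


def filter_genotype(call):
    """Return the genotype, or './.' (missing) if the call fails a filter.

    `call` uses an assumed layout, e.g.
    {"gt": "0/1", "flag": "PASS", "gq": 99, "dp": 35}.
    """
    if call["flag"] != "PASS":  # any non-PASS flag removes the genotype
        return "./."
    if call["gq"] < MIN_GQ or call["dp"] < MIN_DP:
        return "./."
    return call["gt"]


calls = [
    {"gt": "0/1", "flag": "PASS", "gq": 99, "dp": 35},  # passes all filters
    {"gt": "0/1", "flag": "SB1", "gq": 99, "dp": 35},   # flagged genotype
    {"gt": "1/1", "flag": "PASS", "gq": 30, "dp": 35},  # GQ too low
    {"gt": "0/0", "flag": "PASS", "gq": 60, "dp": 8},   # depth too low
]
filtered = [filter_genotype(c) for c in calls]
```

Only the first toy call survives; the other three become missing and therefore count toward the subject and site missingness filters.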

2.1 vcftools

We implement the above filters in a single run of vcftools, using the following command.

    vcftools --gzvcf $vcf \
      --remove-filtered-geno NRC \
      --remove-filtered-geno SB1 \
      --remove-filtered-geno IRC \
      --remove-filtered-geno MMQSD50 \
      --remove-filtered-geno PB10 \
      --remove-filtered-geno MQD30 \
      --remove-filtered-geno DETP20 \
      --remove-filtered-geno MVC4 \
      --remove-filtered-geno HPMR5 \
      --remove-filtered-geno MVF5 \
      --remove-filtered-geno RLD25 \
      --mingq 40 \
      --mindp 10 \
      --mind \
      --min-indv-meandp 20 \
      --geno 0.95 \
      --min-meandp 20 \
      --min-alleles 2 \
      --max-alleles 2 \
      --recode \
      --out $out \
      1> $logfile 2> $errfile
    bgzip $out.recode.vcf
    tabix $out.recode.vcf.gz

The run took about six hours and 18 minutes of wall-clock time, but very little memory.

    Job (qc) Complete
    User = syounkin
    Queue = gwas.q@compute-0-43.local
    Host = compute-0-43.local
    Start Time = 01/10/ :59:15
    End Time = 01/11/ :17:38
    User Time = 06:14:17
    System Time = 00:00:55
    Wallclock Time = 06:18:23
    CPU = 06:15:12
    Max vmem = M
    Exit Status = 0

I do not know the order in which the filters were processed, and the order could make a difference. Running the output through the same filters a second time could alleviate some of those concerns. The results could still be order-dependent, but the differences are likely to be negligible.
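To see why filter order can matter, consider a toy missingness matrix in Python. This sketch says nothing about the order vcftools actually uses; the 0.4 threshold, the data and the function names are invented purely for illustration.

```python
# Toy illustration of order dependence between subject and site filters.
# 1 = missing genotype, 0 = called genotype; rows are subjects, columns sites.
miss = {
    "S1": [1, 1, 0, 0],
    "S2": [0, 1, 0, 0],
    "S3": [0, 0, 0, 0],
}
MAX_MISSING = 0.4  # hypothetical threshold, not taken from the protocol


def subject_filter(miss, subjects, sites, max_missing):
    """Keep subjects whose missingness over the given sites is acceptable."""
    return [s for s in subjects
            if sum(miss[s][j] for j in sites) / len(sites) <= max_missing]


def site_filter(miss, subjects, sites, max_missing):
    """Keep sites whose missingness over the given subjects is acceptable."""
    return [j for j in sites
            if sum(miss[s][j] for s in subjects) / len(subjects) <= max_missing]


subjects, sites = list(miss), [0, 1, 2, 3]

# Order A: subject filter first, then site filter.
subj_a = subject_filter(miss, subjects, sites, MAX_MISSING)
sites_a = site_filter(miss, subj_a, sites, MAX_MISSING)

# Order B: site filter first, then subject filter.
sites_b = site_filter(miss, subjects, sites, MAX_MISSING)
subj_b = subject_filter(miss, subjects, sites_b, MAX_MISSING)
```

In this toy case both orders keep the same three sites, but subject S1 is dropped under order A and retained under order B, because removing the worst site first lowers S1's missingness.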

Ethnicity               Count
-----------------------------
European                  968
Chinese                 1,371
Filipino                1,776
Guatemalan                 24
-----------------------------
In VCF and Pedigree     4,139
In VCF                  4,495
In Pedigree             4,998

Table 1: Ethnicities

Flag      Description
----------------------------------------------------------------------
NRC       Unable to grab readcounts for variant allele
SB1       Reads supporting the variant have less than 0.01 fraction of
          the reads on one strand, but reference supporting reads are
          not similarly biased
IRC       Unable to grab any sort of readcount for either the
          reference or the variant allele
MMQSD50   Difference in average mismatch quality sum between variant
          and reference supporting reads is greater than 50
PB10      Average position on read less than 0.10 or greater than 0.9
          fraction of the read length
MQD30     Difference in average mapping quality sum between variant
          and reference supporting reads is greater than 30
DETP20    Average distance of the variant base to the effective 3'
          end is less than 0.20
MVC4      Fewer than 4 high quality reads support the variant
HPMR5     Variant is flanked by a homopolymer of the same base and of
          length greater than or equal to 5
MVF5      Variant allele frequency is less than 0.05
RLD25     Difference in average clipped read length between variant
          and reference supporting reads is greater than 25

Table 2: Flags for genotypes found in vcf file. All genotypes with any of these flags were removed with vcftools. (Presumably, to remove a genotype the call is set to missing.)
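The counts in Table 1 and the marker density quoted in Section 1 can be cross-checked with a few lines of Python. The variable names are ours; the numbers are copied from the report.

```python
# Cross-check of the subject counts in Table 1 and the marker density.
ethnicity_counts = {
    "European": 968,
    "Chinese": 1371,
    "Filipino": 1776,
    "Guatemalan": 24,
}
in_vcf_and_pedigree = 4139
in_vcf = 4495
in_pedigree = 4998

ethnicity_total = sum(ethnicity_counts.values())
vcf_only = in_vcf - in_vcf_and_pedigree            # in vcf, not in pedigree
pedigree_only = in_pedigree - in_vcf_and_pedigree  # in pedigree, not in vcf

# Marker density: 6.7 MB of target sequence over 175,189 markers.
bp_per_marker = 6.7e6 / 175189
```

The four ethnicity counts sum to 4,139, which suggests the breakdown covers only subjects present in both files. That leaves 356 subjects in the vcf file without a pedigree entry and 859 pedigree entries without sequence data.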

[Figure 1: Missingness. Two panels: histogram of missing.snp and histogram of missing.subject; y-axis: Frequency.]

[Figure 2: GQ histogram (Cleft Targeted Sequencing). x-axis: Mean GQ per marker; y-axis: Frequency.]

3 Targeted Region Capture & Read Generation

Do we need to discuss the methods behind the physical targeting of the regions? I'm curious to know how we created the fragments for sequencing.

4 Sequence Alignment and Processing

Data are aligned to the GRCh37-lite reference sequence with BWA v0.5.9, using quality trimming (-q 5) to remove low-quality bases at the ends of reads. Data from individual runs are merged, if necessary, with Picard v1.46. All reads are deduplicated using Picard MarkDuplicates.

5 Sample QC

Coverage across the target regions is evaluated using RefCov, and >70% of targets must reach an average coverage of 20X in order to pass QC. If genotypes from another platform are available, a genotyping concordance QC is performed by comparing genotypes called using Samtools to those from the outside platform. Any samples with an overall concordance below 90% are flagged. Columns from this QC report are listed below:

1. SNPs called: SNPs reported by Samtools.
2. With Genotype: The SNPs called are compared to the imported SNPs by position, so only the SNPs in common with the external data can be compared.
3. MetMinDepth: The SNP sites must have a minimum depth of coverage of 20X at that position. Anything with lower coverage is ignored in the concordance check.
4. Reference: How many SNP calls match the reference sequence (i.e., build 37).
5. RefMatch: How many of the SNP sites that match the reference sequence also match the external array data.
6. Variant: How many SNP calls differ from the reference sequence (i.e., build 37).
7. VarMatch: How many variant SNP sites match the external array data.

Whether the discordant calls changed from heterozygous to homozygous or vice versa is also evaluated, for both reference mismatches and variant calls. Finally, the % concordance is calculated as (RefMatch + VarMatch)/MetMinDepth.

6 Relationship QC

All offspring are required to have a significant relationship with their parents. To evaluate this, BEAGLE's fastibd command is used to calculate the identity by descent between children and their expected parents. This is done at the family level using both common and private SNPs within the target region. Variant sites are included in the calculation if they meet the following criteria: the site is in the target region, is variant in at least one individual, and has 20X coverage in all individuals. After fastibd evaluation of these sites, the number of shared markers between each parent-child pair is calculated. If every marker is shared, then those two individuals share 50% of their genome (this is the maximum that fastibd can detect, since it doesn't consider both haplotypes together when comparing individuals). If less than 40% of the target region is shared between parent and child in this way, the family is flagged as failing QC.

If a family fails the initial, family-level QC evaluation (i.e., one parent is not highly related to the child), then that entire family is subsequently evaluated as part of a pool containing all families failing the initial QC. For this cross-family IBD assessment, sites are selected as follows: the site is in the target region and is variant in any individual, and individual genotypes are set to missing if coverage in that individual is <20X. As with the initial QC, if less than 40% of the target region is shared between parent and child in this way, the family is flagged as failing the QC. If this cross-family QC identifies high IBD sharing between two ostensibly unrelated samples, manual checking is performed to confirm a sample swap.

7 De novo and inherited variant calling

De novo variants and inherited variants were called using polymutt 0.11 (com/ernfrid/polymutt), with the calling restricted to chromosomes containing target regions and all other options set to their defaults. GLF files were generated for input to polymutt using samtools-0.1.7a-hybrid with BAQ applied, as in the following command:

    samtools-hybrid view uh some.bam samtools-hybrid calmd Aur refseq.fa 2> /dev/null samtools-hybrid pileup - -g r refseq.fa > output.glf

Polymutt has two modes of variant calling, one for standard calling and one for de novo mutation calling. The VCF files from both modes were merged into a single VCF for each family and filters were applied. We used bam-readcount v0.4 with a minimum base quality of 15 (-b 15) to generate metrics (for both de novo and germline variant calls) and marked sites as filtered based on the following requirements:

1. Minimum variant base frequency at the site of 5%.
2. Percent of reads supporting the variant on the plus strand between 1% and 99% (variants failing these criteria are filtered only if the reads supporting the reference do not show a similar bias).
3. Minimum variant base count of 4.
4. Variant falls within the middle 90% of the aligned portion of the read.
5. Maximum difference between the quality sum of mismatching bases in reads supporting the variant and reads supporting the reference of 50.
6. Maximum mapping quality difference between reads supporting the variant and reads supporting the reference of 30.
7. Maximum difference in aligned read length between reads supporting the variant base and reads supporting the reference base of 25.
8. Minimum average distance to the effective 3' end of the read for variant supporting reads of 20% of the sequenced read length.
9. Maximum length of a flanking homopolymer run of the variant base of 5.

In addition to the above filters, a binomial test filter is applied to the de novo calls in order to remove likely false positives. The input to the test is calculated by generating readcounts with base quality ≥ 15 and mapping quality ≥ 20 for all unaffected family members at a putative de novo mutation location. For each individual, reads are divided into those supporting the de novo allele and those supporting some other allele. Using an assumed error rate of 0.01, the probability that the reads came from a binomial distribution with p = 0.01 (the fraction of reads supporting the de novo allele) is calculated. If the resulting p-value is less than 10^-4 for any one unaffected individual, the de novo prediction is marked as failing the filter.

8 Generation of Multi-family SNP genotypes

Once all variant sites in all samples were predicted, sites were limited to the precise target space of the capture product, buffered by 500 bp on either side, and aggregated into a list of segregating sites for the cohort. Each segregating site was (re)genotyped in all samples using polymuttvcf-0.01 (polymutt pos). The resulting genotypes were added to sites missing from the original VCF for each sample in order to distinguish between missing data and homozygous reference calls. The resulting single-sample VCF files, containing genotypes for all segregating sites, were subsequently merged using joinx1.6. All variant calls from this process are included in the final files.
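Two of the calculations described above, the concordance percentage of Section 5 and the binomial filter of Section 7, can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the p-value is taken to be the one-sided upper tail P(X >= k), which the text does not specify.

```python
from math import comb


def concordance(ref_match, var_match, met_min_depth):
    """Concordance as defined in Section 5: (RefMatch + VarMatch)/MetMinDepth."""
    if met_min_depth == 0:
        raise ValueError("no sites met the minimum depth")
    return (ref_match + var_match) / met_min_depth


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def fails_denovo_filter(unaffected_counts, error_rate=0.01, alpha=1e-4):
    """Flag a de novo call if any unaffected relative carries too many
    reads supporting the de novo allele to be explained by sequencing error.

    `unaffected_counts` is an assumed layout: a list of
    (reads supporting the de novo allele, total reads) per unaffected member.
    The upper-tail test is our assumption, not stated in the report.
    """
    return any(binom_upper_tail(k, n, error_rate) < alpha
               for k, n in unaffected_counts)


# Toy examples with invented counts.
sample_concordance = concordance(ref_match=900, var_match=80, met_min_depth=1000)
flagged_sample = sample_concordance < 0.90        # flag threshold from Section 5

clean_parents = fails_denovo_filter([(0, 40), (1, 40)])
suspect_parent = fails_denovo_filter([(0, 40), (5, 40)])
```

With these invented counts the sample passes the 90% concordance threshold, one supporting read in 40 is compatible with a 1% error rate, and five supporting reads in 40 fall below the 10^-4 cutoff, so the second de novo call would be filtered.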

Part II Descriptive Statistics

9 SNPs and SNVs

Here we can break down the distribution of polymorphic markers (SNPs) vs. non-polymorphic markers (SNVs).

10 MAFs and Heterozygosity

Here we can display the distribution of minor allele frequencies and heterozygosity across ethnicities.
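As a starting point for these summaries, the per-marker quantities can be computed from genotype counts. The Python sketch below is illustrative only: the function names are hypothetical, and the SNP/SNV split assumes the 0.01 allele-frequency cutoff that Part III uses to separate polymorphisms from rare variants.

```python
def maf_and_heterozygosity(n_hom_ref, n_het, n_hom_alt):
    """Minor allele frequency and observed heterozygosity at one
    di-allelic marker, from counts of non-missing genotypes."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no called genotypes at this marker")
    alt_freq = (2 * n_hom_alt + n_het) / (2 * n)
    maf = min(alt_freq, 1 - alt_freq)
    heterozygosity = n_het / n
    return maf, heterozygosity


def classify(maf, cutoff=0.01):
    """Label a marker as 'SNP' (polymorphic) or 'SNV' (non-polymorphic).

    The cutoff is an assumption borrowed from Part III.
    """
    return "SNP" if maf >= cutoff else "SNV"


# Toy markers with invented genotype counts.
common_maf, common_het = maf_and_heterozygosity(810, 180, 10)
rare_maf, rare_het = maf_and_heterozygosity(995, 5, 0)
labels = [classify(common_maf), classify(rare_maf)]
```

Running the same function within each ethnicity group would give the per-ethnicity distributions mentioned above.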

Part III Gen. Epi.

We begin with the common variants. Note that we refer to variants with estimated population allele frequency greater than 0.01 as polymorphisms.

11 Polymorphisms (maf ≥ 0.01)

11.1 LD & Ethnicity

Here we can investigate LD structure and how it differs across ethnicities. Likewise for heterozygosity.

11.2 Linkage & Association

We begin with four different formulations of the classic Transmission Disequilibrium Test (TDT). The TDT tests for an increase in the transmission rate of an allele from parent to offspring. If that rate is significantly greater than the expected rate of 1/2 under Mendelian inheritance, then we conclude that transmission is preferred over non-transmission among the affected offspring, and that the genetic marker is near the causal locus.

Clayton's regression-based test
Allelic or genotypic. R Package: snpStats
Given large-scale SNP data for families comprising both parents and one or more affected offspring, this function computes 1 df tests (the TDT test) and a 2 df test based on observed and expected transmissions of genotypes. Tests based on imputation rules can also be carried out.

Holger's gtdt and standard TDT
Allelic or genotypic. R Package: trio

12 Rare Variants (maf < 0.01)

12.1 Scan Statistic for Rare Variants in Trios

We have received and compiled the C++ source code for scan-trios from Dr. Iuliana Ionita-Laza at Columbia University Dept. of Biostatistics. Now we need to make the input files. The required input files are: pedigree, map, regions and weights. The pedigree file is slightly atypical, as the rows must be ordered to indicate trio structure. The program does not use the information present in the pedigree file to order the data; the ordering must be done manually. I do not know what the pedigree files that we have made using vcftools and plink look like.

12.2 de novo mutations
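Returning to the classic TDT of Section 11.2, the allelic form of the test can be sketched in Python as a McNemar-type statistic. This is a textbook version for illustration, not the snpStats or trio implementation; the function name and the counts are hypothetical.

```python
from math import erfc, sqrt


def classic_tdt(transmitted, untransmitted):
    """Classic allelic TDT.

    transmitted:   times heterozygous parents passed the candidate allele
                   to an affected offspring
    untransmitted: times heterozygous parents passed the other allele
    Returns the chi-square statistic (1 df) and its p-value.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Upper tail of a chi-square with 1 df: P(|Z| > sqrt(chi2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Toy example with invented counts: 60 transmissions, 40 non-transmissions.
chi2, p_value = classic_tdt(60, 40)
over_transmitted = 60 > 40  # direction must be checked separately
```

Under Mendelian inheritance the two counts are expected to be equal, so the statistic measures departure from a transmission rate of 1/2. Because the statistic itself is two-sided, the direction of the excess has to be read from the counts.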


More information

Genetic Variation and Genome- Wide Association Studies. Keyan Salari, MD/PhD Candidate Department of Genetics

Genetic Variation and Genome- Wide Association Studies. Keyan Salari, MD/PhD Candidate Department of Genetics Genetic Variation and Genome- Wide Association Studies Keyan Salari, MD/PhD Candidate Department of Genetics How many of you did the readings before class? A. Yes, of course! B. Started, but didn t get

More information

Genetic data concepts and tests

Genetic data concepts and tests Genetic data concepts and tests Cavan Reilly September 21, 2018 Table of contents Overview Linkage disequilibrium Quantifying LD Heatmap for LD Hardy-Weinberg equilibrium Genotyping errors Population substructure

More information

An introduction to genetics and molecular biology

An introduction to genetics and molecular biology An introduction to genetics and molecular biology Cavan Reilly September 5, 2017 Table of contents Introduction to biology Some molecular biology Gene expression Mendelian genetics Some more molecular

More information

MONTE CARLO PEDIGREE DISEQUILIBRIUM TEST WITH MISSING DATA AND POPULATION STRUCTURE

MONTE CARLO PEDIGREE DISEQUILIBRIUM TEST WITH MISSING DATA AND POPULATION STRUCTURE MONTE CARLO PEDIGREE DISEQUILIBRIUM TEST WITH MISSING DATA AND POPULATION STRUCTURE DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate

More information

Crash-course in genomics

Crash-course in genomics Crash-course in genomics Molecular biology : How does the genome code for function? Genetics: How is the genome passed on from parent to child? Genetic variation: How does the genome change when it is

More information

QTL Mapping Using Multiple Markers Simultaneously

QTL Mapping Using Multiple Markers Simultaneously SCI-PUBLICATIONS Author Manuscript American Journal of Agricultural and Biological Science (3): 195-01, 007 ISSN 1557-4989 007 Science Publications QTL Mapping Using Multiple Markers Simultaneously D.

More information

MPG NGS workshop I: SNP calling

MPG NGS workshop I: SNP calling MPG NGS workshop I: SNP calling Mark DePristo Manager, Medical and Popula

More information

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl Next Generation Sequencing Bioinformatics small variants Data Analysis Guidelines genomescan.nl GenomeScan s Guidelines for Small Variant Analysis on NGS Data Using our own proprietary data analysis pipelines

More information

Quality Control Report for Exome Chip Data University of Michigan April, 2015

Quality Control Report for Exome Chip Data University of Michigan April, 2015 Quality Control Report for Exome Chip Data University of Michigan April, 2015 Project: Health and Retirement Study Support: U01AG009740 NIH Institute: NIA 1. Summary and recommendations for users A total

More information

Redefine what s possible with the Axiom Genotyping Solution

Redefine what s possible with the Axiom Genotyping Solution Redefine what s possible with the Axiom Genotyping Solution From discovery to translation on a single platform The Axiom Genotyping Solution enables enhanced genotyping studies to accelerate your research

More information

Topics in Statistical Genetics

Topics in Statistical Genetics Topics in Statistical Genetics INSIGHT Bioinformatics Webinar 2 August 22 nd 2018 Presented by Cavan Reilly, Ph.D. & Brad Sherman, M.S. 1 Recap of webinar 1 concepts DNA is used to make proteins and proteins

More information

Whole Genome Sequencing. Biostatistics 666

Whole Genome Sequencing. Biostatistics 666 Whole Genome Sequencing Biostatistics 666 Genomewide Association Studies Survey 500,000 SNPs in a large sample An effective way to skim the genome and find common variants associated with a trait of interest

More information

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009 UHT Sequencing Course Large-scale genotyping Christian Iseli January 2009 Overview Introduction Examples Base calling method and parameters Reads filtering Reads classification Detailed alignment Alignments

More information

Supplementary Note: Detecting population structure in rare variant data

Supplementary Note: Detecting population structure in rare variant data Supplementary Note: Detecting population structure in rare variant data Inferring ancestry from genetic data is a common problem in both population and medical genetic studies, and many methods exist to

More information

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang Supplementary Materials for: Detecting very low allele fraction variants using targeted DNA sequencing and a novel molecular barcode-aware variant caller Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John

More information

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR Human Genetic Variation Ricardo Lebrón rlebron@ugr.es Dpto. Genética UGR What is Genetic Variation? Origins of Genetic Variation Genetic Variation is the difference in DNA sequences between individuals.

More information

Human Genetics and Gene Mapping of Complex Traits

Human Genetics and Gene Mapping of Complex Traits Human Genetics and Gene Mapping of Complex Traits Advanced Genetics, Spring 2018 Human Genetics Series Thursday 4/5/18 Nancy L. Saccone, Ph.D. Dept of Genetics nlims@genetics.wustl.edu / 314-747-3263 What

More information

Haplotype phasing in large cohorts: Modeling, search, or both?

Haplotype phasing in large cohorts: Modeling, search, or both? Haplotype phasing in large cohorts: Modeling, search, or both? Po-Ru Loh Harvard T.H. Chan School of Public Health Department of Epidemiology Broad MIA Seminar, 3/9/16 Overview Background: Haplotype phasing

More information

Enhanced Resolution and Statistical Power Through SNP Distributions Within the Short Tandem Repeats

Enhanced Resolution and Statistical Power Through SNP Distributions Within the Short Tandem Repeats Enhanced Resolution and Statistical Power Through SNP Distributions Within the Short Tandem Repeats John V. Planz, Ph.D. Associate Professor, Associate Director UNT Center for Human Identification UNT

More information

Population Genetics. If we closely examine the individuals of a population, there is almost always PHENOTYPIC

Population Genetics. If we closely examine the individuals of a population, there is almost always PHENOTYPIC 1 Population Genetics How Much Genetic Variation exists in Natural Populations? Phenotypic Variation If we closely examine the individuals of a population, there is almost always PHENOTYPIC VARIATION -

More information

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science + UAB DNA-Seq Analysis Workshop John Osborne Research Associate Centers for Clinical and Translational Science ozborn@uab.,edu + Thanks in advance You are the Guinea pigs for this workshop! At this point

More information

Module 2: Introduction to PLINK and Quality Control

Module 2: Introduction to PLINK and Quality Control Module 2: Introduction to PLINK and Quality Control 1 Introduction to PLINK 2 Quality Control 1 Introduction to PLINK 2 Quality Control Single Nucleotide Polymorphism (SNP) A SNP (pronounced snip) is a

More information

ARTICLE Haplotype Estimation Using Sequencing Reads

ARTICLE Haplotype Estimation Using Sequencing Reads ARTICLE Haplotype Estimation Using Sequencing Reads Olivier Delaneau, 1 Bryan Howie, 2 Anthony J. Cox, 3 Jean-François Zagury, 4 and Jonathan Marchini 1,5, * High-throughput sequencing technologies produce

More information

Application of Genotyping-By-Sequencing and Genome-Wide Association Analysis in Tetraploid Potato

Application of Genotyping-By-Sequencing and Genome-Wide Association Analysis in Tetraploid Potato Application of Genotyping-By-Sequencing and Genome-Wide Association Analysis in Tetraploid Potato Sanjeev K Sharma Cell and Molecular Sciences The 3 rd Plant Genomics Congress, London 12 th May 2015 Potato

More information

b. (3 points) The expected frequencies of each blood type in the deme if mating is random with respect to variation at this locus.

b. (3 points) The expected frequencies of each blood type in the deme if mating is random with respect to variation at this locus. NAME EXAM# 1 1. (15 points) Next to each unnumbered item in the left column place the number from the right column/bottom that best corresponds: 10 additive genetic variance 1) a hermaphroditic adult develops

More information

Prof. Dr. Konstantin Strauch

Prof. Dr. Konstantin Strauch Genetic Epidemiology and Personalized Medicine Prof. Dr. Konstantin Strauch IBE - Lehrstuhl für Genetische Epidemiologie Ludwig-Maximilians-Universität Institut für Genetische Epidemiologie Helmholtz-Zentrum

More information

Population and Statistical Genetics including Hardy-Weinberg Equilibrium (HWE) and Genetic Drift

Population and Statistical Genetics including Hardy-Weinberg Equilibrium (HWE) and Genetic Drift Population and Statistical Genetics including Hardy-Weinberg Equilibrium (HWE) and Genetic Drift Heather J. Cordell Professor of Statistical Genetics Institute of Genetic Medicine Newcastle University,

More information

Genome wide association studies. How do we know there is genetics involved in the disease susceptibility?

Genome wide association studies. How do we know there is genetics involved in the disease susceptibility? Outline Genome wide association studies Helga Westerlind, PhD About GWAS/Complex diseases How to GWAS Imputation What is a genome wide association study? Why are we doing them? How do we know there is

More information

PUBH 8445: Lecture 1. Saonli Basu, Ph.D. Division of Biostatistics School of Public Health University of Minnesota

PUBH 8445: Lecture 1. Saonli Basu, Ph.D. Division of Biostatistics School of Public Health University of Minnesota PUBH 8445: Lecture 1 Saonli Basu, Ph.D. Division of Biostatistics School of Public Health University of Minnesota saonli@umn.edu Statistical Genetics It can broadly be classified into three sub categories:

More information

Algorithms for Genetics: Introduction, and sources of variation

Algorithms for Genetics: Introduction, and sources of variation Algorithms for Genetics: Introduction, and sources of variation Scribe: David Dean Instructor: Vineet Bafna 1 Terms Genotype: the genetic makeup of an individual. For example, we may refer to an individual

More information

Introduction to Quantitative Genomics / Genetics

Introduction to Quantitative Genomics / Genetics Introduction to Quantitative Genomics / Genetics BTRY 7210: Topics in Quantitative Genomics and Genetics September 10, 2008 Jason G. Mezey Outline History and Intuition. Statistical Framework. Current

More information

Implementing direct and indirect markers.

Implementing direct and indirect markers. Chapter 16. Brian Kinghorn University of New England Some Definitions... 130 Directly and indirectly marked genes... 131 The potential commercial value of detected QTL... 132 Will the observed QTL effects

More information

DNA concentration and purity were initially measured by NanoDrop 2000 and verified on Qubit 2.0 Fluorometer.

DNA concentration and purity were initially measured by NanoDrop 2000 and verified on Qubit 2.0 Fluorometer. DNA Preparation and QC Extraction DNA was extracted from whole blood or flash frozen post-mortem tissue using a DNA mini kit (QIAmp #51104 and QIAmp#51404, respectively) following the manufacturer s recommendations.

More information

Structure, Measurement & Analysis of Genetic Variation

Structure, Measurement & Analysis of Genetic Variation Structure, Measurement & Analysis of Genetic Variation Sven Cichon, PhD Professor of Medical Genetics, Director, Division of Medcial Genetics, University of Basel Institute of Neuroscience and Medicine

More information

Estimation problems in high throughput SNP platforms

Estimation problems in high throughput SNP platforms Estimation problems in high throughput SNP platforms Rob Scharpf Department of Biostatistics Johns Hopkins Bloomberg School of Public Health November, 8 Outline Introduction Introduction What is a SNP?

More information

Genetics and Psychiatric Disorders Lecture 1: Introduction

Genetics and Psychiatric Disorders Lecture 1: Introduction Genetics and Psychiatric Disorders Lecture 1: Introduction Amanda J. Myers LABORATORY OF FUNCTIONAL NEUROGENOMICS All slides available @: http://labs.med.miami.edu/myers Click on courses First two links

More information

Package snpready. April 11, 2018

Package snpready. April 11, 2018 Version 0.9.6 Date 2018-04-11 Package snpready April 11, 2018 Title Preparing Genotypic Datasets in Order to Run Genomic Analysis Three functions to clean, summarize and prepare genomic datasets to Genome

More information

Jean-Simon Brouard 1, Brian Boyle 2, Eveline M. Ibeagha-Awemu 1 and Nathalie Bissonnette 1*

Jean-Simon Brouard 1, Brian Boyle 2, Eveline M. Ibeagha-Awemu 1 and Nathalie Bissonnette 1* Brouard et al. BMC Genetics (2017) 18:32 DOI 10.1186/s12863-017-0501-y RESEARCH ARTICLE Low-depth genotyping-by-sequencing (GBS) in a bovine population: strategies to maximize the selection of high quality

More information

Using VarSeq to Improve Variant Analysis Research

Using VarSeq to Improve Variant Analysis Research Using VarSeq to Improve Variant Analysis Research June 10, 2015 G Bryce Christensen Director of Services Questions during the presentation Use the Questions pane in your GoToWebinar window Agenda 1 Variant

More information

Lecture 23: Causes and Consequences of Linkage Disequilibrium. November 16, 2012

Lecture 23: Causes and Consequences of Linkage Disequilibrium. November 16, 2012 Lecture 23: Causes and Consequences of Linkage Disequilibrium November 16, 2012 Last Time Signatures of selection based on synonymous and nonsynonymous substitutions Multiple loci and independent segregation

More information

Goal: To use GCTA to estimate h 2 SNP from whole genome sequence data & understand how MAF/LD patterns influence biases

Goal: To use GCTA to estimate h 2 SNP from whole genome sequence data & understand how MAF/LD patterns influence biases GCTA Practical 2 Goal: To use GCTA to estimate h 2 SNP from whole genome sequence data & understand how MAF/LD patterns influence biases GCTA practical: Real genotypes, simulated phenotypes Genotype Data

More information

Biology 445K Winter 2007 DNA Fingerprinting

Biology 445K Winter 2007 DNA Fingerprinting Biology 445K Winter 2007 DNA Fingerprinting For Friday 3/9 lab: in your lab notebook write out (in bullet style NOT paragraph style) the steps for BOTH the check cell DNA prep and the hair follicle DNA

More information

BST227 Introduction to Statistical Genetics. Lecture 3: Introduction to population genetics

BST227 Introduction to Statistical Genetics. Lecture 3: Introduction to population genetics BST227 Introduction to Statistical Genetics Lecture 3: Introduction to population genetics!1 Housekeeping HW1 will be posted on course website tonight 1st lab will be on Wednesday TA office hours have

More information