Variant Calling CHRIS FIELDS MAYO-ILLINOIS COMPUTATIONAL GENOMICS WORKSHOP, JUNE 19, 2017

Similar documents
SNP calling and VCF format

Variant calling in NGS experiments

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl

Next-Generation Sequencing. Technologies

Alignment. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

Variant Detection in Next Generation Sequencing Data. John Osborne Sept 14, 2012

BGGN 213 Genome Informatics Barry Grant

Variant Analysis. CB2-201 Computational Biology and Bioinformatics! February 27, Emidio Capriotti!

Next Generation Sequencing: Data analysis for genetic profiling

Read Mapping and Variant Calling. Johannes Starlinger

Introduction to Next Generation Sequencing (NGS) Andrew Parrish Exeter, 2 nd November 2017

Welcome to the NGS webinar series

Mining GWAS Catalog & 1000 Genomes Dataset. Segun Fatumo

Personal Genomics Platform White Paper Last Updated November 15, Executive Summary

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech

Sanger vs Next-Gen Sequencing

Amapofhumangenomevariationfrom population-scale sequencing

Bioinformatics Advice on Experimental Design

Germline variant calling and joint genotyping

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data

Axiom mydesign Custom Array design guide for human genotyping applications

CNV and variant detection for human genome resequencing data - for biomedical researchers (II)

Targeted resequencing

Course Presentation. Ignacio Medina Presentation

Reference genomes and common file formats

Genome variation - part 1

RareVariantVis 2: R suite for analysis of rare variants in whole genome sequencing data.

Next Generation Sequencing Lecture Saarbrücken, 19. March Sequencing Platforms

Variant detection analysis in the BRCA1/2 genes from Ion torrent PGM data

Data processing and analysis of genetic variation using next-generation DNA sequencing!

Oral Cleft Targeted Sequencing Project

HLA and Next Generation Sequencing it s all about the Data

Next Generation Sequencing. Target Enrichment

LUMPY: A probabilistic framework for structural variant discovery

Fundamentals of Next-Generation Sequencing: Technologies and Applications

UK Biobank Axiom Array

Cancer Genetics Solutions

A pathogenic mutation was identified in the LDLR gene.

Exploring genomic databases: Practical session "

Data Analysis with CASAVA v1.8 and the MiSeq Reporter

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

The Sentieon Genomics Tools A fast and accurate solution to variant calling from next-generation sequence data

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Ion S5 and Ion S5 XL Systems

Targeted Sequencing Reveals Large-Scale Sequence Polymorphism in Maize Candidate Genes for Biomass Production and Composition

The Sentieon Genomic Tools Improved Best Practices Pipelines for Analysis of Germline and Tumor-Normal Samples

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

Identifying recessive gene candidates with GEMINI

I/O Suite, VCF (1000 Genome) and HapMap

De Novo Assembly of High-throughput Short Read Sequences

RNA-Seq Software, Tools, and Workflows

Outline. General principles of clonal sequencing Analysis principles Applications CNV analysis Genome architecture

Introduc)on to NGS Variant Calling

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Number and length distributions of the inferred fosmids.

Genomic DNA ASSEMBLY BY REMAPPING. Course overview

RADSeq Data Analysis. Through STACKS on Galaxy. Yvan Le Bras Anthony Bretaudeau Cyril Monjeaud Gildas Le Corguillé

Whole genome sequencing in drug discovery research: a one fits all solution?

Why can GBS be complicated? Tools for filtering, error correction and imputation.

Identifying copy number alterations and genotype with Control-FREEC

Ensembl Tools. EBI is an Outstation of the European Molecular Biology Laboratory.

L3: Short Read Alignment to a Reference Genome

Jenny Gu, PhD Strategic Business Development Manager, PacBio

Structural variation. Marta Puig Institut de Biotecnologia i Biomedicina Universitat Autònoma de Barcelona

Introductie en Toepassingen van Next-Generation Sequencing in de Klinische Virologie. Sander van Boheemen Medical Microbiology

Target Enrichment Strategies for Next Generation Sequencing

Mapping strategies for sequence reads

USER MANUAL for the use of the human Genome Clinical Annotation Tool (h-gcat) uthors: Klaas J. Wierenga, MD & Zhijie Jiang, P PhD

Targeted Sequencing Using Droplet-Based Microfluidics. Keith Brown Director, Sales

Tangram: a comprehensive toolbox for mobile element insertion detection

Ion S5 and Ion S5 XL Systems

Figure S1. Unrearranged locus. Rearranged locus. Concordant read pairs. Region1. Region2. Cluster of discordant read pairs, bundle

Implementation of Ion AmpliSeq in molecular diagnostics

Introduction to Bioinformatics

Accelerate High Throughput Analysis for Genome Sequencing with GPU

QIAGEN s NGS Solutions for Biomarkers NGS & Bioinformatics team QIAGEN (Suzhou) Translational Medicine Co.,Ltd

Complementary Technologies for Precision Genetic Analysis

RNA-Seq Workshop AChemS Sunil K Sukumaran Monell Chemical Senses Center Philadelphia

Runs of Homozygosity Analysis Tutorial

Genomics AGRY Michael Gribskov Hock 331

Gene Regulation Solutions. Microarrays and Next-Generation Sequencing

Introduction to RNA-Seq

Derrek Paul Hibar

Unravelling the genetic basis of Mayer-Rokitansky- Küster-Hauser syndrome through whole exome sequencing

Next Generation Sequencing: An Overview

Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline Workflows

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Mutations during meiosis and germ line division lead to genetic variation between individuals

Create a Planned Run. Using the Ion AmpliSeq Pharmacogenomics Research Panel Plugin USER BULLETIN. Publication Number MAN Revision A.

De novo Genome Assembly

Answers to additional linkage problems.

Using the Trio Workflow in Partek Genomics Suite v6.6

SVMerge Output File Format Specification Sheet

High Cross-Platform Genotyping Concordance of Axiom High-Density Microarrays and Eureka Low-Density Targeted NGS Assays

Introduction to the UCSC genome browser

Whole Genome Sequencing Services Guide

Biomedical Big Data and Precision Medicine

HHS Public Access Author manuscript Nat Biotechnol. Author manuscript; available in PMC 2012 May 07.

Transcription:

Variant Calling CHRIS FIELDS MAYO-ILLINOIS COMPUTATIONAL GENOMICS WORKSHOP, JUNE 19, 2017

Up-front acknowledgments Many figures/slides come from: GATK Workshop slides: http://www.broadinstitute.org/gatk/guide/events?id=2038 IGV Workshop slides: http://lanyrd.com/2013/vizbi/scdttf/ Denis Bauer (CSIRO): http://www.allpower.de/ Many varied publications

Background Variant calling and use cases Errors vs. actual variants Experimental design (GATK focus) Small variant (SNV/Small Indel) analysis GATK Pipeline Formats encountered within Structural Variation Analysis (SV) Association analysis (briefly)

Variant Calling As the name implies, we re looking for differences (variations) Reference reference genome (hg19, b37) Sample(s) one or more comparative samples Start with raw sequence data End with a human diff file, recording the variants Additional information added downstream: Filters (quality of the calls) Functional annotation

Variations Difference between 2 individuals : 1 every 1000 bp ~ 2.7 million differences Small (<50 bp) SNV single nucleotide Small insertions or deletions ( indels ) Large (structural variations) Indels > 50 bp Copy Number Variations Inversions Translocations

Variations Mainly focus on diploid organisms Human: 22 pairs of autosomal chromosomes One from mother, one from father 2 sex chromosomes (female XX, male XY) One from mother, one from father (where does Y come from for male offspring) Mitochondrial genome (generally maternally inherited) 100-10,000 copies per cell Variation can be in One chromosome (heterozygous, or het ) Both chromosomes (homozygous, or hom )

Use cases Medicine Hereditary or genetic diseases, genetic predisposition to disease Normal vs. tumor analyses Heteroplasmy Population genetics

Population genetics The 1000 Genomes Project The full 1000 Genomes Project data 1,100 samples early 2011; 2,500 samples 2011/12 AJM MXL ASW ACB PEL PUR CLM GBR IBS GWD CEU YRI MAB FIN TSI PJL LWK CDX ADH KAK RDH MRM CHB KHV CHS JPT

Cancer Figure 2 Diagnosis: Multiple leukemic clones present Clinical remission: loss of most leukemic clones Relapse: Acquisition of new mutations in a pre-existing clone HSCs 12.74% 29.04% 5.10% DNMT3A, NPM1, FLT3, PTRPT ETV6, WNK1-WAC, MY018B 53.12% AML1/ UPN933124 Chemotherapy cell type: mutations: normal AML founder (cluster 1) primary specific (cluster 2) relapse enriched (cluster 3) relapse enriched (cluster 4) relapse specific (cluster 5) random mutations in HSCs pathogenic mutations Current Opinion in Genetics & Development Model of the clonal progression process that occurs between the initial (de novo) and relapse presentation in AML patients. At diagnosis, this patient has an oligoclonal disease characterized by four different subclones, each present at a specific proportion in the tumor cell population and with a specific mutational profile. Chemotherapy used to induce the patient into remission decreases clonal heterogeneity but a single subclone persists, acquires new mutations, and again proliferates in the bone marrow as a relapse-specific subclone. E. Mardis, Current Opinion in Genetics & Development 2012, 22:245 250

Heteroplasmy Mixture of more than one type of organelle genome within cell 100-10,000 mtdna copies/cell Predisposing factor for mitochondrial diseases Late (adult) onset Possibly beneficial in some cases Centenarians have above avg freq for heteroplasmy Coble et al, PLOS One, 2009;4(3):e4838

Variants vs. Errors Must distinguish between actual variation (real change) and errors (artifacts) introduced into the analysis Errors can creep in on various levels: PCR artifacts (amplification of errors) Sequencing (errors in base calling) Alignment (misalignment, mis-gapped alignments) Variant calling (low depth of coverage, few samples) Genotyping (poor annotation) Try to control for these when possible to reduce false positives w/o incurring (worse) false negatives

Example: Initial raw sequence data

Sequence quality Different technologies have different errors, error rates 454 homopolymer track errors Illumina substitution errors PacBio indels. Lots and lots of em. Represented as a quality score (Phred) Q = -10log10(e) Phred Quality Score Probability of incorrect base call Base call accuracy 10 1 in 10 90% 20 1 in 100 99% 30 1 in 1000 99.90% 40 1 in 10000 99.99% 50 1 in 100000 ~100.00%

Formats: FASTQ sequence with quality @HWI-ST1155:109:D0L23ACXX:5:1101:2247:1985 1:N:0:GCCAAT NTTCCTTTGACAAATATTAAAATTAAGAATCAAATATGGTAGTGTATGCCAAGACCTAGTCTGAGTCAGTAGGAT + #1=DDFFFHHHHHJJJJJJIJJJIJJIJIJJJJJJJIJI?FHFHEIJEIIIEGFFHHGIGHIJEIFGIJHGDIII Three variants Sanger, Illumina, Solexa (Sanger is most common) May be raw data (straight from seq pipeline) or processed (trimmed for various reasons) Can hold 100 s of millions of records per sample Files can be very large (100 s of GB) apiece

Formats: FASTQ sequence with quality @HWI-ST1155:109:D0L23ACXX:5:1101:2247:1985 1:N:0:GCCAAT NTTCCTTTGACAAATATTAAAATTAAGAATCAAATATGGTAGTGTATGCCAAGACCTAGTCTGAGTCAGTAGGAT + #1=DDFFFHHHHHJJJJJJIJJJIJJIJIJJJJJJJIJI?FHFHEIJEIIIEGFFHHGIGHIJEIFGIJHGDIII Very low Phred score, less than 10

Formats: FASTQ sequence with quality @HWI-ST1155:109:D0L23ACXX:5:1101:2247:1985 1:N:0:GCCAAT NTTCCTTTGACAAATATTAAAATTAAGAATCAAATATGGTAGTGTATGCCAAGACCTAGTCTGAGTCAGTAGGAT + #1=DDFFFHHHHHJJJJJJIJJJIJJIJIJJJJJJJIJI?FHFHEIJEIIIEGFFHHGIGHIJEIFGIJHGDIII Low Phred score, < 20

Formats: FASTQ sequence with quality @HWI-ST1155:109:D0L23ACXX:5:1101:2247:1985 1:N:0:GCCAAT NTTCCTTTGACAAATATTAAAATTAAGAATCAAATATGGTAGTGTATGCCAAGACCTAGTCTGAGTCAGTAGGAT + #1=DDFFFHHHHHJJJJJJIJJJIJJIJIJJJJJJJIJI?FHFHEIJEIIIEGFFHHGIGHIJEIFGIJHGDIII High quality reads, Phred score > 30

How do sequencing errors occur?

Illumina Sequencing

Illumina Sequencing

Illumina Sequencing

Illumina Sequencing

Check sequence data!

Basic Experimental Design

Terminology Lane Physical sequencing lane Library Unit of DNA prep pooled together Sample Single individual Cohort Collection of samples analyzed together lane flowcell

Terminology Sample (single) vs Cohort (multiple) Library Lane Flowcell flowcell Library lane In general, Library = Sample

Terminology WGS vs Exome Capture Whole genome sequencing everything High cost if per sample is deep sequence (>25-30x) Can run multisample low coverage samples Exome capture targeted sequencing (1-5% of genome) Deeper coverage of transcribed regions Miss other important non-coding regions (promoters, introns, enhancers, small RNA, etc)

Single*vs.*mul)6sample*analysis* Fixed*sequencing*budget* Deep%singleGsample% Shallower%mulJGsample% Sample 1" Sample 1" Sample 2" Sample 3" Sample 4" Variants" Found 3 variants total" Higher*sensi)vity*for*variants* in*the*sample* More*accurate*genotyping*per* sample* Cost:*no*informa)on*about* other*samples* Found 5 variants total" Sensi)vity*dependent*on* frequency*of*varia)on* Worse*genotyping* More*total*variants* discovered*

High6pass*sequencing*design* Intergenic" Exon I" Intron I" Variant site" Exon II" Intergenic" ~30x reads" Excellent sensitivity for hetero- and homozygotes" High depth allows excellent genotype calling" Data requirements per sample Variant detection among multiple samples Target bases 3 Gb Variants found per sample ~3-5M Coverage Avg. 30x Percent of variation in genome >99% # sequenced bases 100 gb Pr{singleton discovery} >99% # per lane (HiSeq 4000) ~1 Pr{common allele discovery} >99%

Low6pass*sequencing*design* Intergenic" Exon I" Intron I" Variant site" Exon II" Intergenic" ~4x reads" Heterozygotes can be mistaken for homozygotes due to sampling" Variants missed by sampling" Significantly better power to detect homozygous sites" Data requirements per sample Variant detection among multiple samples Target bases 3 Gb Variants found per sample ~3M Coverage Avg. 4x Percent of variation in genome ~90% # sequenced bases 20 gb Pr{singleton discovery} < 50% # per lane (HiSeq 4000) ~6 Pr{common allele discovery} ~99%

Exome Capture

Exome Capture (TruSeq Exome Capture)

Exome*capture*sequencing*design* Intergenic" Exon I" Intron I" Exon II" Intergenic" Variant site" 150x reads" Little off-target coverage" Data requirements per sample Variant detection among multiple samples Target bases 45 Mb Variants found per sample ~25,000 Coverage >80% 20x Percent of variation in genome 0.005 # sequenced bases 5 Gb Pr{singleton discovery} ~95% # per lane (HiSeq 4000) 24 Pr{common allele discovery} ~95%

General variant calling pipelines Common pattern: Align reads Optimize alignment Call variants Filter called variants Annotate

Pipeline examples Examples: Broad Genome Analysis Toolkit (GATK) samtools mpileup VarScan2 Specialized pipelines Heteroplasmy, tumor sample analyses

GATK Pipeline

Phase I : NGS Data Processing

Phase I NGS Data Processing Alignment of raw reads Duplicate marking Base quality recalibration Local realignment no longer required if you use the HaplotypeCaller

Phase I : Alignment of raw reads Accuracy Sensitivity maps reads accurately, allowing for errors or variation Specificity maps to the correct region Heng Li s aligner assessment

Phase I : Alignment of raw reads Accuracy assessed using simulated data Generally, BWA-MEM or Novoalign are recommended Unique vs. multi-mapped reads Should we retain reads mapping to repetitive regions? May depend on the application Heng Li s aligner assessment

Phase I : Alignment of raw reads Almost 100 of these (2016) : http://wwwdev.ebi.ac.uk/fg/hts_mappers/

Mapping'short'reads'to'a'reference'is'simple'in'principle' Enormous)pile)of)short) reads)from)ngs) Iden;fy'where'the'read'matches' the'reference'sequence'and'record' match'details'as'cigar'string' Mapping'and' alignment' algorithms' Region'1' Region'2' Region'3' Reference) genome) Reads' mapped'to' reference'

But mapping is actually very hard because of mismatches (true mutations or sequencing errors), duplicated regions etc.! Enormous)pile)of)short) reads)from)ngs) Mapping'and' alignment' algorithms' Mapping'algorithms'account'' for'this'by'choosing'the'most'' likely'placement) )mapping)quality)(mq)) Region'1' Region'2A' Region'2B' Reference) genome) For'more'informa;on'see:' Li'and'Homer'(2010).'A'survey'of' sequence'alignment'algorithms'for' nextbgenera;on'sequencing.'' Briefings)in)Bioinforma.cs.) High'MQ' Low'MQ'

Alignment output : SAM/BAM SAM Sequence Alignment/Map format SAM file format stores alignment information Normally converted into BAM (text format is mostly useless for analysis) Specification: http://samtools.sourceforge.net/sam1.pdf Contains FASTQ reads, quality information, meta data, alignment information, etc. Files are typically very large: Many 100 s of GB or more

Alignment output : SAM/BAM BAM BGZF compressed SAM format May be unsorted, or sorted by sequence name or genome coordinates May be accompanied by an index file (.bai) (only if coord-sorted) Makes the alignment information easily accessible to downstream applications (large genome file not necessary) Relatively simple format makes it easy to extract specific features, e.g. genomic locations BAM is the compressed/binary version of SAM and is not human readable. Uses a specialize compression algorithm optimized for indexing and record retrieval (bgzip) Files are typically very large: 1/5 of SAM, but still very large

Alignment output : SAM/BAM Alignment SAM format

Alignment output : SAM/BAM SAM format

SAM format

SAM format

Bit Flags Hex 0x80 0x40 0x20 0x10 0x8 0x4 0x2 0x1 Bit 128 64 32 16 8 4 2 1 r001 1 1 1 1 = 163

SAM format

SAM format

SAM format

CIGAR

SAM format

SAM format

SAM format

SAM format

Alignment output : SAM/BAM Too many to go over!!!

Alignment output : SAM/BAM Tools samtools Picard Mining information from a properly formatted BAM file: Reads in a region (good for RNA-Seq, ChIP-Seq) Quality of alignments Coverage and of course, differences (variants)

Phase I : Duplicate Marking The'reason'why'duplicates'are'bad' ='sequencing'error'propagated'in'duplicates' Reference) genome) Reads' mapped'to' reference' FP)variant)call) (bad)) ATer)marking)duplicates,)the)GATK)will)only)see):) )and)thus)be)more)likely)to)make)the)right)call)

Phase I : Duplicate Marking Duplicates'have'the'same'star;ng'posi;on'' and'the'same'cigar'string' Reference) genome) Reads' mapped'to' reference' Easy)to)bag)&)tag) Hey,)Picard)has)an)app)for)that!)

Phase I : Sorting, Read Groups A'quick'diversion'about'sor;ng'and'read'groups' The)informaKon)for)this:) )is)actually)stored)as)a) text)file)with)one)line)per) read)which)from)far)away) looks)like)this:) )but)the)gatk)wants) reads)to)be)sorted)by) starkng)posikon)like)this:) The)reads)are)in)no)parKcular) order ) So)we)need)to)explicitly)sort)the) SAM)file ) )and)picard)has)an)app)for)that!) And'while'we re'at'it,'let s'add'read)group)informa;on'if'it'isn t'already' there,'so'the)gatk)will)know)what)read)belongs)to)what)sample'(that s' kind'of'important).' Hey,)Picard)has)an)app)for)that)too!)

Terminology Read groups information about the samples and how they were run ID Simple unique identifier Library Sample name Platform sequencing platform Platform unit barcode or identifier Sequencing center (optional) Description (optional) Run date (optional)

Phase I : Sorting, Read Groups Typical'workflow'using'Picard'tools'to'mark'duplicates'et)al.) java)^jar)sortsam.jar)\) )I=input.sam)\) )O=output.bam)\) )SO=coordinate) Original)SAM) sort' Ordered)BAM) Wham,)BAM,)thank)you)Picard!) java)^jar)markduplicates.jar)\) )I=input.sam)\) )O=output.bam) dedup ' Final)BAM) java)^jar)addorreplacereadgroups.jar)\) )I=input.bam)\) )O=output.bam)\) )RGID=id)RGLB=solexa^123)\) )RGPL=illumina)RGPU=AXL2342)\) )RGSM=NA12878)RGCN=bi)\) )RGDT=12/12/2011) add'rg' Dedupped)BAM)

Phase I : Local Realignment Realign around indels Common misalignment problem Can cause problems with base quality recalibration, variant calls

This*is*what*a*realigned*BAM*looks*like* Before* Afer* DePristo, M., Banks, E., Poplin, R. et. al, A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Gen.!

Phase I : Base Quality Score Recalibration Quality scores from sequencers are biased and somewhat inaccurate Quality scores are critical for all downstream analysis Biases are a major contributor to bad variant calls Caveat: In practice, generally requires having a known set of variants (dbsnp)

Phase II : Variant Discovery/Genotyping

Phase II : Variant Calling This is where we actually call the variants Prior steps leading up to this help remove potential causes of variant calling errors I ll be covering use of the UnifiedGenotyper variant calling tool in GATK but you should keep an eye on the newest tool, HaplotypeCaller

Phase II : Variant Calling In general, uses a probabilistic method, e.g. Bayesian model Determine the possible SNP and indel alleles Only good bases are included: Those satisfying minimum base quality, mapping read quality, pair mapping quality, etc. Compute, for each sample, for each genotype, likelihoods of data given genotypes Compute the allele frequency distribution to determine most likely allele count; emit a variant call if determined If we are going to emit a variant, assign a genotype to each sample

Hom REF: 0% ALT: 100% Het REF: 50% ALT: 50%?? REF: 77% ALT: 23%

Variant calling output: VCF VCF (Variant Call Format) Like SAM/BAM, also has a versioned specification From the 1000 Genomes Project http://www.1000genomes.org/wiki/analysis/variant%20call%20format/vcf-variantcall-format-version-41

Formats: VCF ##fileformat=vcfv4.1 ##filedate=20090805 ##source=myimputationprogramv3.1 ##reference=file:///seq/references/1000genomespilot-ncbi36.fasta ##contig=<id=20,length=62435964,assembly=b36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0 0:48:1:51,51 1 0:48:8:51,51 1/1 20 17330. T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0 0:49:3:58,50 0 1:3:5:65,3 0/0 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1 2:21:6:23,27 2 1:2:0:18,2 2/2 20 1230237. T. 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0 0:54:7:56,60 0 0:48:4:51,51 0/0 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1

Formats: VCF ##fileformat=vcfv4.1 ##filedate=20090805 ##source=myimputationprogramv3.1 ##reference=file:///seq/references/1000genomespilot-ncbi36.fasta ##contig=<id=20,length=62435964,assembly=b36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0 0:48:1:51,51 1 0:48:8:51,51 1/1 20 17330. T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0 0:49:3:58,50 0 1:3:5:65,3 0/0 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1 2:21:6:23,27 2 1:2:0:18,2 2/2 20 1230237. T. 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0 0:54:7:56,60 0 0:48:4:51,51 0/0 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1

Formats: VCF ##fileformat=vcfv4.1 ##filedate=20090805 ##source=myimputationprogramv3.1 ##reference=file:///seq/references/1000genomespilot-ncbi36.fasta ##contig=<id=20,length=62435964,assembly=b36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0 0:48:1:51,51 1 0:48:8:51,51 1/1 20 17330. T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0 0:49:3:58,50 0 1:3:5:65,3 0/0 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1 2:21:6:23,27 2 1:2:0:18,2 2/2 20 1230237. T. 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0 0:54:7:56,60 0 0:48:4:51,51 0/0 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1

Formats: VCF ##fileformat=vcfv4.1 ##filedate=20090805 ##source=myimputationprogramv3.1 ##reference=file:///seq/references/1000genomespilot-ncbi36.fasta ##contig=<id=20,length=62435964,assembly=b36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0 0:48:1:51,51 1 0:48:8:51,51 1/1 20 17330. T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0 0:49:3:58,50 0 1:3:5:65,3 0/0 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1 2:21:6:23,27 2 1:2:0:18,2 2/2 20 1230237. T. 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0 0:54:7:56,60 0 0:48:4:51,51 0/0 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1

Formats: VCF ##fileformat=vcfv4.1 ##filedate=20090805 ##source=myimputationprogramv3.1 ##reference=file:///seq/references/1000genomespilot-ncbi36.fasta ##contig=<id=20,length=62435964,assembly=b36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0 0:48:1:51,51 1 0:48:8:51,51 1/1 20 17330. T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0 0:49:3:58,50 0 1:3:5:65,3 0/0 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1 2:21:6:23,27 2 1:2:0:18,2 2/2 20 1230237. T. 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0 0:54:7:56,60 0 0:48:4:51,51 0/0 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1

Formats: VCF ##fileformat=vcfv4.1 ##filedate=20090805 ##source=myimputationprogramv3.1 ##reference=file:///seq/references/1000genomespilot-ncbi36.fasta ##contig=<id=20,length=62435964,assembly=b36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0 0:48:1:51,51 1 0:48:8:51,51 1/1 20 17330. T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0 0:49:3:58,50 0 1:3:5:65,3 0/0 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1 2:21:6:23,27 2 1:2:0:18,2 2/2 20 1230237. T. 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0 0:54:7:56,60 0 0:48:4:51,51 0/0 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1

Formats: VCF ##fileformat=vcfv4.1 ##filedate=20090805 ##source=myimputationprogramv3.1 ##reference=file:///seq/references/1000genomespilot-ncbi36.fasta ##contig=<id=20,length=62435964,assembly=b36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0 0:48:1:51,51 1 0:48:8:51,51 1/1 20 17330. T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0 0:49:3:58,50 0 1:3:5:65,3 0/0 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1 2:21:6:23,27 2 1:2:0:18,2 2/2 20 1230237. T. 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0 0:54:7:56,60 0 0:48:4:51,51 0/0 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1

Formats: VCF ##fileformat=vcfv4.1 ##filedate=20090805 ##source=myimputationprogramv3.1 ##reference=file:///seq/references/1000genomespilot-ncbi36.fasta ##contig=<id=20,length=62435964,assembly=b36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0 0:48:1:51,51 1 0:48:8:51,51 1/1 20 17330. T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0 0:49:3:58,50 0 1:3:5:65,3 0/0 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1 2:21:6:23,27 2 1:2:0:18,2 2/2 20 1230237. T. 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0 0:54:7:56,60 0 0:48:4:51,51 0/0 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1

Formats: VCF ##fileformat=vcfv4.1 ##filedate=20090805 ##source=myimputationprogramv3.1 ##reference=file:///seq/references/1000genomespilot-ncbi36.fasta ##contig=<id=20,length=62435964,assembly=b36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> Samples ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0 0:48:1:51,51 1 0:48:8:51,51 1/1 20 17330. T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0 0:49:3:58,50 0 1:3:5:65,3 0/0 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1 2:21:6:23,27 2 1:2:0:18,2 2/2 20 1230237. T. 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0 0:54:7:56,60 0 0:48:4:51,51 0/0 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1

VCF Col Field Description 1 CHROM Chromosome name 1-based position. For an indel, this is the position preceding the 2 POS indel. 3 ID Variant identifier. Usually the dbsnp rsid. Reference sequence at POS involved in the variant. For a SNP, it is a 4 REF single base. 5 ALT Comma delimited list of alternative seuqence(s). Phred-scaled probability of all samples being homozygous 6 QUAL reference. 7 FILTER Semicolon delimited list of filters that the variant fails to pass. 8 INFO Semicolon delimited list of variant information. Colon delimited list of the format of individual genotypes in the 9 FORMAT following fields. 10+ Sample(s) Individual genotype information defined by FORMAT.

Phase II : Filtering Two basic methods: Hard filtering Variant quality score recalibration (VQSR)

Strand Bias Phase II : Hard Filtering Reducing false positives by e.g. requiring Sufficient Depth Variant to be in >30% reads High quality Strand balance Etc etc etc Very high dimensional search space so, very subjective!

Strand Bias Phase II : Hard Filtering The effect of strand bias in Illumina short-read sequencing data Guo et al., BMC Genomics 2012, 13:666 Last sentence: Indiscriminant use of strand bias as a filter will result in a large loss of true positive SNPs

Phase II : Variant Quality Score Recalibration (VQSR) Considered GATK best practice Train on trusted variants Require the new variants to live in the same hyperspace Potential problems: Over-fitting Biasing to features of known SNPs

Phase III : Integrative Analysis

Phase III : Functional Annotation Are these mutations in important regions? Genes? UTR? Are they changing the coding sequence? Would these changes have an affect? Tools: SnpEff/SnpSift Annovar

The end of the (pipe)line

Follow-up Quality Control Transition/Transversion ratio (T i /T v ) Condition Expected T i /T v random 0.5 whole genome 2.0-2.1 exome 3.0-3.5 bcftools can help here Concordance with known variants: dbsnp, HapMap, 1000genomes Others?

Calling variants on cohorts of samples When running HaplotypeCaller, can use a specific output type called GVCF (Genotype VCF) Contains genotype likelihood and annotation for each site in genome Perform joint genotyping calls on cohort Can rerun as needed if more samples added to cohort Used for ExAC cohort (92K exomes)

Calling variants on cohorts of samples

Structural Variation Tools like GATK, samtools can t currently detect larger structural changes easily Alkan et al, Nature Genetics 12:363, 2011

Structural Variation With the latest releases of GATK (v3.7 or higher), this is changing: https://software.broadinstitute.org/gatk/best-practices/

Structural Variation Detection using NGS data generally requires multi-layer analyses, may focus on specific SV types Common tools: CNVnator gross detection of CNVs BreakDancer, Cortex breakpoint detection Pindel large deletions Hydra-SV read pair discordance Recent tools (lumpy-sv, GASVPRO) integrate approaches

Structural Variation Strategies Read pair discordance Insert size is off, orientation of reads is wrong Read depth Region deviates from expected read depth Split reads Single read is split, parts align in two distinct unique locations Assembly Reference-based local assemblies indicate inconsistencies

Structural Variation Analyses Ref Ref

Structural Variation Analyses

Structural Variation Still an active area of research Problems: Lots of false positives Hard to compare methodologies

Association Studies Genome-wide association studies (GWAS) Trying to determine whether specific variant(s) in many individuals can be associated with a trait Ex: comparison of groups of people with a disease (cases) and without (controls)

Finding the causal variant in ideal situations* Spot the variant that is common amongst all affected but absent in all unaffected This variant is in a gene with known function and causes the protein to be disrupted * e.g. some rare autosomal disease

In reality You can t spot the difference You deal with ~3.5 million SNPs You need to employ methods that systematically identify variants that stand out: GWAS GWAS taught us that it is unlikely to find a causal common variant for complex diseases Rare Variant? A bunch of rare and common variants? An even more complex model? 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010 Oct 28;467(7319):1061-73. PubMed PMID: 20981092

GWAS Criticisms of initial GWAS studies Poor controls Severely underpowered Tons of false-positives Not worth the cost Led to some retractions Wikipedia article outlines the limitations very well http://en.wikipedia.org/wiki/genomewide_association_study

GWAS Most of these use genotyping arrays Use of NGS may help address some of the past limitations

GATK v4 Out now in pre-release ~5x faster with HaplotypeCaller Completely open-source, but has specific software requirements https://software.broadinstitute.org/gatk/download/alpha