1000 Genomes project: from mapping reads to de novo mutations

1 1000 Genomes project: from mapping reads to de novo mutations
Mark A. DePristo
Manager, Genome Sequencing and Analysis Group
Medical and Population Genetics Program
Broad Institute of Harvard and MIT
December 3, 2009

2 Acknowledgments
Quality score recalibration, local realignment, variation discovery, and de novo mutations:
- Anthony Philippakis, Andrew Kernytsky, Matt Hanna, Eric Banks, Andrey Sivachenko, Jared Maguire, Kiran Garimella, Manny Rivas, Michael Melgar
- Matt Hurles and Philip Awadalla at the Sanger
Other contributors:
- The entire genome sequencing and analysis group
- Especially the GSA software engineering team: Matt Hanna and Aaron McKenna
- MPG directorship: Stacey Gabriel, David Altshuler, Mark Daly
- Carrie Sougnez, production teams and folks at 320 and 7CC
- The SAM/BAM working group: Bob Handsaker, Tim Fennell, Heng Li, and Richard Durbin
- The cancer genome analysis group: Gad Getz, Kristian Cibulskis, Andrey Sivachenko
- The IGV team: Jim Robinson and Helga Thorvaldsdottir
- Production informatics: Tim Fennell and Alec Wysoker
- The 1000 genomes project

3 Agenda
- Introduction to the 1000 genomes project
- Mapping and alignment
- SAM/BAM format
- Visualizing the data
- The Genome Analysis Toolkit
- The infrastructure supporting our tools for working with next generation sequencing data
- Tools developed in the GATK for calling SNPs and indels in the 1000 genomes pilot

4 The 1000 genomes project is characterizing common genetic variation with MAF >1% in three populations
Pilot 1: ~150 individuals, whole genome sequenced to 4x depth
- Data production and analysis
- ~ 17M SNPs
- ~ 2 10M short indels
- Applies a multi-sample generalization of the single sample approach in pilot 2
- Method not discussed in detail
Pilot 2: Two children and their parents, whole genome sequenced to ~70x
- Data production and analysis
- ~ 3 5M SNPs
- ~ K short indels
Pilot 3: 1000 genes in ~400 individuals to ~50x depth
- Data production and analysis
- ~ 10K SNPs
- ~ 1000 short indels
- Applies the same SNP and indel calling methods as Pilot 2
- Method not discussed in detail

5 Data for the project comes from many centers and several technologies
[Figure: sequencing centers and technologies, marked as "Added for production phase" or "For pilot phase only"]
Slide courtesy of Carrie Sougnez

6 The pilot phase alone has generated ~5 Tb of sequence
[Table: columns Pilot 1, Pilot 2, Pilot 3, Total; rows Number of Samples, Illumina, SOLiD, Total]
Slide courtesy of Carrie Sougnez

7 Agenda
- Introduction to the 1000 genomes project
- Mapping and alignment
- SAM/BAM format
- Visualizing the data
- The Genome Analysis Toolkit
- The infrastructure supporting our tools for working with next generation sequencing data
- Tools developed in the GATK for calling SNPs and indels in the 1000 genomes pilot

8 From unmapped reads to true genetic variation in next generation sequencing data
- Raw short reads (Solexa, SOLiD, 454): a single run of a sequencer generates ~50M ~75bp short reads for analysis
- Mapping and alignment: the origin of each read from the human genome sequence is found
- Quality calibration and annotation: the quality of each read is calibrated and additional information annotated for downstream analyses
- Identifying genetic variation: SNPs and indels from the reference are found where the reads collectively provide evidence of a variant

9 Finding the true origin of each read is a computationally demanding and important first step
[Figure: an enormous pile of short reads from NGS passes through a mapping and alignment algorithm, is placed on regions 1 to 3 of the reference genome, and is written to SAM/BAM files]
The mapping and alignment algorithm:
- Detects correct read origin and flags them with high certainty
- Detects ambiguity in the origin of reads and flags them as uncertain
Aligners by technology:
- Solexa: MAQ. Robust, accurate gold standard aligner for NGS. Developed by Li and Durbin. Soon to be replaced by BWA, also by Li and Durbin
- 454: SSAHA. Hash based aligner with high sensitivity and specificity with longer reads
- SOLiD: Corona. ABI designed tool for aligning in color space

10 The SAM file format
- Data sharing was a major issue with the 1000 genomes
- Each center, technology and analysis tool used its own idiosyncratic file formats; no one could exchange data
- The Sequence Alignment and Mapping (SAM) file format was designed to capture all of the critical information about NGS data in a single indexed and compressed file
- Becoming a standard and is now used by production informatics, MPG, and cancer analysis groups at the Broad
- Has enabled sharing of data across centers and the development of tools that work across platforms
- More info at http://samtools.sourceforge.net/

11 What does the data actually look like?
[Figure: IGV screenshot at chr5:112mb with 454, SLX and SOLiD tracks, showing coverage, non reference bases and individual reads]
- This is a screenshot of IGV
- All the 1000 genomes data can be viewed easily with IGV http://

12 Agenda
- Introduction to the 1000 genomes project
- Mapping and alignment
- SAM/BAM format
- Visualizing the data
- The Genome Analysis Toolkit
- The infrastructure supporting our tools for working with next generation sequencing data
- Tools developed in the GATK for calling SNPs and indels in the 1000 genomes pilot

13 The GATK is a structured programming framework that aims to simplify writing analysis tools for resequencing data
- The framework is designed to support most common paradigms of analysis algorithms
- Provides structured access to reads in SAM format, reference context, as well as reference associated meta data
- General purpose: optimized for ease of use and completeness of functionality within scope
- Efficient: engineering investment on performance of critical data structures and manipulation routines
- Convenient: structured plug in model makes developing against the framework relatively pain-free

14 The functional programming paradigm
The GATK follows a common functional programming paradigm called map and reduce

    reduce( g, map( f, list ), init )      ## python

    Object result = init;                  // java
    for ( Object x : list )
        result = g( result, f(x) );

    (reduce g (map f list))                ;; scheme
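The Java fragment on this slide can be made self-contained with generics and lambdas. The following is a minimal sketch, not GATK code: the class name MapReduceSketch and the method mapReduce are hypothetical, and the toy data (read strings mapped to their lengths) is only an illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

// Illustrative only: MapReduceSketch and mapReduce are hypothetical names, not GATK classes.
final class MapReduceSketch {

    // Equivalent of reduce(g, map(f, list), init): apply f to each element,
    // then fold each mapped value into the running result with g.
    static <T, M, R> R mapReduce(List<T> list, Function<T, M> f, BiFunction<R, M, R> g, R init) {
        R result = init;
        for (T x : list) {
            result = g.apply(result, f.apply(x));
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy example: map each read to its length, reduce by summing.
        List<String> reads = Arrays.asList("ACGT", "GG", "TTAGC");
        int totalBases = mapReduce(reads, String::length, Integer::sum, 0);
        System.out.println("total bases: " + totalBases);
    }
}
```

Because f is applied to each element independently, the map step could be run in any order or in parallel; only the reduce step combines results.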

15 The map / reduce framework
- Map: X = f(x). Function f is applied to each element of the list (data elements a, b, c, d, e give A, B, C, D, E). Operations are independent of each other
- Reduce: R = r(A, R(B, ..., E)). Function r is recursively reduced over each f(x). The result depends on all sites

16 Many algorithms fit within the Map/Reduce framework
- Idea behind Map/Reduce is to provide structured traversal and access to data
- Separate problems of accessing data from calculations on the elements in the data
- Developers can provide powerful, intelligent, efficient traversal engines that implement the map operation
- Analysts can easily write functions to analyze their data, and then map them across the data
- Google popularized map/reduce; see Dean and Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation
- Becoming so popular there was a New York Times article about it on Tuesday, March 17th, 2009!

17 Map/Reduce over the genome
Fundamental data:
- Reference: reference genome in fasta format
- Reads: SAM format reads, maybe aligned. Some traversal types may require reads to be aligned (by locus, for example)
- Reference metadata: data associated with positions on the reference genome, e.g., dbsnp, exons

18 Map/Reduce by read
[Figure: reference metadata (dbsnp, exons), reference genome, and reads, maybe aligned]
f (single read, covered reference seq, covered metadata)
Evaluated over each read, with reduce accumulating results at every read

19 Map/Reduce by loci
[Figure: reference metadata (dbsnp, exons), reference genome with loci i, j, k, l, m, and reads, maybe aligned]
f (all reads covering the locus, indices into reads yielding equivalent positions, covered reference seq, covered metadata)
Evaluated over each locus in the genome, with reduce accumulating results at every locus

20 The Genome Analysis Toolkit (GATK) enables rapid development of efficient and robust analysis tools
[Figure: Genome Analysis Toolkit (GATK) infrastructure. The traversal engine is provided by the framework; the analysis tool is implemented by the user]
Pipeline: Initial alignment, MSA realignment, Q score recalibration, Single sample genotyping, SNP filtering
- Supports any BAM-compatible aligner
- All of these tools have been developed in the GATK
- They are memory and CPU efficient, cluster friendly and are easily parallelized
- They are now publicly available and are being used at many sites around the world
More info: http://

21 The GATK engine already supports many advanced features

22 Pileup with dbsnp
Code: org/broadinstitute/sting/gatk/walkers/pileup.java (package, imports, etc. removed for presentation)

    public class DepthOfCoverageWalker extends LociWalker<Integer, Integer> {
        public Integer map(List<ReferenceOrderedDatum> rodData, char ref, LocusContext context) {
            // Build bases and quals strings
            String bases = "";
            String quals = "";
            for ( int i = 0; i < context.getReads().size(); i++ ) {
                SAMRecord read = context.getReads().get(i);
                int offset = context.getOffsets().get(i);
                bases += read.getReadString().charAt(offset);
                quals += read.getBaseQualityString().charAt(offset);
            }

            // Build the dbsnp string
            String rodString = "";
            for ( ReferenceOrderedDatum datum : rodData ) {
                if ( datum != null && datum instanceof rodDbSNP ) {
                    rodDbSNP dbsnp = (rodDbSNP)datum;
                    rodString = "[ROD: " + dbsnp.toMediumString() + "]";
                }
            }

            System.out.printf("%s: %s %s %s %s%n", context.getLocation(), ref, bases, quals, rodString);
            return 1;
        }
    }

23 Pileup with dbsnp II
CPU time: 10 secs. Max. memory: 1 GB

Command (-T: analysis name, -I: reads, -DBSNP: dbsnp track):

    java -jar dist/genomeanalysistk.jar -T Pileup
        -I /broad/1kg/legacy_data/tcga-freeze3/tcga-freeze3-normal.bam
        -R /seq/references/homo_sapiens_assembly18/v0/homo_sapiens_assembly18.fasta
        -L chr1:559, ,848
        -DBSNP /humgen/gsa-scr1/gatk_data/dbsnp_129_hg18.rod

Output:

    Sort order is: coordinate
    chr1:559844: C CCCCCCCCTGGCTCCCCCCCCCAGCCCTCCCCCCCACCCCCCCACCCCCCCCCCCCCCC 4;6@@2;?&'(8(-00=??6@31)@)<).@?6? 3/18?(=833.;(<?:@?9?>*95)>
    chr1:559845: A AAAGACAAAAAAAAGAAAAAAAAAAAAAACAAAAAAAATAAAAAAAAAAAAAAA,>?&*(5(((8(??)@(>4@2<, 1>=9;8)30<)463((=,4?;??9>>*:5.>
    chr1:559846: G AGAACAAAGAAAAAAACGAAAAGGCTAAGTAAAAAACGGGGGGGGGGGGG *&((5,((@?)@(5)?1;,.><>:.)50<#7/),(=/ 9?:<>8>=3/1(> [ROD: chr1:559846-559847:rs2096047:a/g:snp:hapmap:2hit]
    chr1:559847: A AAAAAAAAAAAAAAAAAAAAAAAAACAAATAAAAAAAAAAAAAA 4:=@?)?(30@);).>>>:81>8<0#>09*>,4?>@>6>=7(3>
    chr1:559848: A AAAAAAAAAAAAAAACAAAGAAATAAAAAAAAAAACAAAA )@()0@)=).9>1:7)>-<#4>)(>=/1??<>6>=659)>
    [PROGRESS] Traversed 81 loci in 9.98 secs ( secs per 1M loci)
    Traversal reduce result is 5

Ref chr1: is a heterozygous A/G site, consistent with hapmap

24 Tree reduce parallelism framework
[Figure: threads 1 to 4 each run single thread work units of MAP and REDUCE steps; the per-thread results are then combined by a tree reduce]

25 Automatic parallelization in the GATK
[Figure: execution time (walk time, s) versus number of parallel tasks, for SMP on a single machine, distributed processing with 1 thread per node, and distributed processing with 4 threads per node]
Single sample genotyper on chr20, 30x SLX reads for NA12878 (1000 genomes)

26 Getting and using the GATK
- Visit our wiki http://
- Has developer documents describing how to build the system and read the hello reads tutorial
- Download binary Jar as well as publicly available tools
- Check out source from SVN repository: https://svnrepos.broadinstitute.org/sting/

27 Core GATK development team
- Mark DePristo (depristo@broad)
- Matthew Hanna
- Aaron McKenna
We are looking for feedback, bug reports, feature requests, brainstorming sessions, etc. to make the system as powerful and easy to use as possible
Please understand that the system is in active development; it's usable, but interfaces, functionality, etc., are continuously changing and improving

28 Agenda
- Introduction to the 1000 genomes project
- Mapping and alignment
- SAM/BAM format
- Visualizing the data
- The Genome Analysis Toolkit
- The infrastructure supporting our tools for working with next generation sequencing data
- Tools developed in the GATK for calling SNPs and indels in the 1000 genomes pilot

29 Multiple sequence realignment
Read by read mapping introduces artifacts that can only be resolved by examining multiple reads within their local context

Inconsistent indels
           Initial alignment    MSA realignment
    Ref:   AAGCGTCGAT           AAGCGTCGAT
    Read1: AAG---CGAT           AAG---CGAT
    Read2:      GCGAT             G---CGAT

Cryptic indels
           Initial alignment    MSA realignment
    Ref:   AAGCGTCGAT           AAGCGTCGAT
    Read1: AAGCGAT              AAG---CGAT
    Read2:      GCGAT             G---CGAT

(Bases mismatching the reference are shown in red on the slide)

30 Local realignment identifies the most parsimonious alignment along all of the reads at a problematic locus
1. Find the best alternate consensus sequence that, together with the reference, best fits the reads in a pile (maximum of 1 indel)
2. The score for an alternate consensus is the total sum of the quality scores of mismatching bases
3. If the score of the best alternate consensus is sufficiently better than the original alignments (using a LOD score), then we accept the proposed realignment of the reads
[Figure: Ref AAGCGTCG. One read pile is consistent with the reference sequence (three adjacent SNPs, AAGCGTCG); the other is consistent with a 3bp insertion (AAG---CG). Realigning determines which is better]
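Step 2 above can be sketched in a few lines. This is a hedged illustration, not the GATK realigner: the class ConsensusScoreSketch, its method name, the ungapped placement of each read at a fixed start offset, and the uniform Q30 qualities are all assumptions made for the example. The sequences are the ones from the cryptic indel slide, with AAGCGAT as the alternate consensus obtained by removing 3 bases from the reference.

```java
// Illustrative only: names and the ungapped-placement model are assumptions, not GATK code.
final class ConsensusScoreSketch {

    // Sum of base qualities at positions where the read disagrees with the consensus.
    // 'start' is the 0-based offset of the read's first base within the consensus.
    static int mismatchQualitySum(String consensus, String readBases, int[] readQuals, int start) {
        int sum = 0;
        for (int i = 0; i < readBases.length(); i++) {
            int pos = start + i;
            if (pos >= consensus.length()) {
                break; // the read runs past the end of the consensus window
            }
            if (readBases.charAt(i) != consensus.charAt(pos)) {
                sum += readQuals[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        String ref = "AAGCGTCGAT";
        String alt = "AAGCGAT"; // reference with the 3 bases CGT removed
        int[] q7 = {30, 30, 30, 30, 30, 30, 30};
        int[] q5 = {30, 30, 30, 30, 30};

        // Score of the original (ungapped) alignments against the reference.
        int refScore = mismatchQualitySum(ref, "AAGCGAT", q7, 0)
                     + mismatchQualitySum(ref, "GCGAT", q5, 5);
        // Score of the same reads against the alternate consensus.
        int altScore = mismatchQualitySum(alt, "AAGCGAT", q7, 0)
                     + mismatchQualitySum(alt, "GCGAT", q5, 2);

        System.out.println("reference score: " + refScore + ", alternate score: " + altScore);
    }
}
```

A lower score means fewer high quality mismatches; how much lower the alternate must be before realignment is accepted is the LOD criterion in step 3, whose threshold the slide does not state.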

31 Local realignment uncovers the hidden indel in these reads and eliminates all the potential FP SNPs
[Figure: the same reads before and after local realignment]
Local realignment enabled us to find ~90% of short indels with ~70% specificity in a blind simulation assessment

32 Modeling the error process
- An accurate error model is essential for reliable downstream analyses such as SNP calling: Pr{ observing base b | true genotype is G }
- What is the probability that b (e.g., A) is actually some other base (e.g., either C, G, or T)? This probability is encoded by the phred scaled quality score
- The quality scores reported by the Solexa, SOLiD, and 454 base callers are inaccurate
- To correct them, we examine the aligned reads and use the reference mismatch rate at non dbsnp sites to recalibrate the reported quality scores
- We can also account for covariates of base errors, such as local sequence context and machine cycle, to identify subsets of higher quality bases
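The phred scale and the idea of an empirical quality can be written down directly. The sketch below assumes the standard phred definition Q = -10 log10(p); the class PhredSketch, its method names, and the +1 pseudocounts (used here only to avoid taking the log of zero) are assumptions of this example, not the GATK recalibrator.

```java
// Illustrative only: PhredSketch and its +1 pseudocounts are assumptions, not GATK code.
final class PhredSketch {

    // Error probability encoded by a phred scaled quality score: p = 10^(-Q/10).
    static double errorProbability(double q) {
        return Math.pow(10.0, -q / 10.0);
    }

    // Phred scaled quality for an error probability: Q = -10 * log10(p).
    static double phred(double p) {
        return -10.0 * Math.log10(p);
    }

    // Empirical quality for a bin of bases (for example, all bases reported as Q30),
    // from the number of reference mismatches observed at non dbsnp sites.
    static double empiricalQuality(long mismatches, long observations) {
        return phred((mismatches + 1.0) / (observations + 1.0));
    }

    public static void main(String[] args) {
        // Hypothetical bin: bases reported as Q30 that mismatch the reference 1% of the time.
        double empirical = empiricalQuality(10_000, 1_000_000);
        System.out.printf("reported Q30, empirical Q%.1f%n", empirical);
    }
}
```

Covariates such as dinucleotide context or machine cycle would be handled by keeping one such bin per covariate combination rather than one per reported score.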

33 Recalibration makes quality scores more accurate
[Figure: 1000 genomes 454 lane. Empirical Q score versus reported Q score (Q0 to Q40), initial and recalibrated panels. The recalibrated scores are a better fit and more informative]

34 Recalibration removes some error covariates
[Figure: 1000 genomes 454 lane. Difference between reported and empirical Q score (empirical minus reported quality, -10 to +10) by dinucleotide context (AA, AG, CA, CG, GA, GG, TA, TG), initial and recalibrated panels. Covariates are corrected after recalibration]

35 Recalibration identifies high quality bases and improves SNP calls
1KG 454 lane

                                            Initial   Recalibration
    No. bases in lanes                      80M       80M
    Lane wide reported Q
    Lane wide empirical Q
    RMSE between Q reported and empirical   17,554    9,635
    % of true Q25 bases                     89%       95%
    % of true Q30 bases                     0%        53%

- Identifies >50% bases as true Q30
- Results in ~10% more SNP calls at same quality compared to unrecalibrated data

36 Bayesian SNP Caller for Pilot 2
Bayesian model: L(G|D) = P(G) P(D|G)
- L(G|D): likelihood for the genotype
- P(G): prior for the genotype
- P(D|G): likelihood of the data given the genotype
Prior genotype probabilities enforce variant expectation rates
Likelihood of data computed using pileup of bases and associated quality scores at given locus
L(G|D) computed for all 10 genotypes
Confidence in call given by lod = log10( L(G_best|D) / L(G_ref|D) ); a threshold of T = 5 is common
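The model above can be sketched end to end for a single pileup. The slide does not spell out P(D|G) or P(G), so everything specific here is an assumption for illustration: each observed base is taken to come from either allele with probability 1/2, miscalls are spread evenly over the three other bases, and the priors are rough, unnormalized values built from a heterozygosity of 1e-3. The class GenotypeLodSketch and its methods are hypothetical names, not the GATK genotyper.

```java
// Illustrative only: the likelihood model, priors and names are assumptions, not GATK code.
final class GenotypeLodSketch {
    static final char[] BASES = {'A', 'C', 'G', 'T'};
    static final double HET = 1e-3; // assumed heterozygosity: 1 variant per 1000 bases

    // P(observed base | true allele) for phred quality 'qual', errors split over 3 bases.
    static double baseGivenAllele(char observed, char allele, int qual) {
        double e = Math.pow(10.0, -qual / 10.0);
        return observed == allele ? 1.0 - e : e / 3.0;
    }

    // log10 P(D | G) for the diploid genotype {a1, a2} over the whole pileup.
    static double log10Likelihood(String bases, int[] quals, char a1, char a2) {
        double sum = 0.0;
        for (int i = 0; i < bases.length(); i++) {
            char b = bases.charAt(i);
            sum += Math.log10(0.5 * baseGivenAllele(b, a1, quals[i])
                            + 0.5 * baseGivenAllele(b, a2, quals[i]));
        }
        return sum;
    }

    // log10 P(G): rough, unnormalized priors (assumed values).
    static double log10Prior(char a1, char a2, char ref) {
        if (a1 == ref && a2 == ref) return Math.log10(1.0 - 1.5 * HET); // hom ref
        if (a1 == ref || a2 == ref) return Math.log10(HET / 3.0);       // het with ref allele
        if (a1 == a2) return Math.log10(HET / 6.0);                     // hom var
        return Math.log10(HET * HET);                                   // het, both non ref
    }

    // log10 L(G | D) = log10 P(G) + log10 P(D | G)
    static double score(String bases, int[] quals, char a1, char a2, char ref) {
        return log10Prior(a1, a2, ref) + log10Likelihood(bases, quals, a1, a2);
    }

    // Best of the 10 unordered diploid genotypes.
    static String bestGenotype(String bases, int[] quals, char ref) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < BASES.length; i++) {
            for (int j = i; j < BASES.length; j++) {
                double s = score(bases, quals, BASES[i], BASES[j], ref);
                if (s > bestScore) {
                    bestScore = s;
                    best = "" + BASES[i] + BASES[j];
                }
            }
        }
        return best;
    }

    // lod = log10( L(G_best | D) / L(G_ref | D) )
    static double lod(String bases, int[] quals, char ref) {
        String g = bestGenotype(bases, quals, ref);
        return score(bases, quals, g.charAt(0), g.charAt(1), ref)
             - score(bases, quals, ref, ref, ref);
    }

    public static void main(String[] args) {
        String pileup = "AGAGAGAGAGAGAGAGAGAG"; // 10 A and 10 G over a reference A
        int[] quals = new int[pileup.length()];
        java.util.Arrays.fill(quals, 30);
        System.out.println(bestGenotype(pileup, quals, 'A') + " lod=" + lod(pileup, quals, 'A'));
    }
}
```

With these assumed priors, a site is only called variant when the data outweigh the roughly 3 orders of magnitude prior preference for homozygous reference, which is how the prior enforces the expected variant rate.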

37 Filtering poor SNP calls in pilot 2
We use a battery of expectation tests to separate likely FP SNPs from our SNP calls
This is possible because erroneous SNP calls often result from recurring systematic errors
We flag a SNP as a likely FP if it exhibits unusual behavior according to:
- In excessive depth of coverage
- Occurs preferentially on a single strand
- Has a skewed allelic imbalance
- In a region of poor read mapping
- Occurs in very close proximity to other SNPs

38 Evaluating SNP call quality
Did I get the right number of calls?
- The number of SNP calls should be close to the average human heterozygosity of 1 variant per 1000 bases
- Only detects gross under/over calling
Concordance with hapmap chip results?
- Often we have genotype chip data that indicates the hom ref, het, hom var status at millions of sites
- Good SNP calls should be >99.5% consistent with these chip results, and >99% of the variable sites should be found
- The chip sites are in the better parts of the genome, and so are not representative of the difficulties at novel sites
What fraction of my calls are already known?
- dbsnp catalogs most common variation, so most of the true variants found will be in dbsnp
- For single sample calls, ~90% of variants should be in dbsnp
- Need to adjust expectation when considering calls across samples
Reasonable transition to transversion ratio (Ti/Tv)?
- Transitions are twice as frequent as transversions (see Ebersberger, 2002)
- Validated human SNP data suggests that the Ti/Tv should be ~2.1 genome wide and ~2.8 in exons
- FP SNPs should have Ti/Tv around 0.5
- Ti/Tv is a good metric for assessing SNP call quality
[Figure: transitions and transversions among A, C, G, T]
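The Ti/Tv metric is simple to compute, and the 0.5 expectation for false positives follows from counting: of the 12 possible base substitutions, 4 are transitions (A<->G, C<->T) and 8 are transversions. The sketch below shows this; TiTvSketch and its methods are hypothetical names, and each SNP is represented as a two-character string of reference base then alternate base (uppercase A, C, G, T assumed).

```java
// Illustrative only: TiTvSketch is a hypothetical helper, not GATK code.
final class TiTvSketch {

    static boolean isPurine(char base) {
        return base == 'A' || base == 'G';
    }

    // A transition stays within purines (A, G) or within pyrimidines (C, T).
    static boolean isTransition(char ref, char alt) {
        return ref != alt && isPurine(ref) == isPurine(alt);
    }

    // Ti/Tv over SNPs given as "RA" strings (reference base, alternate base).
    // Returns Infinity or NaN when there are no transversions.
    static double tiTvRatio(String[] snps) {
        int transitions = 0;
        int transversions = 0;
        for (String snp : snps) {
            if (isTransition(snp.charAt(0), snp.charAt(1))) {
                transitions++;
            } else {
                transversions++;
            }
        }
        return (double) transitions / transversions;
    }

    public static void main(String[] args) {
        // All 12 possible substitutions, each seen once, as random errors would produce.
        String[] uniform = {"AC", "AG", "AT", "CA", "CG", "CT",
                            "GA", "GC", "GT", "TA", "TC", "TG"};
        System.out.println("Ti/Tv for uniformly random substitutions: " + tiTvRatio(uniform));
    }
}
```

A call set whose Ti/Tv sits between 0.5 and the expected ~2.1 can therefore be read as a mixture of random errors and true variants.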

39 A quality score aware Bayesian SNP caller produces accurate SNP calls
Chromosome 1, NA12878 calls from Solexa only
All calls:
- SNPs: 271K
- dbsnp %: 88%
- Ti/Tv
- Genotype chip concordance: % sensitivity / 99.9% specificity
Novel calls:
- 30K calls
- Ti/Tv =
Notes:
- We find 99.3% of the variable chip sites and call het / hom genotypes with 99.9% accuracy
- The overall Ti/Tv is ~2.1, very close to expectation
- 1 / 884 variants per base, a bit higher than 1 / 1000 expectation
- The majority of our SNPs are at known sites, consistent with expectations
- The Ti/Tv suggests a ~30% FP rate in this group (novel calls)
Calls from recalibrated, indel realigned Solexa NA12878 with LOD > 5

40 Consistency among SOLiD, 454, and Solexa reads enables an even more accurate set of calls
Chromosome 1, NA12878 calls requiring calls in Solexa and 454/SOLiD
All calls:
- SNPs: 235K
- dbsnp %: 92%
- Ti/Tv
- Genotype chip concordance: % sensitivity / 99.9% specificity
Novel calls:
- 16K calls
- Ti/Tv = 2.13
Notes:
- We lose some sensitivity to find sites at hapmap
- 1 / 1052 variants, now very close to 1/1000 expectation
- Our dbsnp rate increased by 4%
- The novel calls are now as good as the SNPs at known sites
Calls from recalibrated, indel realigned NA12878 with LOD > 5

41 Using these concordant calls allows us to identify de novo mutations
Algorithm for identifying putative de novo mutations:
- Dad: confident homozygous reference site
- Mom: confident homozygous reference site
- Daughter: novel SNP consistent in all three techs
De novo mutation calls from chr1 of NA12878
[Table: columns Broad and Sanger; row Putative de novo: 156]
This set includes 4 true de novo mutations!
Calls from recalibrated, indel realigned NA12878, NA12891, NA12892
Validation data courtesy of Matt Hurles and Philip Awadalla
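The three conditions of the algorithm translate into a single predicate. This is a minimal sketch under stated assumptions: DeNovoSketch and its parameters are hypothetical, "novel" is taken to mean that the site is not already known (for example, not in dbsnp), and the per-technology calls are assumed to have been made upstream.

```java
// Illustrative only: DeNovoSketch and its parameters are hypothetical, not GATK code.
final class DeNovoSketch {

    // A site is a putative de novo mutation when both parents are confidently
    // homozygous reference and the child carries a novel SNP that was called
    // independently in every sequencing technology.
    static boolean isPutativeDeNovo(boolean dadConfidentHomRef,
                                    boolean momConfidentHomRef,
                                    boolean knownSite,
                                    boolean[] childCalledByTech) {
        if (!dadConfidentHomRef || !momConfidentHomRef) {
            return false; // a parent may carry the variant, or is too uncertain
        }
        if (knownSite) {
            return false; // the slide requires a novel SNP
        }
        if (childCalledByTech.length == 0) {
            return false; // no evidence at all
        }
        for (boolean called : childCalledByTech) {
            if (!called) {
                return false; // must be consistent in all technologies
            }
        }
        return true;
    }

    public static void main(String[] args) {
        boolean[] solexa454Solid = {true, true, true};
        System.out.println(isPutativeDeNovo(true, true, false, solexa454Solid));
    }
}
```

As the slide's own counts show, most sites passing such a filter are still false positives, so the output is a candidate list for validation rather than a final call set.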

42 Validated as a true de novo mutation
[Figure: read pileups for Mom, Dad and Child (454, SLX, SOLiD). No evidence in parents; consistent in all three technologies in the child]

43 We apply a generalization of the single sample caller to pilot 1
[Figure: 4x reads on average for individuals 1 to N give single sample calls; expectation maximization over the allele frequency and genotype frequencies yields SNPs]
- This approach allows us to combine our poorly determined single sample calls (it's 4x after all) to make high quality population calls
- We have been working with the Sanger (Durbin) and U. Michigan (Abecasis) to make project wide Pilot 1 calls
- Other approaches use LD to separate machine errors (which are inconsistent with LD) from true variants (which are consistent)
- Very powerful but introduces an LD bias into the call set
- The best combined approach is still an open question
Work of Jared Maguire and Mark Daly

44 Available in preliminary form from 1000 genomes
- Pilot 1: ~17M SNPs discovered in three populations with limited genotype certainty
- Pilot 2: ~2.7B genotyped sites and ~3M SNPs per person in three trios to very high accuracy
- Pilot 3: ~13K SNPs in 1000 genes with MAF >1% to high accuracy
- Preliminary calls have been made for all pilots 1, 2 and 3 by several centers and groups around the world
- All three pilots are proceeding to validation in the next month
- Final, high quality calls by November
- Publication and public release in December

45 Help develop and apply methods in NGS to medical genetics projects
The Genome Sequencing and Analysis group in Medical and Population Genetics at the Broad Institute is hiring
- Computational Biologist: Ph.D. level research scientist focused on algorithmic R&D
- Bioinformatic Analyst: B.A./M.A. level analyst focused on algorithmic R&D
- Senior Software Engineer: B.A./M.A./Ph.D in CS with 5+ years of experience to lead MPG software development projects
- Software Engineer: B.A. in CS to develop software throughout MPG
Talk to me for more information or
