Genome STRiP ASHG Workshop demo materials. Bob Handsaker October 19, 2014

1 Genome STRiP ASHG Workshop demo materials Bob Handsaker October 19, 2014

2 Running Genome STRiP directly on AWS. Genome STRiP: Structure in Populations. Population-aware discovery and genotyping of structural variation from whole-genome sequencing. Integrating sequencing data features (split reads, read pair spacing, depth of coverage) with population patterns across genomes (allele sharing, population heterogeneity, allelic substitution, shared haplotypes). Harvard Medical School

3 Cloud demo: Genome STRiP command line. [Diagram labels: StarCluster, cloud storage, sequencing data, Amazon Web Services, Genome STRiP]

4 Cloud computing scenarios. Why are people interested in Genome STRiP on the cloud? Increase compute and storage capacity for large-scale processing: large genome studies; economical and with short lead time. Utilize data sets that are stored in the cloud: public data sets (e.g. 1000 Genomes); data sharing with collaborators; no need to download bulky data to each site.

5 Cookbook recipe: Genotyping in 1000 Genomes Phase 1. Inputs: a site VCF file describing the variants (e.g. large deletions) to genotype; 1000 Genomes data. Outputs: genotype VCF file; plots for quality control. You choose the BAM file location: cached copy on Amazon S3 storage, or HTTP from NCBI or EBI. Uses the StarCluster software from MIT for Amazon EC2 provisioning: http://star.mit.edu/cluster
[Figure: example genotyping plot. Histogram of samples by normalized read depth, with copy-number classes CN0 to CN6+ and NC. LOD: 1.3, CR: 100.0%, MAF: 0.22, EL: 9.5Kb]

6 Demo

7 Cloud computing support in Genome STRiP. Remote BAM file access: support for multiple file access protocols in addition to local files (HTTP / HTTPS, FTP, Amazon S3 protocol). Pre-computed metadata for 1000 Genomes Phase 1 and Phase 3: eliminates the need to run Genome STRiP preprocessing; avoids the need to download the 1000 Genomes BAM files; metadata is relatively compact, 5Gb (Phase 1) and 13Gb (Phase 3); ftp://ftp.broadinstitute.org/pub/svtoolkit/public_metadata/ Cookbook recipes for common scenarios: genotyping variants in 1000 Genomes samples.

8 Genome STRiP cookbook

9 Sample genotyping output. Standard VCF file with sample genotypes:
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG
DEL_2_99615 A <DEL> . . END= GT:FT:GQ 0/0:PASS:71 0/1:PASS:14
Genotyping plot for visual verification: histogram of normalized read depth. Colors indicate confident calls (gray samples are below 95% confidence). Small numbers on the plot indicate evidence from read pairs or split reads.
[Figure: genotyping plot for MERGED_DEL_2_99615. Histogram of samples by normalized read depth, with copy-number classes CN0 to CN6+ and NC. LOD: 1.3, CR: 96.0%, MAF: 0.04, EL: 3.9Kb]
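As a rough illustration of how the per-sample columns on this slide can be read, the sketch below pairs the FORMAT keys (GT:FT:GQ) with one sample's colon-separated values and flags calls at or above a confidence level. It is not part of Genome STRiP. Treating "95% confidence" as a phred-scaled GQ cutoff of about 13 is an assumption about how the gray samples in the plot are defined, and the sample name SAMPLE_2 is a hypothetical stand-in for the second, truncated column header.

```python
import math


def parse_sample_field(format_spec, sample_value):
    """Pair FORMAT keys (e.g. GT:FT:GQ) with one sample's colon-separated values."""
    keys = format_spec.split(":")
    values = sample_value.split(":")
    return dict(zip(keys, values))


def gq_threshold(confidence):
    """Phred-scaled quality matching a confidence level (0.95 gives about 13)."""
    return -10.0 * math.log10(1.0 - confidence)


def is_confident(call, confidence=0.95):
    """Assumption: a call is confident if FT is PASS and GQ meets the cutoff."""
    return call.get("FT") == "PASS" and float(call["GQ"]) >= gq_threshold(confidence)


if __name__ == "__main__":
    format_spec = "GT:FT:GQ"
    # HG00096 is from the slide; SAMPLE_2 is a hypothetical placeholder name.
    samples = {"HG00096": "0/0:PASS:71", "SAMPLE_2": "0/1:PASS:14"}
    for name, value in samples.items():
        call = parse_sample_field(format_spec, value)
        label = "confident" if is_confident(call) else "low confidence"
        print(name, call["GT"], label)
```

A real VCF can carry missing values (".") in GQ, which this sketch does not handle; a dedicated VCF parser is the safer choice for production use.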

10 Command summary
starcluster start gs-cluster -s 1                      Launch Amazon compute cluster
starcluster put gs-cluster example.vcf example.vcf     Copy input file from local to cloud
starcluster sshmaster gs-cluster                       Log in to remote cluster
./genotype_sites.sh example.vcf run1                   Run genotyping command script
starcluster get gs-cluster run1 run1                   Copy output files from cloud to local
starcluster terminate gs-cluster                       Shut down compute cluster
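The command sequence on this slide could be scripted along the following lines. This is a minimal sketch, not something shipped with Genome STRiP: the function names are hypothetical, the cluster name, file names, and script name are taken from the slide, and passing the genotyping script to `starcluster sshmaster` as a remote command (rather than logging in interactively, as in the demo) is an assumption. By default it only prints the commands.

```python
import shlex
import subprocess


def build_commands(cluster="gs-cluster", site_vcf="example.vcf", run_dir="run1"):
    """Return the demo's commands, in order, as argument lists."""
    # Assumption: the script is run as a remote command instead of interactively.
    remote = "./genotype_sites.sh {} {}".format(
        shlex.quote(site_vcf), shlex.quote(run_dir)
    )
    return [
        ["starcluster", "start", cluster, "-s", "1"],       # launch compute cluster
        ["starcluster", "put", cluster, site_vcf, site_vcf],  # copy input to cloud
        ["starcluster", "sshmaster", cluster, remote],      # run genotyping script
        ["starcluster", "get", cluster, run_dir, run_dir],  # copy outputs back
        ["starcluster", "terminate", cluster],              # shut down cluster
    ]


def run_workflow(commands, dry_run=True):
    """Print each command; execute it as well only when dry_run is False."""
    for argv in commands:
        print(" ".join(shlex.quote(arg) for arg in argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_workflow(build_commands())
```

Because `check=True` stops at the first failing step, a failed genotyping run would leave the cluster running; anyone adapting this sketch may want to terminate the cluster in a `finally` block.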

11 For more information. Bonus evening session: tonight (Monday) 6:30-8:00 PM, Room 24, Upper Level. Web site: http:// Support forum (Genome STRiP topic in GATK forum): http://gatkforums.broadinstitute.org/categories/genomestrip AWS Support in Genome STRiP: Seva Kashin, Poster 603 T (Tuesday afternoon). Multi-allelic copy number variation in humans: early look at upcoming Genome STRiP functionality for duplications and multi-allelic CNVs.

12

13 Intro Slides for Gabor

14 Genome STRiP: Genome STRucture in Populations. Integrates multiple features of sequence data with population-based patterns across many individuals. Handsaker, R.E., Korn, J.M., Nemesh, J. & McCarroll, S.A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet 43 (2011)

15 Genome STRiP: structural variation analysis from sequence data. Integrative: combines multiple features of the sequence data (read pairs, read depth, split reads); integrative approaches have consistently shown higher accuracy. Population-aware: increases power and accuracy; particularly important for low-coverage genomes. Modular architecture: discovery of new variants; genotyping of newly discovered variants and/or known variants; includes tools for QC / analysis. Initial prototype developed for analyses in the 1000 Genomes Project: low false discovery rate and high sensitivity.

16 Demo Slides