WGS for the 100,000 Genomes


1 WGS for the 100,000 Genomes
Mark T. Ross, Population and Medical Genomics Group
Applying Genomics to Cancer, 21 Sept
Illumina, Inc. All rights reserved. Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio, Epicentre, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect, ForenSeq, MiSeq, MiSeqDx, MiSeqFGx, NeoPrep, Nextera, NextBio, NextSeq, Powered by Illumina, SeqMonitor, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.

2 Genomics England Partnership
- 100,000 genomes
- Cancer and rare genetic disease
- ISO-accredited workflow (2016)
- Data delivered electronically, stored securely and analysed in England Data Centre
- Combine with extracted clinical information for analysis, interpretation, and aggregation

3 A Clinical Ecosystem for Precision Medicine
[Diagram labels: Researcher, Knowledge base, Clinician, Patient information, Treatment choice]
[Timeline rows: Event, Timeline, Sequencing. Event labels: Birth, Biopsy, Surgery, Relapse, Chemo, Death, Metastasis, Relapse]

4 Fast WGS for genetic diagnosis of intensive care patient
Undiagnosed condition: male child presents at 5 months with developmental regression, hypotonia, and seizures
One WGS test: DNA sample to answer in 4 days
Filter annotations rapidly using generic queries:
- Number of transcripts with functional variants: 13,367
- With 2 variants** and <5% allele frequency: 1,458
- And predicted to be functional: 287
- And in a gene linked to disease: 35
- And predicted to be deleterious: 21
- Evolutionarily conserved: 6
- Apply control genomes filter: 1
Confirm Menkes diagnosis: a novel hemizygous variant in ATP7A, a gene with mutations known to disrupt copper metabolism
** Shorthand for homozygous, compound heterozygous, or hemizygous positions
Kingsmore et al. (Children's Mercy Hospital)
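The stepwise filtering on this slide can be sketched as a funnel that applies one query at a time and records how many candidate transcripts remain. This is a minimal illustration, not the pipeline used in the case: the `Transcript` fields, the gene names other than ATP7A, and the toy records are all hypothetical, and reading the control genomes filter as "drop anything seen in controls" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical per-transcript annotation record (field names are assumptions)."""
    gene: str
    two_hits: bool        # homozygous, compound heterozygous, or hemizygous
    allele_freq: float    # population allele frequency, 0 to 1
    functional: bool
    disease_gene: bool
    deleterious: bool
    conserved: bool
    in_controls: bool


# Filters in the order listed on the slide.
FILTERS = [
    ("two hits and <5% allele frequency",
     lambda t: t.two_hits and t.allele_freq < 0.05),
    ("predicted functional", lambda t: t.functional),
    ("gene linked to disease", lambda t: t.disease_gene),
    ("predicted deleterious", lambda t: t.deleterious),
    ("evolutionarily conserved", lambda t: t.conserved),
    # Assumption: the control genomes filter removes anything seen in controls.
    ("absent from control genomes", lambda t: not t.in_controls),
]


def run_funnel(transcripts, filters=FILTERS):
    """Apply each filter in turn; return survivors and the count after each step."""
    remaining = list(transcripts)
    counts = [("start", len(remaining))]
    for label, keep in filters:
        remaining = [t for t in remaining if keep(t)]
        counts.append((label, len(remaining)))
    return remaining, counts


# Toy data only; GENE_B to GENE_E are placeholders, not real findings.
toy = [
    Transcript("ATP7A", True, 0.0, True, True, True, True, False),
    Transcript("GENE_B", True, 0.20, True, True, True, True, False),
    Transcript("GENE_C", True, 0.01, True, False, False, False, False),
    Transcript("GENE_D", False, 0.01, True, True, True, True, False),
    Transcript("GENE_E", True, 0.01, True, True, True, True, True),
]
survivors, counts = run_funnel(toy)
for label, n in counts:
    print(f"{label}: {n}")
```

Keeping the per-step counts makes it easy to see which query removes the most candidates, which is how the slide presents the reduction from 13,367 transcripts to a single one.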

5 WGS of brain tumour defines new treatment options
- 44-year-old with glioblastoma recurrence
- Not responding to treatments
- Sequence genomes from three biopsies and normal genome
- DNA rearrangements amplify growth regulator genes
- New treatment indicated (10 days from start)
[Figure: PDGFRA gene amplified compared with normal. Pathway diagram labels: amplified PDGFRA, c-kit, imatinib, sunitinib, pazopanib, PI3K, AKT, mir-26a-2, PTEN, cancer growth]
Swanton et al. (CRUK London Research Institute)

6 Genome-wide mutation signatures in cancer
- Trinucleotide contexts around somatic mutations reveal signatures of exposure
- Mutation context spectra reveal catastrophic events (kataegis)
[Figure panels: "Sequence context" (Smoking, UV light) and "Context and density"]
- Smoking: most are NCN>NAN
- UV: most are TCN>TTN
- C>T, 14Mb on chr 6 of a BRCA1 mutated tumour
Alexandrov et al. (Sanger); Nik-Zainal et al. 2012
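A trinucleotide context spectrum of the kind described on this slide can be tallied by looking up the bases on either side of each somatic substitution. The sketch below is an illustration only: it assumes an in-memory reference string and 0-based positions, and it follows the common convention of reporting every mutation from the pyrimidine (C or T) strand. The function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def context_class(reference, pos, alt):
    """Label a substitution at 0-based `pos` as e.g. 'T[C>T]G'.

    Mutations at purine reference bases are flipped to the opposite
    strand so that the mutated base is always C or T.
    """
    if pos < 1 or pos > len(reference) - 2:
        raise ValueError("position needs a base on both sides")
    tri = reference[pos - 1:pos + 2].upper()
    alt = alt.upper()
    if tri[1] in "AG":
        tri, alt = revcomp(tri), revcomp(alt)
    return f"{tri[0]}[{tri[1]}>{alt}]{tri[2]}"


def spectrum(reference, mutations):
    """Count context classes for an iterable of (pos, alt) substitutions."""
    return Counter(context_class(reference, pos, alt) for pos, alt in mutations)


# Toy example: a C>T and the equivalent G>A on the other strand.
toy_reference = "ATCGA"
toy_counts = spectrum(toy_reference, [(2, "T"), (3, "A")])
print(toy_counts)
```

In this notation the slide's UV pattern (TCN>TTN) appears as classes of the form `T[C>T]N`, and the smoking pattern (NCN>NAN) as `N[C>A]N`.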

7 Evolving the technology for medicine
Sample -> Sequence -> Analyse -> Annotate -> Interpret -> Answer
- Why genomes?
- Working with clinical samples
- Maximise throughput, coverage and accuracy
- Shrink the data footprint
- Annotation, interpretation, reporting
- Scale up: a fast, convenient, high throughput ISO workflow
- Aggregate the information for maximum value

8 Why focus on whole genomes?
The complete, accurate genetic make-up of an individual
- Simple library prep: reproducible, automated, low cost, low hands-on, fast (4 hrs)
- Minimum bias: PCR-free, recovers all genome (reference and non-reference)
- Low input DNA: maximises access to clinical samples
- Maximum coverage (depth dependent): 30x genome (110 Gb) gives >96% of hg19 genome, ~98% of Ensembl exons
- All variant types: all variant types can be detected, both novel and known

         Large SVs  Balanced translocation  Distant consanguinity  Uniparental disomy  Coding variants  Non-coding variants
Panel    Knowns     Knowns                  No                     No                  Yes              Knowns
Exome    Partial    No                      Partial                Partial             Yes              No
Genome   Yes        Yes                     Yes                    Yes                 Yes              Yes

Questions/trade-offs:
- Cost, data volume/management, study size (n)
- Sensitivity, diagnostic yield, long term value

9 Improving coverage and reducing bias
[Figure: normalised genome coverage at ARX exon 2 (77% GC). Library preparations compared include Nano v1, Nano v2, 1 ug PCR, 500 ng PCR, PCR-free (gel), 100 ng PCR-free (beads) samples 1 and 2, and HiSeq X PCR-free (beads, v2)]

10 Improving technology
- Patterned flowcells for higher density (2x)
- Faster imaging for greater speed (6x)
- Improved SBS chemistry for accuracy & readlength ( % cycle efficiency)
- Increasing data rate & reducing cost
[Figure: 2 um features vs 0.4 um features (images to same scale). Chart axes: Data Rate (Gb/day), WGS Price ($k/30x)]

11 Assessing accuracy
Variant calling accuracy: Pt truth sets enable objective assessment and improvement of algorithm performance
[Table columns: Pipeline; SNV (0 bp) Sn (%), Sp (%); Indel (1-50 bp) Sn (%), Sp (%); Time (hr)**. Rows: HAS (Isaac) (H2 15); GATK]
- Sensitive to depth and quality of alignment
- Very low false positive SNP or indel call rate
** Using 40 CPU, Intel 2.80 GHz, 132 GB RAM
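Scoring a pipeline against a truth set comes down to counting calls that match, calls that are extra, and truth variants that are missed. The sketch below is a simplified illustration: it treats a variant as a `(chrom, pos, ref, alt)` tuple and requires exact matches, whereas real comparisons also need variant normalisation. It reports sensitivity and precision; the slide does not say how its "Sp" column was computed, so equating it with precision would be an assumption.

```python
def compare_to_truth(calls, truth):
    """Compare called variants with a truth set using exact tuple matching.

    Each variant is a (chrom, pos, ref, alt) tuple. Returns counts plus
    sensitivity (TP / (TP + FN)) and precision (TP / (TP + FP)).
    """
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but not called
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}


# Toy data only: four truth variants, three recovered, one false call.
toy_truth = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 300, "G", "A"), ("chr2", 400, "T", "C")]
toy_calls = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 300, "G", "A"), ("chr3", 500, "A", "T")]
toy_result = compare_to_truth(toy_calls, toy_truth)
print(toy_result)
```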

12 Shrinking the data footprint
File formats and sizes per genome:
- BAM: 120G
- RRQS(1) BAM: 80G
- CRAM: 30G
- gVCF(2): 2G
- Annotated gVCF: 2G
Usage labels: Cloud; Plug and play; Temporary or R&D; Archive & look-up; Archive & aggregate; iPad
(1) Reduced Resolution Quality Scores
(2) Genome VCF encodes all positions and low quality calls
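Reduced-resolution quality scores shrink files by mapping the full range of Phred values onto a handful of representative values, which leaves fewer distinct symbols for a compressor to encode. The sketch below illustrates that idea only: the bin boundaries in `BINS` are assumed for the example rather than taken from the slide, and the uniformly random qualities are not realistic sequencer output.

```python
import random
import zlib

# (lowest Q in the bin, representative Q written out). Assumed, illustrative bins.
BINS = [(2, 6), (10, 15), (20, 22), (25, 27), (30, 33), (35, 37), (40, 40)]


def bin_quality(q):
    """Map a Phred quality to its bin's representative value."""
    if q < BINS[0][0]:
        return q  # below the first bin: leave unchanged
    binned = BINS[0][1]
    for low, representative in BINS:
        if q >= low:
            binned = representative
    return binned


def bin_quality_string(qual, offset=33):
    """Apply binning to a Phred+33 encoded quality string."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qual)


# Demonstration on synthetic data with a fixed seed.
rng = random.Random(0)
raw = "".join(chr(rng.randint(2, 41) + 33) for _ in range(10_000))
binned = bin_quality_string(raw)
raw_size = len(zlib.compress(raw.encode("ascii")))
binned_size = len(zlib.compress(binned.encode("ascii")))
print(f"compressed raw: {raw_size} bytes, compressed binned: {binned_size} bytes")
```

The trade-off is lossy: the original per-base qualities cannot be recovered from the binned file, which fits the slide's labelling of the full BAM as a temporary or R&D format and the smaller files as archives.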

13 Adding comprehensive annotation
How do we create medically useful genomes?
Capture publicly available/licensed information (VEP-based):
- Genes: nomenclature, coordinates, transcripts (Ensembl, NCBI)
- Functional effects: VEP consequence, regulatory regions (Ensembl, UCSC, ENCODE)
- Population information: 1000 Genomes, EVS
- Disease association: ClinVar, COSMIC, PharmGKB (+ manual review)
Accelerate and improve efficiency of annotation process:
- Streamline reporting algorithms and data storage structures
- Speed: from 11 hr 48 min to 3 min per genome
- Footprint: from 2+8 Gb cache+RAM to Gb per genome

14 Evolving the technology for cancer medicine
Sample -> Sequence -> Analyse -> Annotate -> Interpret -> Answer
- Tumour and normal genomes
- Extra depth of the tumour genome sequence
- Sample purity, quality and quantity
- Variability in fixation and extraction processes
- Evolution of disease and heterogeneity
- Somatic variant calling, annotation, interpretation, reporting

15 Disease evolution and heterogeneity
- CLL WGS timeseries
- Heterogeneity, treatment response
- Remission sample has disease
[Figure: abundance and somatic mutation frequency (%) of the normal population and mutation classes 1 to 3 across time points a to e]
Schuh et al., Oxford

16 Secondary Analysis Workflows
Two workflows: germline variants and somatic variants
Inputs: Sequencing data -> Isaac aligner -> BAM file (germline); Normal BAM + Tumour BAM (somatic)
Tools shown: Firebrand, Isaac Diploid Variant Calling (Starling), Canvas, Manta germline, Strelka, Seneca, Manta somatic, Metrics
Germline outputs: VCF germline SNVs and indels, VCF germline CNVs, VCF germline SVs
Somatic outputs: VCF somatic indels, VCF somatic SNVs, VCF somatic CNVs, VCF somatic SVs

17 Excerpts from colorectal cancer Encore report

18 Workflow: DNA to annotated Genome
Automation + 96 well format
Main steps: Sample Quant; Quality Control, FFPE check; Library Prep; Library Quality Control (qPCR); X10 Sequence Run + QC; Analysis; Quality Control, identity check, contamination screen; Network Delivery
Supporting steps: Sample Accession; Genotype pre Library Amp (if needed)/ Genotype post Flowcell prep; Genome build, tumour/normal subtraction, gVCF annotation (HiSeq Analysis Software)
Pre-PCR lab
LIMS: project configured; track lab and analysis processes; project management, pipeline automation
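One common way to implement an identity check like the one named in this workflow is to compare genotypes obtained before library preparation with genotypes called from the finished genome at the same sites. The slide does not describe how the check is done, so the sketch below is an assumption throughout: the site IDs, the two-letter genotype strings, and the function name are hypothetical, and any pass threshold would need to be set by the laboratory.

```python
def genotype_concordance(pre_calls, wgs_calls):
    """Fraction of shared, non-missing sites where the two genotypes agree.

    Both arguments map a site ID to a genotype string such as 'AG'.
    Allele order is ignored, so 'AG' and 'GA' count as a match.
    """
    shared = [site for site in pre_calls
              if pre_calls[site] and wgs_calls.get(site)]
    if not shared:
        raise ValueError("no shared genotyped sites to compare")
    matches = sum(1 for site in shared
                  if sorted(pre_calls[site]) == sorted(wgs_calls[site]))
    return matches / len(shared)


# Toy data only: four sites, one missing in the genome calls, one mismatch.
toy_pre = {"site1": "AG", "site2": "CC", "site3": "TT", "site4": "AC"}
toy_wgs = {"site1": "GA", "site2": "CC", "site3": "CT", "site4": ""}
toy_concordance = genotype_concordance(toy_pre, toy_wgs)
print(f"concordance: {toy_concordance:.2f}")
```

A low concordance would point to a sample swap somewhere between accession and sequencing, which is the failure an identity check is meant to catch.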

19 Status of initial pipeline
Rare genetic disease samples (18th Sept 2015)
[Chart: Genomes (n) by status: DNA in QC, Fail QC, Sequencing, Analysis, Delivered, Fail/on hold. Genomes n=7,]
- 4 samples failed on DNA quantity
- 1 sample contaminated
- 30 samples on hold awaiting resolution of inconsistent gender-manifest information

20 Future Prospects
- Evaluate utility of genome sequencing in healthcare (Genomics England)
- A powerful lens for the patient and personalised medicine
- Early applications in rare genetic disease and cancer (40% of population)
- Population-level screening for systematic studies
- Aggregate G, and G+P data to add value (standard, secure, accessible)
- Integrate with targeted testing, e.g. pre-natal, MRD and other tests
[Diagram labels: Researcher, Knowledge, Clinician, Information, Treatment choice, Patient]

21 Acknowledgements
Illumina: David Bentley, Sean Humphray, Elliott Margulies, Mike Eberle, Ryan Taft, Lisa Murray, Klaus Maisinger, Come Raczy, Semyon Kruglyak, Stewart MacArthur, Philip Tedder, John Peden, Roman Petrovski, Kevin Hall, Keira Cheetham, Jennifer Becq, Miao He, Russell Grocock, Peter Saffrey, Josh Bernd, Richard Shaw, Chris Saunders, Shankar Ajay, Pedro Cruz, Jason Betley, Jacqueline Weir, Zoya Kingsbury, Core Sequencing Group, David Quackenbush, Eddy Kim, Van Lee-Pham, Anna Powell, Francisco Garcia, Kirby Bloom, Tina Hambuch, Erica Ramos & many others
Collaborators:
- The team at Genomics England
- Stephen Kingsmore and the Children's Mercy Hospital, Kansas City team
- Charles Swanton and team, CRUK London Research Institute
- Anna Schuh, Jenny Taylor and the WGS500 consortia, Oxford
- Lisa Russell, Christine Harrison and team, LRCG Newcastle
- Mike Stratton and the Cancer Genome Project team, Wellcome Trust Sanger Institute, Hinxton
- Gil McVean and Zamin Iqbal, WTCHG, Oxford
- Jan Veldink and team, UMC Utrecht
- Willem Ouwehand, Lucy Raymond and the UCAM Project team, Haematology, Addenbrookes, Cambridge
- Andrew Beggs and team, Cancer Sciences, Birmingham