NUCLEOTIDE RESOLUTION STRUCTURAL VARIATION DETECTION USING NEXT-GENERATION WHOLE GENOME RESEQUENCING
Ken Chen, Ph.D.
kchen@genome.wustl.edu
The Genome Center, Washington University in St. Louis
The path to genomic medicine
- Human genome sequencing finished in 2003
- Genomic medicine: healthcare tailored to the individual based on genomic information
- "The central mission for NHGRI and the field of genomics is to establish the path to the realization of genomic medicine." (Eric Green, Personal Genomes, 2010)
- Question: How do we establish the path?
[Photo: Dr. Ellis and Rhonda Levan] Rhonda Levan, breast cancer clinical trial participant, has been given a new lease on life, thanks to Matthew Ellis, MB, BChir, PhD, a Washington University medical oncologist at the Siteman Cancer Center
Medical Genomes @ WUGC
- Identification of genomic variants in individual genomes
  - Single Nucleotide Variants (SNVs)
  - Structural Variants (SVs)
- Genotypical characterization of the variants
  - Frequency in population
  - Heterogeneity, origin, and progression
- Phenotypical characterization of the variants
  - Functional annotation
  - Integration with functional/clinical data (expression, interaction)
  - Network biology
  - Model organisms
The genomic variants
Our 3 Gb genomes are ~99% identical; however, each individual genome differs from the reference:
- Single nucleotide variants (SNVs), ~3-4 M
- Structural variants (SVs), ~300-500 K: genomic alterations that involve segments of DNA longer than 1 bp
- Novel sequences, ~5 Mb [Li et al., Nature Biotech, 2010]
Classes of Structural Variants (SVs)
[Schematic columns: reference, new allele]
Class of SV            | Change in # copies of refseq | Sequence spacing/orientation | Novel sequence
Deletion               | Yes                          | Yes                          | Yes
Tandem duplication     | Yes                          | Yes                          | Sometimes
VNTR                   | Yes                          | Yes                          | No
Dispersed duplication  | Yes                          | Yes                          | Yes
Novel insertion        | No                           | Yes                          | Yes
SINE/LINE insertion    | Yes                          | Yes                          | Yes
Inversion              | No                           | Yes                          | Yes
Translocation          | Sometimes                    | Yes                          | Yes
Courtesy: Matt Hurles, 1000 Genomes
SVs in cancer
- Nature Genetics 36, 331-334 (2004)
- Total cases: 59,570 (Nov. 22, 2010), Mitelman Database of Chromosome Aberrations
Identification of variants: the resequencing approach
[Diagram labels: DNA Samples, Sequencer, Reads, Reference, Computer, SNVs, SVs]
SV detection: paired-end read mapping
[Figure labels: Var, Ref, SV, d, 200-500 bp DNA fragments, Microhomology]
3 types of evidence:
- Normal read-pair
- Discordant read-pair
- Split-read
Approaches  | Read Depth     | Read Pair                        | Split Reads
Classes     | DEL, DUP       | All except large novel insertion | DEL, small INS, DI, inversion
Size range  | > kb           | > 50 bp                          | > 1 bp, < 1 Mb
Resolution  | kb             | 50 bp                            | 1 bp
Tools       | CNV-HMM, CMDS  | BreakDancer, GASV                | Pindel
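The read-pair evidence on this slide can be illustrated with a small sketch. It assumes a library with a known mean insert size and standard deviation and a cutoff of three standard deviations. The function name, the default values, the strand encoding, and the output labels are all hypothetical and are not taken from BreakDancer, GASV, or any other tool named here.

```python
def classify_read_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                       read_len=100, mean_insert=350.0, sd_insert=30.0,
                       n_sd=3.0):
    """Label a mapped read pair as 'normal' or as a discordant signature.

    pos1/pos2 are leftmost mapped coordinates; strands are '+' or '-'.
    All thresholds are illustrative assumptions.
    """
    if chrom1 != chrom2:
        return "inter_chromosomal"
    # Order the two ends by mapped position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)])
    if left_strand == right_strand:
        return "inversion"            # both ends on the same strand
    if (left_strand, right_strand) == ("-", "+"):
        return "everted"              # ends face away from each other
    # Outer distance spanned by the pair on the reference.
    span = right_pos + read_len - left_pos
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"             # ends map farther apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"            # ends map closer than expected
    return "normal"
```

For example, a forward/reverse pair whose ends map 4 kb apart in a 350 bp library is labeled "deletion", while the same pair mapping 350 bp apart is "normal".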
SV detection: paired-end read mapping
Detection power
- Approaches compared: Read Depth, Read Pair, Split Reads, Targeted Assembly
- Factors: Homology, SV size, Insert size, Read length, Physical coverage, Sequence coverage
[Table cells: six "Maybe" entries; positions not recoverable]
BreakDancer: detect SVs from discordant read pairs
[Figure: Var vs. Reference read-pair signatures by type, with anchor regions a and b]
- Deletion
- Insertion
- Inversion
- Intra-chromosomal translocation
- Inter-chromosomal translocation
[Figure: insert size distribution (x-axis: Insert size; y-axis: Density) with cutoff c_del]
Model for an SV with k_i observed supporting read pairs:
$P(n_i \ge k_i)$, $n_i \sim \mathrm{Poisson}(\lambda_i)$, $\lambda_i = \frac{(a + b)\,n}{G}$
Jointly analyze multiple libraries:
$\chi^2_{2m} = -2 \sum_{j=1}^{m} \log_e(P_j)$, $Q = -\log_{10}(P)$
The SV score summarizes:
1. Number of supporting reads
2. Size of the anchor region a, b
3. Physical coverage of each library
4. Insert size distribution
Chen et al., Nature Methods, 2009
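The scoring formulas on this slide can be worked through numerically. The sketch below is not BreakDancer's code. It assumes that n is the total number of discordant pairs of the given type in a library, that G is the reference genome length, that the P_j being combined are per-library p-values, and that the score is Q = -log10(P) with the minus signs restored from the garbled slide text. All function names are hypothetical.

```python
import math

def expected_pairs(a, b, n, genome_size):
    """lambda = (a + b) * n / G: expected chance pairs in the anchor regions."""
    return (a + b) * n / genome_size

def poisson_tail(k, lam):
    """P(n >= k) for n ~ Poisson(lam), lam > 0, summed upward from k."""
    if k <= 0:
        return 1.0
    # Probability mass at exactly k, computed in log space for stability.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, j = 0.0, k
    while True:
        total += term
        j += 1
        term *= lam / j
        # Stop once past the mode and the next term is negligible.
        if j > lam and term <= total * 1e-16:
            break
    return min(total, 1.0)

def fisher_combine(p_values):
    """Combine m p-values: chi2_{2m} = -2 * sum(ln P_j), return upper tail."""
    m = len(p_values)
    half = -sum(math.log(max(p, 1e-300)) for p in p_values)  # chi2 / 2
    # Closed-form chi-square survival function for even degrees of freedom.
    term, total = 1.0, 1.0
    for j in range(1, m):
        term *= half / j
        total += term
    return min(1.0, math.exp(-half) * total)

def sv_score(p):
    """Q = -log10(P); larger means stronger evidence."""
    return -math.log10(max(p, 1e-300))

# Two libraries, each with 1 kb of anchor sequence, 3 M discordant pairs,
# a 3 Gb genome, and 6 or 4 supporting pairs (illustrative numbers only).
lam = expected_pairs(500, 500, 3_000_000, 3_000_000_000)
p_combined = fisher_combine([poisson_tail(6, lam), poisson_tail(4, lam)])
print(round(sv_score(p_combined), 2))
```

Summing the Poisson tail upward from k, rather than computing one minus the lower sum, avoids cancellation when the tail probability is very small, which is the usual case for a well-supported SV.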
CNVs detected by BreakDancer
TIGRA_SV: assemble SVs to nucleotide resolution
[Figure labels: Var, Ref, SV]
Pipeline: BreakDancer, TIGRA_SV, Integration
Aligned assembly at the breakpoint:
AGCTGT---CA
AGCTGTTGTCA
Chen et al., in revision
1000 Genomes Consortium, Nature, 2010
Mills et al., Nature, 2011
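Once an assembled contig is aligned, the breakpoint can be read off the gapped alignment at single-base resolution. The sketch below only illustrates that last step on the two aligned strings from this slide. It is not TIGRA_SV's assembly or alignment procedure, the function name is hypothetical, and which row is the reference is not recoverable from the slide text.

```python
def gap_intervals(gapped_row, partner_row):
    """Find runs of '-' in gapped_row.

    Returns 0-based half-open (start, end) intervals in the coordinates of
    the ungapped partner sequence.
    """
    if len(gapped_row) != len(partner_row):
        raise ValueError("aligned rows must have equal length")
    intervals = []
    partner_pos = 0   # position in the partner sequence without gaps
    start = None      # start of the gap run currently open, if any
    for g, p in zip(gapped_row, partner_row):
        if g == "-" and start is None:
            start = partner_pos
        elif g != "-" and start is not None:
            intervals.append((start, partner_pos))
            start = None
        if p != "-":
            partner_pos += 1
    if start is not None:
        intervals.append((start, partner_pos))
    return intervals

top = "AGCTGT---CA"
bottom = "AGCTGTTGTCA"
for start, end in gap_intervals(top, bottom):
    print(start, end, bottom.replace("-", "")[start:end])
```

On the slide's example this reports a 3 bp event spanning positions 6 to 9 of the longer sequence, with the affected bases TGT.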
Soft-clipping at SV breakpoints
CREST: SV identification from soft-clipped reads
Wang et al., submitted
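Soft-clipped reads mark candidate breakpoints because the clipped portion of the read starts exactly where the alignment to the reference stops. The sketch below shows one way to collect such positions from SAM-style POS and CIGAR fields and keep those supported by several reads. It is a simplified illustration and not CREST's algorithm. The function names and the `min_support` threshold are hypothetical.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clip_breakpoints(pos, cigar):
    """Return [(reference_position, side)] for soft clips in one alignment.

    pos is the 1-based leftmost aligned position (SAM POS), which excludes
    clipped bases.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    core = [(n, op) for n, op in ops if op != "H"]   # ignore hard clips
    found = []
    if core and core[0][1] == "S":
        found.append((pos, "left"))                  # first aligned base
    if len(core) > 1 and core[-1][1] == "S":
        # Reference bases consumed by the aligned part of the read.
        ref_len = sum(n for n, op in core if op in "MDN=X")
        found.append((pos + ref_len - 1, "right"))   # last aligned base
    return found

def candidate_breakpoints(alignments, min_support=2):
    """Keep positions where at least min_support reads are clipped."""
    counts = Counter()
    for pos, cigar in alignments:
        counts.update(soft_clip_breakpoints(pos, cigar))
    return sorted(bp for bp, c in counts.items() if c >= min_support)

reads = [(100, "70M30S"), (120, "50M50S"), (169, "1M99S"), (300, "100M")]
print(candidate_breakpoints(reads))
```

In this toy input, three reads end their aligned portion at reference position 169, so that position is reported as a right-clipped candidate breakpoint.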
Washington University genome center genomics landmarks
Software, 2007-2011 (http://genome.wustl.edu/software/):
- PolyScan, cnvhmm, SomaticSniper, BreakDancer, Pairoscope, Varscan, CMDS, TIGRA_SV, Pindel, CREST
Publications:
- Nature 455, TSP Lung
- Nature 455, TCGA Brain
- Nature 456, AML1 first cancer genome
- NEJM, AML recurrent mutation IDH1
- Nature 464, Breast cancer metastasis xenograft
- Nature 467, 1000 Genomes Pilot
- NEJM, AML DNMT3A connect genomics and epigenomics
- Nature, 1000 Genomes Pilot SV
- JAMA, Clinical diagnosis of atypical APL fusion
The genomic tsunami
Dec. 2010, WashU Genome Center sequenced:
- Total number of cases: 765
- Total number of cases completed: 408
- Total number of bases produced: ~100 Tb
More in 2011!
Diagnose a cryptic fusion using whole genome sequencing
Case history
- 39 y.o. female
- Presented with pancytopenia and DIC
- Histology: promyelocytic morphology
Complex cytogenetics, inconsistent with APL
- Cytogenetics: 46,XX,del(9)(q12q32),del(12)(q12q21)[6]
- Complex (poor risk), with no t(15;17)
- Interphase FISH: most consistent with an RARA-PML fusion, not the pathogenic PML-RARA fusion
Diagnostic conundrum
Question: Does this patient have APL?
- Leukemia with promyelocytic features
- FISH: no PML-RARA
- Cytogenetics: complex (poor risk), with no t(15;17); 46,XX,del(9)(q12q32),del(12)(q12q21)[6]
Options:
- APL: all-trans retinoic acid (ATRA); low-cost, non-toxic, and good outcome
- Cytogenetically complex AML: an allo-transplant; expensive, with a risk of lethal GvHD
Whole genome sequencing
- Tumor DNA: 187.1 gigabases (~43.7X)
  - 99.74% heterozygous SNPs
  - 99.53% homozygous SNPs
- Skin (normal) DNA: 200.1 gigabases (~46.8X)
  - 99.76% heterozygous SNPs
  - 99.64% homozygous SNPs
- Single HiSeq run (two flow cells)
- $10,000 (sequencing + analysis + validation)
- Completed in <6 weeks from sample receipt
Sequence based copy number analysis
The identification of a cryptic insertional translocation
[Figure labels: BreakDancer/TIGRA_SV, chr15]
Gene fusions produced by the insertional translocation
[Figure labels: Truncated protein; Out of frame; Expressed, Pathogenic]
Standard FISH failed in diagnosis
Conclusions
- The patient's unique oncogenotype was determined within the required clinical time frame
- The correct clinical decision was made: ATRA treatment was indicated, and the patient is in remission and doing very well
- A new set of fosmid-based FISH probes (each 30-40 kb in size) was made based on this novel discovery
- Two additional cases of cryptic insertional fusions have been identified so far
- It is time to begin applying whole genome sequencing as a diagnostic approach for understanding atypical cases of disease
Looking forward
Towards clinical sequencing
- Higher standard and better algorithms
Cancer genomics
- Recurrent mutations
- Tumor heterogeneity and progression
- Tumor genome architecture
Novel algorithms
- Complex structural rearrangements
- New technology
Integrative analysis
- RNA profiling and transcriptome assembly
- Epigenomic profiling
- Networks
Acknowledgements
- Richard Wilson, Elaine Mardis, Timothy Ley, George Weinstock
- Collaborators at WashU
- Medical Genomics
- The Genome Center
- Heng Li, Matt Hurles, Evan Eichler, Charles Lee
- 1000 Genomes Structural Variation Group
Thank you!