Next Generation Sequencing II: Personal Genomics. Jim Noonan, Department of Genetics


1 Next Generation Sequencing II: Personal Genomics. Jim Noonan, Department of Genetics

2 Personal genome sequencing. Identifying the genetic basis of phenotypic diversity among humans. Genetic risk factors for disease. All common diseases have a genetic component. Common variants explain a small fraction of inherited risk for common disease. Disease risk in an individual is likely to be due to rare or novel mutations of large effect. Identifying these requires sequencing thousands of individuals. Whole-genome sequencing and variant discovery as a diagnostic/prognostic tool. Personal genotyping: SNPs and copy number variants. Assess individual disease risk based on genotype.

3 The spectrum of clinically relevant mutation

4 Outline. Variant detection in personal genomes: de novo assembly vs. tiling. Strategies for generating a reference genome. Challenges to de novo assembly using short reads. Tiling short reads onto reference for variant detection. Personal genome sequencing: proof of concept. SNP detection in individual genomes. Copy number variation. Genome reduction strategies. Exome sequencing. Identifying disease mutations by personal genomics.

5 Assembling individual genomes. >>10^9 sequencing reads (36 bp to 10 kb); 3 Gb.
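
As a rough worked example of the scale on this slide, mean coverage is the number of reads times the read length, divided by the genome size. The sketch below takes the 36 bp read length and 3 Gb genome size from the slide; the read count of 10^9 and the 30x target are illustrative inputs, and the function names are hypothetical.

```python
import math

def mean_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average number of reads covering each base: reads * length / genome size."""
    return num_reads * read_length_bp / genome_size_bp

def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Smallest whole number of reads that reaches the target mean coverage."""
    total_bases = math.ceil(target_coverage * genome_size_bp)
    # Integer ceiling division, so a partial read still counts as one read.
    return -(-total_bases // read_length_bp)

GENOME_SIZE_BP = 3_000_000_000  # 3 Gb, as on the slide

print(mean_coverage(1_000_000_000, 36, GENOME_SIZE_BP))  # 12.0
print(reads_needed(30, 36, GENOME_SIZE_BP))              # 2500000000
```

Under these assumptions, 10^9 reads of 36 bp give only about 12x, which is why far more than 10^9 short reads are needed for deep coverage of a human genome.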

6 Assembling genomes: strategy. Generate reads. Find overlapping reads. Assemble reads into contigs. Join contigs into scaffolds using mate pairs. Join scaffolds into finished sequence. Terminology and concepts. genomic clone: a vector containing an insert of genomic DNA (BAC: kb; fosmid: 40 kb; plasmid: 3-5 kb). mate pair: reads from two ends of a clone (plasmid, BAC or fosmid) containing an insert physically mapped to the genome; used to order and orient contigs and scaffolds. coverage: average number of reads covering a particular position in the assembly. N50: the maximum length L such that 50% of all bases lie in contigs at least L bases long. Figure labels: contig, mate pair, scaffold. Figure sequence: AGTTGTATTATTAGAAACTGAGGGCTAAAAACTGTGCACATACACAGACACACATATTATTTTAATATAGATTTTCAATAATTGGTCTAGGATAAGGATAATATACAG. Confounding factors: repeats and polymorphism.
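
The N50 definition above can be made concrete with a short calculation. This is a minimal sketch: the function name `n50` and the contig lengths are made up for illustration.

```python
def n50(lengths):
    """Largest L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the bases are accounted for.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Hypothetical contig lengths, 32 bases in total; the two longest already hold 16.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

The same function applies to scaffold lengths, which is how a figure such as "N50 scaffold length" is obtained.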

7 Generating the human reference genome. Minimum tiling path. Plasmids: 3-5 kb inserts.

8 Whole genome shotgun sequencing. Shear genome into 3-5 kb fragments and clone into plasmids. from end sequencing of BACs or fosmids. This does not work well with short reads.

9 Alignment strategies for personal genomics. De novo assembly is difficult with current next-generation sequencing. 454 provides longer reads but lower coverage. Illumina, SOLiD, Helicos, etc. provide very high coverage but very short reads. Current strategies rely on tiling short reads to the reference human genome. Mismatch detection: SNPs are mismatches relative to reference. High coverage (>30x). Filtering for low quality bases. Copy number variant detection?
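
The mismatch-detection idea on this slide (call SNPs where tiled reads disagree with the reference, after dropping low-quality bases) can be sketched for a single site as follows. This is a toy illustration, not the method of any particular caller; the function name and all three thresholds are assumptions chosen for the example.

```python
from collections import Counter

def call_site(ref_base, pileup, min_qual=20, min_depth=10, min_alt_fraction=0.2):
    """Return the non-reference allele seen at this site, or None.

    pileup is a list of (base, phred_quality) pairs, one per aligned read.
    All thresholds are illustrative assumptions, not published defaults.
    """
    # Drop low-quality base calls before counting alleles.
    bases = [base for base, qual in pileup if qual >= min_qual]
    if len(bases) < min_depth:
        return None  # too little reliable coverage to make a call
    alt_counts = Counter(base for base in bases if base != ref_base)
    if not alt_counts:
        return None  # every reliable base matches the reference
    alt, count = alt_counts.most_common(1)[0]
    return alt if count / len(bases) >= min_alt_fraction else None

heterozygous = [("A", 30)] * 15 + [("G", 30)] * 15
noisy = [("A", 30)] * 28 + [("T", 5)] * 2  # the two mismatches are low quality
print(call_site("A", heterozygous))  # G
print(call_site("A", noisy))         # None
```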

10 Mapping short reads to a reference genome. Eland: quality-aware aligner for Illumina data; alignment policies: allows up to 2 mismatches per alignment; non-unique alignments are discarded. Maq and BWA: quality aware (take sequence quality into account); allow non-unique alignments. Index methods: reference genome is loaded into active memory as k-mers; very fast alignments. SOAP, Bowtie/Tophat/Cufflinks. SNP detection, paired-end mapping, RNA-seq, ChIP-seq, etc.
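
To illustrate the index idea, here is a toy mapper that holds the reference k-mers in a dictionary, uses exact k-mer matches as seeds, and then applies the policy described above: at most 2 mismatches, and non-unique alignments discarded. It is a teaching sketch under those assumptions and does not reproduce the internals of Eland, Maq, BWA, SOAP or Bowtie; the function names and the toy reference are hypothetical.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=2):
    """Return the unique 0-based start of the read on the reference, or None."""
    candidates = set()
    # Non-overlapping seeds: if the read holds at least max_mismatches + 1 seeds,
    # at least one seed is mismatch-free and will hit the index.
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Verify each candidate by counting mismatches over the full read.
    hits = [s for s in candidates
            if sum(a != b for a, b in zip(read, reference[s:s + len(read)])) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None  # discard non-unique alignments

reference = "A" * 12 + "ACGTGCTAGCTA" + "C" * 12
index = build_index(reference, 4)
print(map_read("ACGTGCTAGCTA", reference, index, 4))  # 12 (exact match)
print(map_read("TCGTGGTAGCTA", reference, index, 4))  # 12 (two mismatches)
```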

11 Several recently sequenced personal genomes

12 Abundant single nucleotide polymorphism in personal genomes. Mutation burden: 26,140 coding SNPs; 5,361 non-conservative AA changes; 153 premature stop codons. >30x coverage is necessary for SNP detection. Bentley et al. Nature 456:53 (2008)
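
One way to see why depth matters for SNP detection: at a heterozygous site each read samples one of the two alleles, so at low depth the non-reference allele may be seen too few times to call. The sketch below assumes error-free reads, equal sampling of both alleles, and a hypothetical rule requiring at least `min_alt_reads` supporting reads; none of these assumptions come from the slide.

```python
from math import comb

def prob_detect_het(depth, min_alt_reads):
    """P(at least min_alt_reads of `depth` reads carry the non-reference allele).

    Assumes each read independently samples either allele with probability 1/2.
    """
    missed = sum(comb(depth, k) for k in range(min_alt_reads)) / 2 ** depth
    return 1.0 - missed

# Detection probability at several fixed depths, requiring 4 supporting reads.
for depth in (5, 10, 20, 30):
    print(depth, round(prob_detect_het(depth, 4), 4))
```

Real coverage varies from site to site around the mean, so the average depth has to sit well above the depth at which a single site becomes callable.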

13 Personal genomics and human genetic diversity. Mix of 454 and Illumina. KB1: 10.2x seq on 454; N50 scaffold length 156 kb. Schuster et al. Nature 463:943 (2010)

14 Personal genomics and human genetic diversity

15 Insertions and deletions in personal genomes. African male (Illumina). Watson: large copy number variants involving 26 kb to 1.5 Mb. East Asian male (Illumina). Wang et al. Nature 456:60 (2008)

16 Abundant copy number variation in personal genomes (segmental duplications). Alkan et al. Nat Genet 41:1061 (2009)
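
A common way to estimate copy number from short reads is read depth: count the reads mapped in fixed windows and compare each window with a diploid baseline. The sketch below is a simplified illustration of that idea only and is not the pipeline of the cited paper. It assumes most windows are at the normal copy number (so the median is a fair baseline), ignores GC bias and mappability, and uses made-up window counts.

```python
from statistics import median

def copy_number_by_window(window_counts, ploidy=2):
    """Estimate integer copy number per window from mapped-read counts.

    Assumption: the median window is at normal (diploid) copy number.
    """
    baseline = median(window_counts)
    if baseline <= 0:
        raise ValueError("baseline read depth must be positive")
    return [round(ploidy * count / baseline) for count in window_counts]

# Hypothetical read counts: one window at half depth, one at double depth.
counts = [100, 98, 102, 51, 100, 205, 99]
print(copy_number_by_window(counts))  # [2, 2, 2, 1, 2, 4, 2]
```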

17 Copy number variation in personal genomes (segmental duplications)

18 Making sense of genome-wide variation data. Millions of polymorphisms in every genome. Most are not clinically relevant. Thousands of coding changes. What about SNPs outside of genes (regulatory mutations)? Personal genomics is still not cost effective at large scale. Mutations underlying most common diseases are thought to be rare or novel. We will need to screen thousands of individuals to detect clinically relevant mutations. Focus on screening functional sequences where mutations are interpretable (exons).

19 Genome reduction by array sequence capture. NimbleGen whole exome arrays: 2.1 M features; >60 bp probes. Target a defined set of functional sequences (exons, promoters, enhancers, etc.). Sequence a large number of individuals at relatively low cost.
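
On the analysis side, restricting attention to a defined target set amounts to an interval lookup: keep only the variant positions that fall inside a targeted region. The following is a minimal sketch for one chromosome, assuming non-overlapping half-open (start, end) intervals; the coordinates and the function name are hypothetical.

```python
from bisect import bisect_right

def filter_to_targets(positions, targets):
    """Keep positions that fall inside any target interval.

    targets: non-overlapping half-open (start, end) intervals on one chromosome.
    """
    targets = sorted(targets)
    starts = [start for start, _ in targets]
    kept = []
    for pos in positions:
        # Index of the last interval that starts at or before pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < targets[i][1]:
            kept.append(pos)
    return kept

exons = [(100, 200), (500, 650)]  # hypothetical target regions
variants = [50, 100, 199, 200, 510, 649, 650, 900]
print(filter_to_targets(variants, exons))  # [100, 199, 510, 649]
```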

20 Targeted resequencing of 12 exomes. Detecting mutations causing Freeman-Sheldon syndrome. Known mutations in MYH3. Ng et al. Nature 461:272 (2009)

21 Exome sequencing identifies mutations causing a Mendelian disorder. Miller syndrome: extremely rare; likely autosomal recessive inheritance. Sequenced exomes of 2 affected sibs plus 2 unrelated affected individuals. Ng et al. Nat Genet 42:30 (2010)
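
The study design on this slide suggests a simple filtering logic: discard variants already known from other sources, then, under a recessive model, look for genes that carry at least two novel variant alleles in every affected individual. The sketch below illustrates that logic with invented data; the gene names, variant identifiers, tuple layout and function name are all hypothetical, and the real analysis involves further steps such as functional annotation.

```python
from collections import Counter

def candidate_genes(calls_by_individual, known_variants, min_alleles=2):
    """Genes with >= min_alleles novel variant alleles in every individual.

    calls_by_individual: {individual: [(gene, variant_id, alt_allele_count), ...]}
    where alt_allele_count is 1 for heterozygous and 2 for homozygous calls.
    """
    per_individual = []
    for calls in calls_by_individual.values():
        alleles = Counter()
        for gene, variant_id, alt_allele_count in calls:
            if variant_id not in known_variants:  # drop previously catalogued variants
                alleles[gene] += alt_allele_count
        per_individual.append({g for g, n in alleles.items() if n >= min_alleles})
    return set.intersection(*per_individual) if per_individual else set()

calls = {
    "sib1": [("GENE_A", "a1", 1), ("GENE_A", "a2", 1), ("GENE_B", "b1", 2), ("GENE_C", "c1", 1)],
    "sib2": [("GENE_A", "a1", 1), ("GENE_A", "a2", 1), ("GENE_B", "b1", 2)],
    "case3": [("GENE_A", "a3", 2), ("GENE_B", "b9", 1)],
}
print(candidate_genes(calls, known_variants={"b1"}))  # {'GENE_A'}
```

Adding unrelated affected individuals shrinks the intersection quickly, which is why two sibs plus two unrelated cases can be enough to narrow the candidates.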

22 Exome sequencing identifies mutations causing a Mendelian disorder

23 Identification of a causative mutation in a patient with Charcot-Marie-Tooth neuropathy by whole-genome sequencing. Jim Lupski, Baylor College of Medicine. 29.9x using 50 bp SOLiD reads. Lupski et al. N Engl J Med 362:1181 (2010)

24 Identification of a causative mutation in a patient with Charcot-Marie-Tooth neuropathy by whole-genome sequencing

25 Evolutionary conservation can be used to prioritize variants. Figure labels: derived allele frequency; increasing conservation. Identify regulatory SNPs. Cooper et al. Nat Methods 7:251 (2010)

26 Analysis of genetic inheritance by whole-genome sequencing. Family quartet: two parents, two children (both with Miller syndrome). Identify transmitted alleles, recombination events and de novo mutations. Transmission information reduces error rates. Roach et al. Science (2010)
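
The use of transmission information can be sketched as a Mendelian consistency check: a child's genotype must be explainable by one allele from each parent, and sites where that fails are either candidate de novo mutations or sequencing errors. This is a minimal illustration for biallelic autosomal sites with made-up genotypes; the function names are hypothetical and the sketch does not model recombination.

```python
def is_mendelian_consistent(mother, father, child):
    """True if the child could have received one allele from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

def flag_sites(quartet_genotypes):
    """Return sites where any child is inconsistent with the parents.

    quartet_genotypes: {site: (mother, father, child1, child2)}, each genotype
    a pair of alleles such as ("A", "G").
    """
    flagged = []
    for site, (mother, father, *children) in quartet_genotypes.items():
        if not all(is_mendelian_consistent(mother, father, c) for c in children):
            flagged.append(site)
    return flagged

sites = {
    "site1": (("A", "A"), ("A", "G"), ("A", "G"), ("A", "A")),
    "site2": (("C", "C"), ("C", "C"), ("C", "T"), ("C", "C")),  # T appears in neither parent
    "site3": (("G", "T"), ("G", "T"), ("T", "T"), ("G", "G")),
}
print(flag_sites(sites))  # ['site2']
```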

27 De novo assemblies from short reads. SOAPdenovo short read assembler. Use paired-end reads of large fragments to assemble contigs and close gaps. Li et al. Genome Res. 20:265 (2010). Will be rendered obsolete by inexpensive massively parallel long-read technologies.
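
Short-read assemblers such as SOAPdenovo are built around a de Bruijn graph, in which each k-mer from the reads is an edge between its (k-1)-base prefix and suffix. The sketch below shows only the core idea on a toy example and is not SOAPdenovo's implementation: it assumes error-free reads from a single strand with no repeated (k-1)-mers, and it omits error correction, paired-end scaffolding and gap closure. The function name and the toy reads are hypothetical.

```python
from collections import defaultdict

def de_bruijn_contig(reads, k):
    """Assemble one contig by walking the unbranched path of a de Bruijn graph."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]  # (k-1)-mer nodes joined by this k-mer
            successors[left].add(right)
            predecessors[right].add(left)
    # Start at the node with no incoming edge (assumes a single linear path).
    starts = [node for node in successors if not predecessors.get(node)]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node = starts[0]
    contig = node
    visited = {node}
    # Extend while the path is unambiguous; stop at a branch, a dead end or a cycle.
    while len(successors.get(node, ())) == 1:
        (nxt,) = successors[node]
        if nxt in visited:
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig

# Three overlapping 7 bp reads from the toy sequence ATGGCGTGCAATC, in shuffled order.
reads = ["TGCAATC", "ATGGCGT", "GCGTGCA"]
print(de_bruijn_contig(reads, 4))  # ATGGCGTGCAATC
```

Repeats longer than k - 1 bases create branches in this graph, which is where the walk stops and where paired-end information is needed to go further.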

28 Conclusions. Personal genomics is a reality. Genome sequencing will be a routine diagnostic tool. $5,000 to sequence a single genome; current cost for clinical resequencing of single genes. Your genome will be sequenced. The challenge is in interpretation. Millions of SNPs in a single genome: clinical relevance. Noncoding variants: regulatory mutations, structural variants. Millions of genotypes correlated with physiological data for thousands of diseases. Massive data analysis effort required. Critical questions: Who will most benefit? How do we protect the data? How do we interpret variants of unknown/uncertain/marginal clinical significance? How do we communicate this to the public?