Understanding Human Variation

1 Understanding Human Variation
Fiona Cunningham, European Bioinformatics Institute, November 2012

2 Talk outline
Genetic variation: different types, origins
Why are all those variants important? Importance and practical applications
How is variation data discovered? Investigating genetic variation and progress over time
Ensembl and modern bioinformatics: building infrastructure for research, interpreting variants

3 The Reference Human Genome
Published 2001
Finished in 2004
Still incomplete

4 Every individual has a unique genome

5 [Slide: several kilobases of raw human genomic DNA sequence, beginning ACCCAATAGCAGAACAGCTACTGGAACTAAAATCC..., with a handful of single bases (A, C, T, G) set apart from the surrounding text.]

6 Single nucleotide variants
A single nucleotide variant is a change that happens at one position in the DNA sequence.
A single nucleotide polymorphism (SNP):
Person 1: TTCCCTA
Person 2: TTCCTTA
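The two-person example above can be checked mechanically. The following is a minimal sketch, assuming both sequences are already aligned and of equal length; the function name is illustrative and not part of the talk.

```python
def find_single_nucleotide_differences(seq_a, seq_b):
    """Return (position, base_a, base_b) for every mismatch; positions are 1-based."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length for a position-by-position comparison")
    return [
        (position, base_a, base_b)
        for position, (base_a, base_b) in enumerate(zip(seq_a, seq_b), start=1)
        if base_a != base_b
    ]


# The two sequences shown on the slide.
person_1 = "TTCCCTA"
person_2 = "TTCCTTA"

differences = find_single_nucleotide_differences(person_1, person_2)
print(differences)  # one mismatch at position 5: C in person 1, T in person 2
```

Real variant calling works from many sequencing reads per individual rather than one finished sequence each, so this only illustrates the definition.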

7 Other short variants
Dinucleotide substitution, insertion, deletion
[Diagram: short example sequences illustrating each type]

8 Large scale: >50 base pairs to megabases
Structural variants: large deletions, duplications, insertions, translocations
Copy number variants (CNVs): sequence repeated n times in an individual
[Diagram panels: deletion, duplication, insertion, translocation]

9 Origin of Variants
Appearance of new variants by mutation
Survival of alleles through early generations against the odds
Increase of the allele to a substantial population frequency
Fixation of the allele in populations
E.g. more copies of CCL3L1: HIV resistance
Germline variation: passed to descendants. Somatic mutation: not passed to descendants.

10 Talk outline
Genetic variation: different types, origins
Why are all those variants important? Importance and practical applications
Where did they all come from? Investigating genetic variation and progress over time
Ensembl and modern bioinformatics: building infrastructure for research, interpreting variants

11 Disease and differences
Variation: interesting for evolution, population migration and adaptation
Differences in phenotype: height, intelligence, body mass
Single variant disorders: sickle cell anaemia, cystic fibrosis
Complex disease: bipolar disorder, schizophrenia, Alzheimer's
Norovirus protection (homozygous for the alt allele of rs601338)
SV, copy number variation: gene dosage, too few or too many copies
  Lupus, autoimmune disease: too few copies of FCGR3B
  HIV infection resistance: more copies of CCL3L1
  Intellectual disorders

12 Practical applications of variation
Risk assessment: of radiation exposure, mutagenic chemicals and cancer-causing toxins
Molecular and clinical medicine
  Diagnosis, detection and treatment: e.g. myotonic dystrophy, fragile X syndrome, inherited colon cancer, Alzheimer's disease, and familial breast cancer
  Pharmacogenomics: "custom drugs"
Anthropology, evolution, and human migration
  Mutations, lineages, mitochondrial inheritance and Y chromosomes
  Comparative genomics: for understanding diseases and traits

13 Practical applications of variation
DNA forensics
  Identification of suspects
  Exonerating innocents
  Catastrophe victims
  Endangered species (against poachers)
Agriculture, livestock breeding
  Disease-, insect-, and drought-resistant crops
  Healthier, more productive, disease-resistant farm animals
  More nutritious produce
  Reducing the costs of agriculture

14 Talk outline
Genetic variation: different types, origins
Why are all those variants important? Importance and practical applications
How is variation data discovered? Investigating genetic variation and progress over time
Ensembl and modern bioinformatics: building infrastructure for research, interpreting variants

15 Mendel ( )
"Father of genetics" for his study of the inheritance of traits in pea plants
Published the results of the inheritance of "factors" in pea plants
Patterns in pea traits explained by inherited factors

16 SNP Consortium (TSC)
1999: private/public collaboration
Share costs to produce a public resource of single nucleotide polymorphisms (SNPs)
Goal: discover SNPs in two years
Result: 1.4 million SNPs by people representing several races

17 Location by mapping flanking sequence
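One way to picture locating a variant by its flanking sequence is an exact search of the reference for the left flank, one variable base, and the right flank. The sketch below is only an illustration under that assumption: the reference string and flanks are made up, and production pipelines use aligners that tolerate mismatches rather than exact matching.

```python
def locate_variant(reference, left_flank, right_flank):
    """Return 1-based reference positions of the base sitting between the two flanks.

    A position is reported wherever left_flank, any single base, and
    right_flank occur consecutively in the reference (exact match on the flanks).
    """
    span = len(left_flank) + 1 + len(right_flank)
    positions = []
    for start in range(len(reference) - span + 1):
        window = reference[start:start + span]
        if window.startswith(left_flank) and window.endswith(right_flank):
            positions.append(start + len(left_flank) + 1)
    return positions


# Hypothetical reference and flanks, chosen only for this example.
reference = "GATTACATTCCCTAGGCATCG"
hits = locate_variant(reference, left_flank="TTCC", right_flank="TA")
print(hits)
```

A single hit means the flanks place the variant unambiguously; several hits would mean the flanking sequence is too short or repetitive to map uniquely.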

18 Genome Sequencing
Human Genome Project: 13-year project
2001: Human genome working drafts
Data unit of approximately 10x coverage of human: 10 years and cost about $3 billion
Olympics 2012: $19 billion

19 Finding all human SNPs: HapMap Project
Goal: find all SNPs present across different populations ("all" means present at a frequency of at least 5%)
http://hapmap.ncbi.nlm.nih.gov/
3 major populations
Alleles and frequencies
Tag variants

20 Haplotypes and LD
A haplotype can be thought of as a collection of alleles.
LD (linkage disequilibrium): a measure of how likely two alleles are to be inherited together
Important project. Still very highly regarded today.
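The slide does not say which LD statistic is meant. Two commonly used ones are D (the excess of the observed haplotype frequency over what independence would predict) and r squared. The sketch below computes both from counts of the four two-locus haplotypes; the counts and the allele labels A/a and B/b are hypothetical.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) from counts of the four haplotypes at two biallelic loci."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total            # observed frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / total    # frequency of allele B at the second locus
    d = p_AB - p_A * p_B           # zero when the alleles associate at random
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denominator if denominator > 0 else float("nan")
    return d, r_squared


# Hypothetical counts: A always travels with B, a always with b.
d_linked, r2_linked = linkage_disequilibrium(50, 0, 0, 50)

# Hypothetical counts: all four haplotypes equally common.
d_free, r2_free = linkage_disequilibrium(25, 25, 25, 25)

print(d_linked, r2_linked)
print(d_free, r2_free)
```

An r squared near 1 is what makes a tag variant useful: genotyping one SNP then predicts the allele at the other.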

21 Association studies
Use common SNPs to understand common disease
Diseases: diabetes, Crohn's disease, breast cancer, coronary artery disease, bipolar disorder, hypertension, multiple sclerosis
Genome Wide Association Studies (GWAS), e.g. WTCCC 2005
Gather phenotypes
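At a single SNP, an association test can be as simple as comparing allele counts between cases and controls. The sketch below shows an allelic odds ratio and a 2x2 chi-square statistic; this is one textbook formulation, not necessarily the method used by WTCCC, and all counts are invented for illustration.

```python
def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Return (odds_ratio, chi_square) for a 2x2 table of allele counts.

    Rows are cases and controls; columns are alt and ref allele counts.
    The chi-square statistic has one degree of freedom and no continuity correction.
    """
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    total = case_alt + case_ref + control_alt + control_ref
    cross_difference = case_alt * control_ref - case_ref * control_alt
    chi_square = total * cross_difference ** 2 / (
        (case_alt + case_ref)
        * (control_alt + control_ref)
        * (case_alt + control_alt)
        * (case_ref + control_ref)
    )
    return odds_ratio, chi_square


# Hypothetical allele counts at one SNP.
odds_ratio, chi_square = allelic_association(
    case_alt=60, case_ref=40, control_alt=40, control_ref=60
)
print(odds_ratio, chi_square)
```

A genome-wide study repeats a test like this at every genotyped SNP, which is why the significance thresholds used are far stricter than for a single test.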

22 Finding all human SNPs: 1000 Genomes Project (genome sequencing)
2008: world-wide capacity dramatically increasing
Goal: find genetic variants with frequencies of >1%
In 3 weeks, data double that of the past 13 years
Lactose tolerance
[Chart axes: Disks (TB) against Year]
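The frequency thresholds quoted for HapMap (at least 5%) and 1000 Genomes (>1%) refer to how common an allele is in the sampled population. A minimal sketch of that calculation follows, assuming diploid genotypes coded as the number of alternative alleles (0, 1 or 2) per person; the genotype list is hypothetical.

```python
HAPMAP_MIN_FREQUENCY = 0.05            # "at least 5%", from the HapMap slide
THOUSAND_GENOMES_MIN_FREQUENCY = 0.01  # ">1%", from the 1000 Genomes slide


def alt_allele_frequency(genotypes):
    """Fraction of alt alleles among all chromosomes; each person carries two copies."""
    return sum(genotypes) / (2 * len(genotypes))


# Hypothetical sample of 100 people: 97 carry no alt allele, 2 carry one, 1 carries two.
genotypes = [0] * 97 + [1] * 2 + [2] * 1
frequency = alt_allele_frequency(genotypes)

in_hapmap_scope = frequency >= HAPMAP_MIN_FREQUENCY
in_1000_genomes_scope = frequency > THOUSAND_GENOMES_MIN_FREQUENCY
print(frequency, in_hapmap_scope, in_1000_genomes_scope)
```

With these made-up genotypes the variant would fall below the HapMap cutoff but within the 1000 Genomes target range, which illustrates why the later project needed deeper sampling.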

23 1000 Genomes Populations
[Map labels, population code followed by description:]
AJM African; CEU Northern and Western European; CHD Chinese; MEX Mexican; PEL Peruvian; GIH Gujarati
PUR Puerto Rican; CLM Colombian; ASW African; ACB Barbados; GWD The Gambia; GBR British
IBS Spain; YRI Yoruba; TSI Toscani; FIN Finnish; LWK Luhya; MKK Maasai
CDX Chinese Dai; PJL Pakistani; CHB Han Chinese; JPT Japanese; CHS Han (South); KHV Vietnam

24 Today
2012: Every 14 minutes ( 4000) 600 exome
Rare disease: 1 in 17 people in the UK. There are over 6,000 recognised rare diseases.
Ongoing projects:
  DDD: Deciphering Developmental Disorders
  UK10K: 6000 cases, 4000 controls

25 Challenges: for EBI and our users
[Image labels: Scientist; Sequencing machine; Timothy K. Stanton]

26 Talk outline
Genetic variation: different types, origins
Why are all those variants important? Importance and practical applications
Where did they all come from? Investigating genetic variation and progress over time
Ensembl and modern bioinformatics: building infrastructure for research, interpreting variants

27 EBI
EBI's mission: to provide freely available data and bioinformatics services to all facets of the scientific community to promote scientific progress
The world's most comprehensive collection of molecular databases: from DNA and protein sequence to complex pathways and networks
Integration and community engagement are at the heart of these efforts

28 Genome-wide data from Ensembl
[Diagram labels: pick a genome; chromosomes; genes; gene regulation; across species; genomic alignments; synteny; orthology; within species; SNPs]
Ensembl's mission: to enable genomic science

29 Species with variation data in Ensembl

30 Data access - variants on the genome

31 Data access - variants per protein

32 Ensembl Variation

33 Variation annotation: phenotype data

34 Consequence Types
[Transcript diagram (ENST), 5' to 3': regulatory; 5' upstream; 5' UTR; ATG; coding (synonymous, non-synonymous); splice sites; intronic; 3' UTR; AAAAAAA; 3' downstream]
A SNP can be in an exon in some transcripts, and in an intron in another.
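The point that one SNP can be exonic in one transcript and intronic in another follows directly from transcripts of the same gene using different exons. A small sketch, with entirely hypothetical transcript names and coordinates:

```python
def region_in_transcript(position, exons):
    """Classify a genomic position against one transcript.

    exons is a list of (start, end) pairs, inclusive, sorted by start.
    """
    transcript_start = exons[0][0]
    transcript_end = exons[-1][1]
    if position < transcript_start or position > transcript_end:
        return "outside transcript"
    for start, end in exons:
        if start <= position <= end:
            return "exon"
    return "intron"  # inside the transcript span but between exons


# Two hypothetical transcripts of the same gene; transcript_B skips the middle exon.
transcripts = {
    "transcript_A": [(100, 200), (300, 400), (500, 600)],
    "transcript_B": [(100, 200), (500, 600)],
}

snp_position = 350
per_transcript = {
    name: region_in_transcript(snp_position, exons)
    for name, exons in transcripts.items()
}
print(per_transcript)
```

This is why consequence annotation is reported per transcript rather than once per variant.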

35 Consequences of variants in the protein-coding sequence
GAG > GAA, Glu > Glu: synonymous (silent), no change in amino acid
GAG > GGG, Glu > Gly: non-synonymous (missense), change in amino acid
GAG > TAG, Glu > STOP: stop gain (nonsense), introduces a stop codon
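The three codon changes above can be classified by translating the reference and altered codons with the standard genetic code. This is a simplified sketch for single-codon substitutions only; it is not how the Variant Effect Predictor is implemented, and the function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Name the protein-level consequence of replacing one codon with another."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"
    if ref_aa == "*":
        return "stop lost"
    return "missense"


# The three examples from the slide.
results = {alt: classify_codon_change("GAG", alt) for alt in ("GAA", "GGG", "TAG")}
print(results)
```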

36 Added more detailed terms
[Terms shown on a 5' to 3' transcript diagram:]
regulatory region; TF binding site; intergenic; upstream; 5 prime UTR; splice donor; initiator codon; splice acceptor; synonymous variant; missense variant; inframe insertion; inframe deletion; stop gained; frameshift variant; coding sequence variant; splice region; intron variant; stop lost; stop retained variant; incomplete terminal codon; 3 prime UTR; downstream

37 Data access - variation annotation
[Diagram labels: Ref; Reads (A A C A C A); SNP?; structural variants; in-dels; GWAS]

38 Interpretation of variants
Interpretation of variants is key
Ensembl is well placed for doing this, with contributions from all:
  High-quality evidence-based gene build
  Multiple alignments
  Regulatory information
  Variation and phenotype information
VEP for all types of variation
  Good support
  Fast script version
  REST API
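As a rough illustration of the REST route to VEP, the sketch below queries the public Ensembl REST server for a variant identifier. It assumes the current server address (https://rest.ensembl.org) and the `vep/:species/id/:id` endpoint, and that each returned record carries `input` and `most_severe_consequence` fields; the server in use when this talk was given may have differed, so check the Ensembl REST documentation before relying on any of these names.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"  # assumed current public server


def vep_url_for_id(variant_id, species="human"):
    """Build the VEP-by-identifier URL for a variant such as an rsID."""
    return "{}/vep/{}/id/{}".format(
        REST_SERVER, urllib.parse.quote(species), urllib.parse.quote(variant_id)
    )


def fetch_vep(variant_id, species="human", timeout=30):
    """Request VEP annotation as JSON. Needs network access; not called on import."""
    request = urllib.request.Request(
        vep_url_for_id(variant_id, species),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def most_severe(records):
    """Pull (input, most severe consequence) out of each returned record."""
    return [
        (record.get("input"), record.get("most_severe_consequence"))
        for record in records
    ]


if __name__ == "__main__":
    # rs601338 is the variant mentioned on the "Disease and differences" slide.
    print(most_severe(fetch_vep("rs601338")))
```

For large batches of variants the "fast script version" listed on the slide is the more practical route, since it avoids one HTTP request per variant.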

39 Variant Effect Predictor

40 Variant Effect Predictor

41 Summary
Importance of variants: their roles in disease and phenotype differences
Classes of variants
  Short (single nucleotide) variants: SNPs, indels
  Structural variants
Effects of variants: non-synonymous, stop lost etc.
Source of variants: dbSNP, mutation databases
Big projects: 1000 Genomes, HapMap
Bioinformatics infrastructure projects: Ensembl
