Genomics: Human variation

1 Genomics: Human variation
Lecture 1: Introduction to Human Variation
Dr Colleen J. Saunders, PhD
South African National Bioinformatics Institute/MRC Unit for Bioinformatics Capacity Development, University of the Western Cape, South Africa
Introduction to Bioinformatics online course: IBT

2 Learning Objectives
At the end of this lecture you will:
- Have an understanding of the major projects undertaken to document human variation
- Have an overview of the different types of genetic variation

3 Human Variation
- Human genomes are ~99.5% similar across all individuals
- Variations arise through mutation and are maintained by natural selection or neutrality
- Variants range from large karyotype differences to single base pair (bp) changes
- Karyotype = number & appearance of chromosomes in the nucleus
- Human genome = ~3.2 billion base pairs
- Variants occur roughly once every 0.3-1 kb
- 5-10 million variants within an individual genome compared to another
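The genome size and variant spacing quoted on this slide can be cross-checked with simple arithmetic. The sketch below assumes variants are evenly spaced along the genome, which is a simplification, and the function name is purely illustrative.

```python
GENOME_SIZE_BP = 3_200_000_000  # ~3.2 billion bp, as quoted on the slide


def expected_variants(genome_size_bp, spacing_bp):
    """Number of variant sites if one variant occurs every `spacing_bp` bases."""
    return genome_size_bp / spacing_bp


sparse = expected_variants(GENOME_SIZE_BP, 1000)  # one variant per 1 kb
dense = expected_variants(GENOME_SIZE_BP, 300)    # one variant per 0.3 kb
print(f"{sparse / 1e6:.1f} to {dense / 1e6:.1f} million variants")
# prints: 3.2 to 10.7 million variants
```

The range is of the same order as the 5-10 million differences between two individual genomes quoted on the slide.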

4 Cataloging Human Variation The Human Genome Project

5 Human Variation
Single Nucleotide Polymorphisms (SNPs):
- Most common form of variation
- Substitution of a single nucleotide for another (A, C, T, G)
Insertions & Deletions (INDELs):
- Small indels of 1-2 bp
Repetition of nucleotide patterns = Variable Number Tandem Repeats (VNTRs):
- Minisatellites (10-100 bp)
- Microsatellites / Simple Tandem Repeats (2-6 bp)
Copy Number Variations (CNVs):
- Deletion or duplication of larger regions of a chromosome
- Gene dosage

6 Single Nucleotide Polymorphisms
- Single base change occurring at a frequency >1% in 1 population
- <1% = mutations / rare SNPs
- Be careful: disease-causing mutations may occur at higher frequencies!
- SNPs occur less frequently in exons
- ~50% of exonic SNPs are non-synonymous
- Many SNPs do not confer any functional change ("silent")
- Others may affect the protein amino acid (AA) sequence, regulation/expression, or mRNA stability

7 Single Nucleotide Polymorphisms
CODING SNPs: occur in coding regions of the gene (exons)
- SYNONYMOUS: no change in amino acid; may alter mRNA stability
- NON-SYNONYMOUS/MISSENSE: change the amino acid sequence of the protein
- NONSENSE: introduce a stop codon
- INDELS: disrupt the codon sequence
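The coding SNP categories on this slide can be illustrated by translating a codon before and after a single base change. This is a minimal sketch using the standard genetic code; the function name and the "stop-loss" label for a change that removes a stop codon are assumptions made for illustration, and real annotation tools also account for transcripts, strand and reading frame.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, offset, alt_base):
    """Classify a single-base substitution at `offset` (0-2) within a codon."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # substitution creates a stop codon
    if ref_aa == "*":
        return "stop-loss"  # substitution removes a stop codon
    return "missense"


print(classify_coding_snp("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_coding_snp("GAG", 1, "T"))  # GAG (Glu) -> GTG (Val)
print(classify_coding_snp("TGG", 2, "A"))  # TGG (Trp) -> TGA (stop)
```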

8 IUPAC Ambiguity codes for SNPs
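The IUPAC nucleotide ambiguity codes can be written down as a simple lookup table. The sketch below lists the standard codes and adds a hypothetical helper, `ambiguity_code`, that returns the single-letter code for the alleles observed at a SNP.

```python
# Standard IUPAC nucleotide codes: code -> set of bases it stands for
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Reverse lookup: set of bases -> code
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def ambiguity_code(*alleles):
    """Return the IUPAC code that represents the given alleles."""
    return BASES_TO_CODE[frozenset(a.upper() for a in alleles)]


print(ambiguity_code("A", "G"))  # an A/G SNP is written as R
print(ambiguity_code("C", "T"))  # a C/T SNP is written as Y
```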

9 Single Nucleotide Polymorphisms
NON-CODING SNPs: occur in regulatory/intronic/intergenic regions
- Many are silent in terms of current knowledge!
- May alter transcription: located in promoter, silencer or enhancer regions, or in transcription factor binding sites
- May alter mRNA stability & folding & affect expression

10 Working with variant data
HYPOTHESIS DRIVEN variant prioritization:
- Candidate gene association study
- Candidate variant of interest
- Small(er) numbers of variants
HYPOTHESIS FREE variant prioritization:
- NGS or GWAS studies
- Investigate large numbers of variants

11 For more information
- COPY NUMBER VARIANTS: Database of Genomic Variants, a curated catalogue of human genomic structural variation
- AMERICAN SOCIETY OF HUMAN GENETICS: education resources on human variation
- ENSEMBL: human variation help page

12 Genomics: Human variation
Lecture 2: Linkage Disequilibrium

13 Learning Objectives
At the end of this lecture you will:
- Have an understanding of linkage disequilibrium and haplotypes

14 Population Genetics
- Random mating & recombination should ensure mutations spread in the population
- Recombination events generate new arrangements for ancestral alleles
- Alleles at neighbouring loci tend to cosegregate, which may reflect ancestral combinations (haplotypes)
- Linkage Disequilibrium (LD) = non-random association of alleles at different loci

15 Population Genetics
Be careful to distinguish between Linkage and LD!
- Linkage is focused on 1 particular locus and recombination in the last 2-3 generations
- LD is focused on particular alleles at a locus and recombination over a much longer period of time

16 Measuring LD
- D measures the difference between the frequency at which alleles at different loci are inherited together, and the frequency at which we expect to observe those alleles together if they are in equilibrium
- If D differs significantly from 0, the loci are in LD
- The magnitude of D increases with tighter linkage between the loci
- D' is the absolute ratio of D to its maximum value, D_max
- D' = 1 indicates complete LD
- Recombination over time causes the decay of D' towards 0
- r² is the measurement of correlation between a pair of loci
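The three LD measures can be computed directly from haplotype and allele frequencies. The sketch below uses the usual textbook definitions for two biallelic loci: D = p(AB) - p(A)p(B), D' = |D| / D_max, and r² = D² / (p(A)(1 - p(A))p(B)(1 - p(B))). The function name and the example frequencies are illustrative, not taken from the lecture.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for allele A at locus 1 and allele B at locus 2.

    p_ab is the frequency of the A-B haplotype; p_a and p_b are the
    frequencies of alleles A and B.
    """
    d = p_ab - p_a * p_b
    # D_max depends on the sign of D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Alleles always travel together: complete LD by both measures
print(ld_stats(p_ab=0.5, p_a=0.5, p_b=0.5))
# No association: observed haplotype frequency equals the expected one
print(ld_stats(p_ab=0.25, p_a=0.5, p_b=0.5))
# B only ever occurs with A, but A is much more common than B:
# D' is at its maximum while r^2 stays low
print(ld_stats(p_ab=0.2, p_a=0.6, p_b=0.2))
```

The last case shows why D' and r² are reported side by side: D' can equal 1 even when one allele is a poor predictor of the other.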

17 Linkage disequilibrium varies throughout the human genome
- Regions of high LD are interspersed with regions of very low LD
- Understanding LD structure results in cost saving for association studies
- Tight LD means knowing which allele occurs at 1 locus can tell us which allele occurs at the other locus: the basis of GWAS
- High D' means that variants are good surrogates for each other
- D' estimates may be inflated in small samples & if an allele at 1 locus is rare

18 Haplotypes & Tag SNPs
HAPLOTYPES: blocks of sequence along a chromosome where no recombination occurs
- Blocks of closely linked alleles that are inherited together
- All pairs of SNPs within 1 haplotype block are in high LD
TAG SNP: SNP that is representative of other SNPs in a haplotype block
- Can be used to infer the allele present at other loci within the block
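One simple way to pick tag SNPs is a greedy search over pairwise r² values: repeatedly choose the SNP that captures the most not-yet-tagged SNPs above a threshold. The sketch below is an illustration of that idea only, not the algorithm of any particular tool; the SNP names, r² values and the 0.8 threshold are made up for the example.

```python
def greedy_tag_snps(snps, r2, threshold=0.8):
    """Pick tag SNPs greedily.

    snps: list of SNP names.
    r2: dict mapping (snp_a, snp_b) pairs to pairwise r^2; missing pairs count as 0.
    Returns a dict mapping each chosen tag SNP to the SNPs it captures.
    """
    def captures(a, b):
        if a == b:
            return True
        return r2.get((a, b), r2.get((b, a), 0.0)) >= threshold

    untagged = set(snps)
    tags = {}
    while untagged:
        # The SNP that captures the most remaining SNPs; ties go to the
        # earliest SNP in the input list
        best = max(snps, key=lambda s: sum(captures(s, u) for u in untagged))
        tags[best] = sorted(u for u in untagged if captures(best, u))
        untagged -= set(tags[best])
    return tags


# Hypothetical block: snp1-snp3 are in high LD, snp4 is independent
snps = ["snp1", "snp2", "snp3", "snp4"]
r2 = {
    ("snp1", "snp2"): 0.95,
    ("snp1", "snp3"): 0.90,
    ("snp2", "snp3"): 0.85,
    ("snp3", "snp4"): 0.20,
}
print(greedy_tag_snps(snps, r2))
# {'snp1': ['snp1', 'snp2', 'snp3'], 'snp4': ['snp4']}
```

Here two tag SNPs stand in for four, which is the cost saving the slide refers to.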

19 International HapMap Project
- Catalogue of common patterns of genetic variation in humans
- Map of haplotype blocks and the tag SNPs that identify the haplotypes
- Reduces the no. of SNPs required to examine the entire genome from ~10 million SNPs to a much smaller set of tag SNPs
- Easier & more cost-effective to find disease-associated genes & regions
- Common haplotypes occur in all populations but at different frequencies
- 270 individuals from CEPH, Han-Chinese, Japanese & Yoruba populations genotyped for 6 million SNPs
- Beware: by focusing on common variants, may miss rare disease-associated variants

20 Tag SNPs Tagger: tool for selection and evaluation of tag SNPs from genotype data Tagger server:

21 Haploview
Haploview: tool designed for haplotype analysis
- LD & haplotype block analysis
- Estimate haplotype frequency in a population
- SNP and haplotype association tests
- Tagger tag SNP selection
- Download phased genotype data from HapMap

22 Recommended reading
- Haplotype block definition and its application. Zhu et al, 2004
- International HapMap Project

23 Genomics: Human variation
Lecture 3: Variant Call Format Files

24 Learning Objectives
At the end of this lecture you will:
- Be familiar with the variant call format (.vcf) file

25 VARIANT CALL FORMAT
The Variant Call Format Specification: VCFv4.3 & BCFv2.2
- Text file containing sequence variation data
- Meta-information lines (preceded by ##)
- Header line
- Variation data for a particular position in rows
- May include genotype information for samples
- BCF = binary, compressed format for large VCFs

26 VARIANT CALL FORMAT
[Figure: example VCF file with the META-DATA, HEADER LINE and VARIANT INFORMATION sections labelled. Source: The Variant Call Format Specification: VCFv4.3 & BCFv2.2]

27 VARIANT CALL FORMAT
META-INFORMATION:
- Starts with ##
- Key=value pairs
- Fileformat line is always required (line 1): ##fileformat=VCFv4.3
- Other meta-data is optional (highly recommended!)
- Describes the information contained in the file
- E.g. FILTER will describe quality filters applied to the data
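Meta-information lines are simple enough to parse by hand, which is a useful way to see the key=value structure. The sketch below is a minimal illustration, not a full VCF parser; `parse_meta_line` is a hypothetical helper and the FILTER line is an invented example.

```python
import re


def parse_meta_line(line):
    """Split a '##key=value' meta-information line into (key, value).

    Structured values such as <ID=...,Description="..."> are returned as a dict.
    """
    if not line.startswith("##"):
        raise ValueError("meta-information lines must start with ##")
    key, _, value = line[2:].rstrip("\n").partition("=")
    if value.startswith("<") and value.endswith(">"):
        # Quoted values may contain commas, so match them before bare values
        pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', value[1:-1])
        value = {k: v.strip('"') for k, v in pairs}
    return key, value


print(parse_meta_line("##fileformat=VCFv4.3"))
print(parse_meta_line('##FILTER=<ID=q10,Description="Quality below 10">'))
```

For real work a dedicated VCF library is a safer choice than hand-written parsing.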

28 VARIANT CALL FORMAT
HEADER LINE: always contains the same fields in positions 1-8:
#CHROM POS ID REF ALT QUAL FILTER INFO
- FORMAT (if genotypes given)
- SAMPLE IDs (if genotypes given)

29 VARIANT CALL FORMAT
VARIANT INFORMATION: rows indicate variant information for a chromosome position
- Missing data is indicated with a "."
- CHROM = chromosome in reference genome
- POS = position of the variant in the reference genome
- ID = variant identifier (preferably dbSNP)
- REF = reference base/s (allele)
- ALT = alternate base/s (allele). Can be multiple.
- QUAL = Phred-scaled quality score for the variant call
- FILTER = PASS indicates this call has passed all filters
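The eight fixed columns of a variant row can be split on tabs and converted to convenient types. This is a minimal sketch with a hypothetical helper name; the example row is illustrative and the treatment of "." as missing follows the slide.

```python
FIXED_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_fixed_fields(line):
    """Parse the first eight tab-separated columns of a VCF data line."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_FIELDS, columns[:8]))
    record["POS"] = int(record["POS"])
    # "." marks missing data
    record["ID"] = None if record["ID"] == "." else record["ID"]
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    # ALT can hold several comma-separated alleles
    record["ALT"] = [] if record["ALT"] == "." else record["ALT"].split(",")
    return record


line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
record = parse_fixed_fields(line)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"], record["QUAL"])
```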

30 VARIANT CALL FORMAT
VARIANT INFORMATION:
INFO = additional information
- Multiple fields separated by ;
- Sub-fields listed in meta-data
- AA = ancestral allele
- AC = allele count in genotypes
- AD = read depths for each allele
- BQ = base quality at this position
- DP = combined depth across samples
- Etc
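The INFO column can be unpacked by splitting on ";" and then on "=". In the sketch below, keys without a value are treated as flags and all values are kept as lists of strings; applying the types declared in the meta-information is left out, and `parse_info` is a hypothetical helper.

```python
def parse_info(info):
    """Turn an INFO string such as 'DP=14;AF=0.5;DB' into a dict."""
    if info == ".":
        return {}  # "." means no INFO fields are present
    parsed = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Flags have no '=' and are stored as True; values may be comma-separated
        parsed[key] = value.split(",") if sep else True
    return parsed


print(parse_info("NS=3;DP=14;AF=0.5;DB;H2"))
```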

31 VARIANT CALL FORMAT
VARIANT INFORMATION: if genotype data is reported:
FORMAT = specifies type & order of genotype data for each sample
- Sub-fields separated by :
- GT = genotype (| = phased; / = unphased)
- AD = per-sample read depth for each allele
- DP = per-sample read depth at this position
- MQ = RMS mapping quality
- Followed by 1 data block per sample (SAMPLE IDs)
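The FORMAT column tells us how to read each sample block, and the GT sub-field uses allele indices (0 = REF, 1 = first ALT, and so on) separated by "|" for phased or "/" for unphased genotypes. The sketch below shows both steps; the helper names and the example values are illustrative.

```python
import re


def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with the values in one sample block."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def decode_genotype(gt, ref, alts):
    """Translate a GT string into allele sequences plus a phased flag."""
    alleles = [ref] + list(alts)
    phased = "|" in gt
    calls = [
        None if index == "." else alleles[int(index)]  # "." = missing call
        for index in re.split(r"[|/]", gt)
    ]
    return calls, phased


sample = parse_sample("GT:GQ:DP", "1|0:48:8")
print(sample)
print(decode_genotype(sample["GT"], ref="G", alts=["A"]))
print(decode_genotype("1/1", ref="G", alts=["A"]))
```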

32 Recommended reading The Variant Call Format Specification: VCFv4.3 & BCFv2.2 October

33 Genomics: Human variation
Lecture 4: Variant prioritisation (part 1)

34 Learning Objectives
At the end of this lecture you will:
- Understand the basic principles of variant prioritisation
- Have an overview of the available human variation databases

35 Variant Prioritisation
- NGS pipelines generate large .vcf files
- WGS experiment yields ~1-1.5 million variants per sample
- WES ~
- How do we filter these to identify those most likely to affect protein function or expression?
- How can those variants be further filtered to identify the one(s) likely to cause this disease that are good candidates for further investigation?

36 Variant Prioritisation
Questions a biologist/clinician might ask:
- What is the frequency of the variant in the general population? In this specific population?
- What part of the gene is it in?
- Does it affect gene function?
- Is it in a gene known to be involved in the disease? A related disease/phenotype?
- Is it in a genome region statistically implicated in the disease?
- Is it involved with a function/pathway that coincides with the disease pathology?
- Etc

37 Variant Prioritisation
Variant level:
- Remove common(?) variants
- Variants that change the amino acid
- Variants that have a functional effect
Gene level:
- SNPs in biologically plausible candidate genes

38 Variant Prioritisation
STEP 1: FREQUENCY INFORMATION
- dbSNP: summary of allele frequency across many different datasets including 1000 Genomes, HapMap, HGP, ExAC, ESP6500

39 Variant Prioritisation
STEP 1: FREQUENCY INFORMATION
- ESP: exomes from unrelated individuals; focus is heart, lung & blood disorders
- ExAC: exomes from unrelated individuals

40 Variant Prioritisation
STEP 2: GENOMIC CONTEXT
- Allele information
- Sequence change information
- Gene region & function information

41 Variant Prioritisation STEP 2: GENOMIC CONTEXT

42 Variant Prioritisation
STEP 3: FUNCTIONAL PREDICTION
SIFT:
- Coding variants only
- Variant with score <0.05 is predicted as deleterious
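Applying the SIFT cut-off is a one-line comparison. The sketch below follows the <0.05 threshold given on the slide; how a score of exactly 0.05 is handled is an assumption here, and the function name is illustrative.

```python
SIFT_CUTOFF = 0.05  # threshold quoted on the slide


def sift_call(score, cutoff=SIFT_CUTOFF):
    """Label a SIFT score: low scores are predicted deleterious."""
    return "deleterious" if score < cutoff else "tolerated"


for score in (0.01, 0.30):
    print(score, sift_call(score))
```

Note that the scale runs in the opposite direction to many other predictors: a lower SIFT score means a more damaging prediction.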

43 Variant Prioritisation
STEP 3: FUNCTIONAL PREDICTION
PolyPhen-2:
- Coding, non-synonymous SNPs only
- HVAR (HumVar): for diagnostics of Mendelian diseases
- HDIV (HumDiv): used when evaluating alleles in complex phenotypes

44 Variant Prioritisation
STEP 3: FUNCTIONAL PREDICTION
FATHMM:
- Separate coding or non-coding algorithm
- Score: damaging (D) or tolerated (T)

45 Variant Prioritisation
STEP 3: FUNCTIONAL PREDICTION
RegulomeDB:
- Non-coding variants
- Identifies DNA features and regulatory elements such as transcription factor binding sites

46 Variant Prioritisation
STEP 3: FUNCTIONAL PREDICTION
- Wiki investigating human genetics
- Variant centered information
- Links to peer-reviewed publications
- Cross-referenced to other databases

47 Variant Prioritisation
STEP 3: CLINICAL CONSEQUENCE
ClinVar:
- Links genomic variation to human health phenotypes
- Levels of supporting evidence vary but are indicated
- Content is not curated

48 Variant Prioritisation
STEP 3: CLINICAL CONSEQUENCE
Catalogue Of Somatic Mutations In Cancer (COSMIC):
- Curated input from peer reviewed publications
- Genome wide screen data

49 Variant Prioritisation STEP 3: PHARMACOGENOMICS

50 Variant Prioritisation
STEP 4: GENE PRIORITIZATION
- Gene centered information
- Relationships between genotype & phenotype

51 Variant Prioritisation
STEP 4: GENE PRIORITIZATION
- Searchable database of published miRNA sequences & annotation
- Includes link-outs to databases predicting targets

52 Variant Prioritisation
STEP 4: GENE PRIORITIZATION
- Manually curated pathway maps representing molecular interaction & reaction networks
- Explore by pathway or by gene

53 Variant Prioritisation
STEP 4: GENE PRIORITIZATION
Gene knockout models:
- The Mouse Genome Database (MGD) and The Rat Genome Database (RGD)
- Very often provide the missing link that solves strange genetic disease cases

54 For new molecular biology databases and updates, check out the annual Nucleic Acids Research Database Issue (January each year)

55 Genomics: Human variation
Lecture 5: Variant prioritisation (part 2)

56 Learning Objectives
At the end of this lecture you will:
- Have an overview of the available human variation functional prediction tools

57 Variant Prioritisation

58 Variant Prioritisation
- Options to work with GRCh37 co-ordinates
- Number of different input options
- Can customize the output

59 Variant Prioritisation

60 Variant Prioritisation
- Easy to use
- Attractive interface
- Customizable
- Output files are easy to manipulate
- Lots of support: cs/tools/vep/script/vep_tutorial.html

61 Variant Prioritisation
- Command line tool
- Easy to use
- Output is customizable
- Lots of support: GATK forums

62 Variant Prioritisation
- Command line tool written in Perl
- Extensive documentation and tutorials
- Updated regularly
- Gene-based, Region-based & Filter-based annotation
- Output is customizable

63 Variant Prioritisation
Output is easy to manipulate using command line (large files) or Excel (small variant sets)
RefSeq annotation:
- Genomic context & gene detail
Region based annotation:
- In a region implicated in GWAS, ENCODE regions, TF binding sites, located in enhancer/repressor elements
Filter based annotation:
- dbSNP identifiers
- Allele frequencies
- Functional prediction
- Conservation scores
- Clinical significance

64 Variant Prioritisation
There's a user-friendly web application!

65 Variant Prioritisation
QUALITY:
- Low quality variant calls are likely to be sequencing errors
- Filter out low quality variants indicated by QUAL score in .vcf
INHERITANCE PATTERN:
- Dependent on study design
- Filter on inheritance pattern in clinical NGS experiments
MINOR ALLELE FREQUENCY:
- In Mendelian or rare diseases, we are looking for rare variants
- Filtering on minor allele frequency drastically reduces the data set
- Frequency cut-offs are study dependent
- Rare diseases: 1% is a good cut-off
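The quality and frequency filters can be sketched as a small function. A Phred-scaled QUAL of Q corresponds to an error probability of 10^(-Q/10). The QUAL threshold of 30, the dictionary layout and the choice to keep variants with no known frequency are all assumptions for this illustration; the 1% MAF cut-off is the one suggested on the slide for rare diseases.

```python
def phred_to_error_probability(qual):
    """Convert a Phred-scaled quality score to the probability the call is wrong."""
    return 10 ** (-qual / 10)


def passes_filters(variant, min_qual=30.0, max_maf=0.01):
    """Keep confidently called variants that are rare in the population.

    `variant` is a dict with 'QUAL' and 'MAF' keys; None means missing.
    """
    qual, maf = variant.get("QUAL"), variant.get("MAF")
    if qual is None or qual < min_qual:
        return False
    # Variants absent from the frequency database are kept as potentially rare
    return maf is None or maf <= max_maf


# Hypothetical variants
variants = [
    {"id": "v1", "QUAL": 50.0, "MAF": 0.002},  # good call, rare: keep
    {"id": "v2", "QUAL": 12.0, "MAF": 0.001},  # low quality: drop
    {"id": "v3", "QUAL": 99.0, "MAF": 0.25},   # common: drop
    {"id": "v4", "QUAL": 45.0, "MAF": None},   # novel: keep
]
print([v["id"] for v in variants if passes_filters(v)])
# ['v1', 'v4']
```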

66 Variant Prioritisation
GENOMIC CONTEXT:
- Non-synonymous (missense) variants may affect protein function
- Nonsense variants are almost always functional
- Large indels affect function; frameshift indels are almost always functional
- Splice sites are sensitive to mutation
- Stop-gain/loss, frameshift and splice-site variants are automatically interesting
- UTR variants don't affect the protein sequence but may have an effect if the mutation is in a regulatory element
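The genomic-context reasoning on this slide can be turned into a simple ranking of annotated variants. The consequence labels below are Sequence Ontology terms of the kind annotation tools report, but the tier numbers are an assumption chosen to mirror the slide, not a standard scoring scheme.

```python
# Lower tier = more likely to be functional (tiers are illustrative)
CONSEQUENCE_TIER = {
    "stop_gained": 1,
    "stop_lost": 1,
    "frameshift_variant": 1,
    "splice_donor_variant": 1,
    "splice_acceptor_variant": 1,
    "missense_variant": 2,
    "inframe_insertion": 2,
    "inframe_deletion": 2,
    "5_prime_UTR_variant": 3,
    "3_prime_UTR_variant": 3,
    "synonymous_variant": 4,
}
UNKNOWN_TIER = 5  # anything not listed above goes last


def rank_by_consequence(variants):
    """Sort variants so the most disruptive consequences come first."""
    return sorted(
        variants,
        key=lambda v: CONSEQUENCE_TIER.get(v["consequence"], UNKNOWN_TIER),
    )


# Hypothetical annotated variants
annotated = [
    {"id": "v1", "consequence": "synonymous_variant"},
    {"id": "v2", "consequence": "missense_variant"},
    {"id": "v3", "consequence": "stop_gained"},
    {"id": "v4", "consequence": "3_prime_UTR_variant"},
]
print([v["id"] for v in rank_by_consequence(annotated)])
# ['v3', 'v2', 'v4', 'v1']
```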

67 Variant Prioritisation
FUNCTIONAL PREDICTION:
- Many different algorithms: SIFT, PolyPhen, LRT, MutationTaster, FATHMM, MutationAssessor, MetaSVM, MetaLR etc
- Key word is PREDICTION
CONSERVATION:
- Constraint in a genomic region implies non-redundancy
- Variants in regions that are highly conserved across species are likely to be in genes that serve important biological functions

68 Acknowledgements: These slides were produced by Dr Colleen J. Saunders for the 2016 H3ABioNet Introduction to Bioinformatics Online Course and are distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence. Dr Saunders is supported by a research fellowship funded by the South African Department of Science and Technology and the National Research Foundation. This fellowship is held at the South African National Bioinformatics Institute which houses the South African MRC Unit for Bioinformatics Capacity Development.