Introduction to Next Generation Sequencing (NGS). Andrew Parrish, Exeter, 2nd November 2017
Topics to cover today: What is Next Generation Sequencing (NGS)? Why do we need NGS? Common approaches to NGS. NGS workflow.
What is Next Generation Sequencing (NGS)? Historically we have used Sanger sequencing to investigate genetic diseases. This looks at one stretch of DNA from one patient at a time (~600 base pairs in length). It measures the fluorescence given off when dye-labelled nucleotides are excited by a laser to determine the order of bases.
What is Next Generation Sequencing (NGS)? NGS is also referred to as high-throughput sequencing or massively parallel sequencing. It generates hundreds of millions of overlapping short sequences (up to 300 bp) in a single run. These have to be computationally put back together. It can look at multiple patients in one run.
Why do we need Next Generation Sequencing (NGS)? The Human Genome Project took 15 years to complete using Sanger-based technology, at an estimated cost of $3 billion. Today, using NGS, this could be completed in a day or two for under $1000.
Common approaches to NGS. Targeted panels (tNGS): pull out specific genes from the patient's DNA and obtain sequence data only from these genes (up to about 150 genes). Rare disease / Mendeliome / clinical exome: essentially a very large panel (6,110 genes) that looks at the exons of genes known to cause human disease (at the time of design!). Whole exome: looks at the exons of 23,244 expressed genes, which together make up 1-2% of the human genome. Genome sequencing: looks at the complete (ish) DNA sequence from a patient.
Common approaches to NGS. Single gene disease: easily clinically recognisable disease; single genetic aetiology (mutations in one gene cause this disorder); existing tests widely available in diagnostic laboratories. Small number of genes for a disease: clinically recognisable disease; multiple sub-types caused by mutations in different genes; highly developed clinical expertise and knowledge available in specialist centres. Large number of possible causes (or no known cause): strong suggestion of monogenic disease, but no clear clue to which gene to test.
Workflow for NGS: Patient -> Extract DNA, prepare library and sequence -> Raw reads (FASTQ) -> Assess quality and process reads -> Processed reads (FASTQ) -> Map to reference genome -> Aligned reads (SAM/BAM) -> Assess depth and breadth of coverage -> Call variants (VCF) -> Variant and sample quality control -> Annotate variants -> Filter and prioritise variants -> Integrate with clinical data -> Shortlist of disease related variants -> Diagnostic report. Side steps: Visualise data.
Workflow for NGS (summary): Patient -> Extract DNA, prepare library and sequence -> Quality control -> Map to reference genome -> Call variants -> Annotate variants -> Shortlist of disease related variants -> Diagnostic report. Side steps: Visualise data.
DNA extraction and library preparation: Genomic DNA -> Fragment -> Target -> Attach adaptors for paired-end sequencing
Sequencing
Read mapping. After base calling, align/map sequences onto the reference genome. Determine coordinates (chromosomal position) and add basic annotations (coding, non-coding, etc.) if known.
Reads:                     ATCTTGTAGG   GAAACACAAAGTG      GTCTAGGGAAGAAGG..
Reference genome: TAGTACCCCATCTTGTAGGTCTGAAACACAAAGTGTGGGGTGTCTAGGGAAGAAGGTGTGTGACCAGGGAGGTCCC..
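As a toy illustration of read mapping, the three reads on this slide can be located in the reference fragment by exact string search. This is only a sketch: real aligners use indexed, mismatch-tolerant algorithms, and the function name `map_reads` is hypothetical.

```python
# Reference fragment and reads copied from the slide.
reference = (
    "TAGTACCCCATCTTGTAGGTCTGAAACACAAAGTGTGGGGTGTCTAGGGAAGAAGG"
    "TGTGTGACCAGGGAGGTCCC"
)
reads = ["ATCTTGTAGG", "GAAACACAAAGTG", "GTCTAGGGAAGAAGG"]


def map_reads(reads, reference):
    """Return the 0-based position of each read's first exact match.

    A position of -1 means the read was not found (unmapped).
    """
    return {read: reference.find(read) for read in reads}


positions = map_reads(reads, reference)
for read, pos in positions.items():
    print(f"{read} maps at 0-based position {pos}")
```

Exact matching fails as soon as a read carries a variant or a sequencing error, which is why practical mappers allow mismatches and gaps.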
Read mapping
Coverage. Vertical coverage (depth): how many times a particular base has been sequenced (e.g. 20X, 30X). Greater depth of coverage means improved accuracy for variant detection (but is more expensive). Horizontal coverage (breadth): how much of the genome has been sequenced. Greater target size means more of the genome is sequenced (but is more expensive).
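The two kinds of coverage can be computed from aligned read positions. The sketch below assumes each read is given as a half-open (start, end) interval on a single target region; the function name `coverage_stats` and the example intervals are hypothetical.

```python
def coverage_stats(read_intervals, target_start, target_end):
    """Return (mean depth, breadth) over the target region.

    read_intervals: list of (start, end) half-open intervals, 0-based.
    Mean depth is the average number of reads covering each target base
    (vertical coverage); breadth is the fraction of target bases covered
    by at least one read (horizontal coverage).
    """
    target_length = target_end - target_start
    depth = [0] * target_length
    for start, end in read_intervals:
        # Clip each read to the target before counting.
        for pos in range(max(start, target_start), min(end, target_end)):
            depth[pos - target_start] += 1
    mean_depth = sum(depth) / target_length
    breadth = sum(1 for d in depth if d > 0) / target_length
    return mean_depth, breadth


# Two overlapping 10 bp reads on a 20 bp target.
mean_depth, breadth = coverage_stats([(0, 10), (5, 15)], 0, 20)
print(mean_depth, breadth)
```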
Coverage
Variant calling
Variant calling (example VCF records)
#CHROM  POS    ID  REF  ALT  QUAL  FILTER  INFO                     FORMAT          germline
chr4    27668  .   T    C    8.65  .       DP=2;AF1=1;AC1=4;        GT:PL:DP:SP:GQ  0/1:0,0,0:0:0:3
chr4    27669  .   G    T    4.77  .       DP=2;AF1=1;AC1=4;        GT:PL:DP:SP:GQ  0/1:0,0,0:0:0:3
chr4    27712  .   T    C    44    .       DP=2;AF1=1;AC1=4;        GT:PL:DP:SP:GQ  1/1:40,3,0:1:0:8
chr4    27774  .   G    A    5.47  .       DP=2;AF1=0.5011;AC1=2;   GT:PL:DP:SP:GQ  0/1:34,0,23:2:0:28
chr4    36523  .   A    T    10.4  .       DP=1;AF1=1;AC1=4;        GT:PL:DP:SP:GQ  0/1:0,0,0:0:0:3
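To show how the columns of a VCF record fit together, here is a minimal parsing sketch for a single-sample record like those on this slide. It splits on any whitespace because the slide text is space-separated (real VCF files are tab-delimited), and the function name `parse_vcf_record` is hypothetical; production code would normally use an established VCF library.

```python
def parse_vcf_record(line):
    """Parse one single-sample VCF data line into a dictionary."""
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt, sample = line.split()[:10]
    # INFO is a semicolon-separated list of KEY=VALUE pairs.
    info_fields = dict(
        item.split("=", 1) for item in info.strip(";").split(";") if "=" in item
    )
    # FORMAT names the colon-separated values in the sample column.
    sample_fields = dict(zip(fmt.split(":"), sample.split(":")))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "filter": filt,
        "info": info_fields,
        "sample": sample_fields,
    }


record = parse_vcf_record(
    "chr4 27774 . G A 5.47 . DP=2;AF1=0.5011;AC1=2; GT:PL:DP:SP:GQ 0/1:34,0,23:2:0:28"
)
print(record["chrom"], record["pos"], record["sample"]["GT"])
```

In the parsed record, GT is the genotype call (0/1 is heterozygous, 1/1 is homozygous for the alternate allele) and DP in the INFO column is the read depth at that position.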
Variants. A variant is a DNA sequence that is different to the normal sequence for a particular species. Variants should be named according to standardised nomenclature (HGVS). This allows consistent reporting and must include: the reference sequence, e.g. NM_0000123.4; the cDNA change, e.g. c.123A>G; the protein change, e.g. p.(V59M) or p.(Val59Met).
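A rough sketch of how the reference sequence and cDNA change combine into one machine-checkable description. It assumes the two parts are joined with a colon (as in HGVS-style descriptions) and handles only simple coding substitutions on NM_ transcripts; the full nomenclature covers many more variant types, and the function name `parse_simple_hgvs` is hypothetical.

```python
import re

# Matches only descriptions of the form NM_<digits>.<version>:c.<position><ref>><alt>
SIMPLE_SUBSTITUTION = re.compile(r"^(NM_\d+\.\d+):c\.(\d+)([ACGT])>([ACGT])$")


def parse_simple_hgvs(description):
    """Split a simple coding substitution into its parts, or return None."""
    match = SIMPLE_SUBSTITUTION.match(description)
    if match is None:
        return None
    reference, position, ref_base, alt_base = match.groups()
    return {
        "reference": reference,
        "position": int(position),
        "ref_base": ref_base,
        "alt_base": alt_base,
    }


# Example values taken from the slide.
parsed = parse_simple_hgvs("NM_0000123.4:c.123A>G")
print(parsed)
```

A description without a versioned reference sequence, such as a bare c.123A>G, is rejected here, which mirrors the point that the reference sequence must always be reported.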
Variant types
Reference: The sun was hot but the man did not get his hat.
SNV (a change to a single base pair):
  The sun wos hot but the man did not get his cat.
  The sun was.ot but the man did not get his hat.
Small insertion/deletion (InDel), in frame:
  The sun hot but the man did not get his hat.
  The sun was too hot but the man did not get his hat.
Small insertion/deletion (InDel), frameshift:
  The sun wah otb utt hem and idn otg eth ish at
  The sun wwa sho tbu tth ema ndi dno tge thi sha t
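The sentence analogy works because every word has three letters, like a codon. The sketch below reproduces the in-frame and frameshift examples by editing the letter string and re-reading it in groups of three; the function name `read_in_triplets` is hypothetical.

```python
def read_in_triplets(letters):
    """Re-read a string of letters three at a time, like codons."""
    return " ".join(letters[i:i + 3] for i in range(0, len(letters), 3))


reference = "thesunwashotbutthemandidnotgethishat"

# In-frame deletion: remove a whole three-letter word ("was").
in_frame_deletion = reference[:6] + reference[9:]

# Frameshift deletion: remove a single letter (the "s" of "was").
frameshift_deletion = reference[:8] + reference[9:]

# Frameshift insertion: add a single letter (an extra "w" before "was").
frameshift_insertion = reference[:6] + "w" + reference[6:]

for letters in (reference, in_frame_deletion, frameshift_deletion, frameshift_insertion):
    print(read_in_triplets(letters))
```

Removing three letters leaves the remaining words intact, while removing or adding one letter scrambles every word downstream of the change.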
Variant pathogenicity. A variant is pathogenic if it interferes with normal protein production. There are many ways that this can happen! Diagram labels: regulatory region; change amino acid; change splice site, add intron; frameshift, causing stop codon later; new stop codon; change splice site, remove exon; stop codon.
Variant prioritisation. Frameshift and stop gain (nonsense substitution) variants are highly likely to be pathogenic. Splicing variants are likely to be pathogenic, but need checking with a splicing predictor. Missense variants can be pathogenic, and there are in silico tools to predict the effect, which depends on how the amino acids are changed. Synonymous substitutions are very unlikely to be pathogenic unless they affect splicing.
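The ordering described on this slide can be expressed as a simple ranking. This is a sketch only: the consequence labels, the numeric ranks, and the example variants are hypothetical, and a real pipeline would combine this with predictor scores and clinical review.

```python
# Lower rank means higher priority for review (labels are assumptions).
CONSEQUENCE_RANK = {
    "frameshift": 0,
    "stop_gained": 0,
    "splice": 1,
    "missense": 2,
    "synonymous": 3,
}


def prioritise(variants):
    """Sort variants so the most likely pathogenic consequences come first.

    Unknown consequence labels are placed last.
    """
    lowest_priority = len(CONSEQUENCE_RANK)
    return sorted(
        variants,
        key=lambda v: CONSEQUENCE_RANK.get(v["consequence"], lowest_priority),
    )


# Hypothetical annotated variants.
variants = [
    {"name": "variant_a", "consequence": "synonymous"},
    {"name": "variant_b", "consequence": "missense"},
    {"name": "variant_c", "consequence": "frameshift"},
    {"name": "variant_d", "consequence": "splice"},
]
ranked = prioritise(variants)
print([v["name"] for v in ranked])
```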
Variant prioritisation: ~30,000 variants -> Exclude common variants -> Identify potential pathogenic mutation(s) -> Causal mutation(s)
Variant annotation and filtering. We can pull information in from a variety of external sources, including: Population databases, e.g. ExAC and dbSNP. These provide an approximation of the variants that are common in the population, which may be excluded from consideration. Disease databases, e.g. HGMD. These provide a list of known disease-causing mutations seen in a variety of settings, which may be a flag for prioritisation. In silico analysis packages, e.g. SIFT and PolyPhen. Phenotypic terms provided by the clinician using HPO.
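A minimal sketch of how population and disease annotations might drive filtering. The field names (`population_af`, `in_disease_db`), the 1% frequency threshold, and the example variants are all assumptions made for illustration, not values taken from ExAC, dbSNP or HGMD.

```python
def filter_and_flag(variants, max_population_af=0.01):
    """Drop common variants and flag those listed in a disease database.

    Variants with no population frequency (None or missing) are kept,
    since absence from a population database suggests they are rare.
    """
    shortlist = []
    for variant in variants:
        allele_frequency = variant.get("population_af")
        if allele_frequency is not None and allele_frequency > max_population_af:
            continue  # common in the population, excluded from consideration
        flagged = bool(variant.get("in_disease_db"))
        shortlist.append({**variant, "flagged": flagged})
    return shortlist


# Hypothetical annotated variants.
annotated = [
    {"name": "variant_a", "population_af": 0.25, "in_disease_db": False},
    {"name": "variant_b", "population_af": 0.0001, "in_disease_db": True},
    {"name": "variant_c", "population_af": None, "in_disease_db": False},
]
shortlist = filter_and_flag(annotated)
print([(v["name"], v["flagged"]) for v in shortlist])
```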
Questions?