Variant prioritization in NGS studies: Annotation and Filtering


1 Variant prioritization in NGS studies: Annotation and Filtering
Colleen J. Saunders (PhD)
DST/NRF Innovation Postdoctoral Research Fellow, South African National Bioinformatics Institute/MRC Unit for Bioinformatics Capacity Development, University of the Western Cape

2 LEARNING OUTCOMES
By the end of this session you should:
- Have a detailed understanding of the variant prioritization pipeline
- Know what tools are available to annotate NGS .vcf files
- Understand the concepts of filtering annotated .vcf files
- Have some practical experience of annotating and filtering .vcf files
Useful learning outcomes that are outside the scope of this workshop:
- A working knowledge of manipulating files via the command line in Linux
- Proficiency in a programming language such as Python, Ruby, Perl etc.

3 DISEASE VARIANT DISCOVERY
- A typical WGS experiment yields ~1-1.5 million variants; WES ~ variants
- How do we filter these to identify those most likely to affect protein function or expression?
  - Which variants have impact?
  - How are those identified?
- How can those variants be further filtered to identify the one(s) that are likely to cause this disease and are good candidates for further investigation?

4 WHOLE EXOME SEQUENCING
Different study designs:
- Single individual
- Trio: affected child and parents
- Family: affected and unaffected individuals
- Disease vs normal tissue, e.g. cancer
- Cohort of unrelated cases versus controls
These require different statistical and data processing pipelines but, for ALL designs, a large number of variants are called!

5 VARIANT CALL FORMAT

6 VARIANT PRIORITIZATION
Variant level:
- Remove common variants
- Variants that change the amino acid
- Variants that have a functional effect
Gene level:
- SNPs in biologically plausible candidate genes

7 Annotation
http://

8 Annotation
- Options to work with GRCh37 co-ordinates
- Number of different input options
- Can customize the output

9 Annotation

10 Annotation
Ensembl VEP:
- Easy to use
- Attractive interface
- Customizable
- Output files are easy to manipulate
- Lots of support
http://

11 Annotation
- Part of the vtools suite
- Project based
- Customizable
- Not very user friendly
- Command line tool
http://varianttools.sourceforge.net/annotation/homepage

12 Annotation
- Command line tool
- Easy to use
- Output is customizable
- Lots of support (GATK forums)
https://

13 Annotation
- Main package is written in Perl
- Command line tool
- Extensive documentation and tutorials
- Updated regularly
- Gene-based, region-based and filter-based annotation
- Output is customizable
http://annovar.openbioinformatics.org/en/latest/#annovar-documentation

14 Annotation
- Output is a tab-separated file
- Easy to manipulate using the command line (large files) or Excel (small variant sets)
Output includes:
- RefSeq annotation: genomic context, gene detail
- Region-based annotation: in a region implicated in GWAS, ENCODE regions, TF binding sites, located in enhancer/repressor elements etc.
- Filter-based annotation:
  - dbSNP identifiers
  - MAF: ExAC, 1000g, ESP
  - Functional prediction: SIFT, PolyPhen-2, LRT, MutationTaster, MutationAssessor, FATHMM (coding only), MetaSVM, MetaLR
  - Conservation scores: GERP++, PhyloP, SiPhy
  - Clinical significance: ClinVar, COSMIC
http://annovar.openbioinformatics.org/en/latest/#annovar-documentation
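Because the output is a plain tab-separated table, it can also be read with a few lines of Python. The sketch below is only an illustration: the rows are made up, and the column names (Func.refGene, Gene.refGene, ExonicFunc.refGene) are assumed to follow refGene-style headers and may differ in your own output.

```python
import csv
import io

# Made-up example rows. The column names are an assumption about the header
# of a refGene-style annotation table; check the header of your own file.
EXAMPLE_TSV = (
    "Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\tExonicFunc.refGene\n"
    "1\t1000\t1000\tA\tG\texonic\tGENE1\tnonsynonymous SNV\n"
    "1\t2000\t2000\tC\tT\tintronic\tGENE1\t.\n"
    "2\t3000\t3000\tG\tA\texonic\tGENE2\tsynonymous SNV\n"
)


def read_annotations(handle):
    """Return a list with one dict per annotated variant (header -> value)."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_annotations(io.StringIO(EXAMPLE_TSV))

# Keep only variants whose genomic context is exonic.
exonic = [row for row in rows if row["Func.refGene"] == "exonic"]
print(len(rows), "variants read,", len(exonic), "exonic")
```

For a real file, the same function can be given an open file handle, e.g. open("annotated.txt", newline=""), where the file name is hypothetical.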

15 Annotation
There's a user-friendly web application!
http://wannovar.usc.edu/

16 Filtering
QUALITY:
- Low quality variant calls are likely to be sequencing errors
- Filter out low quality variants, indicated by the QUAL score in the .vcf
INHERITANCE PATTERN:
- Dependent on study design
- Filter on inheritance pattern in clinical NGS experiments
- What genotypes would you expect if the disease follows these inheritance patterns:
  - Autosomal recessive?
  - Autosomal dominant?
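As a rough sketch of both filters for a trio, the functions below check the QUAL column of a VCF data line and test genotype strings against two simple models. Several things are assumptions made for illustration only: the QUAL cut-off of 30, full penetrance, unaffected parents, and the reading of "dominant" as a de novo heterozygous variant in the child. Other designs (e.g. an affected parent) need different rules.

```python
def passes_quality(vcf_line, min_qual=30.0):
    """True if QUAL (6th tab-separated VCF column) is at least min_qual.

    min_qual=30.0 is an arbitrary illustrative cut-off, not a recommendation.
    A missing QUAL (".") is treated as failing the filter.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    return qual != "." and float(qual) >= min_qual


def alt_count(genotype):
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'.

    Missing alleles ('.') are not counted, so './.' gives 0; real pipelines
    should handle missing genotypes explicitly.
    """
    alleles = genotype.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


def fits_recessive(child, mother, father):
    """Affected child homozygous alt, both unaffected parents heterozygous."""
    return (alt_count(child) == 2
            and alt_count(mother) == 1
            and alt_count(father) == 1)


def fits_de_novo_dominant(child, mother, father):
    """Affected child heterozygous, neither unaffected parent carries the allele."""
    return (alt_count(child) == 1
            and alt_count(mother) == 0
            and alt_count(father) == 0)


example_line = "1\t1000\t.\tA\tG\t50\tPASS\t.\n"  # made-up VCF data line
print(passes_quality(example_line))
print(fits_recessive("1/1", "0/1", "0|1"))
print(fits_de_novo_dominant("0/1", "0/0", "0/0"))
```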

17 Filtering
GENOMIC CONTEXT:
- Synonymous exonic variants are considered silent and can be discarded
- Non-synonymous (missense) variants may affect protein function, but an amino acid change does not automatically imply deleteriousness
- Nonsense variants are almost always functional
- Large indels affect function; frameshift indels are almost always functional
- Splice sites are sensitive to mutation
- Stop-gain/loss, frameshift and splice-site variants are automatically interesting
- UTR variants don't affect the protein sequence; they only have an effect if the mutation is in a regulatory element, and there are often thousands to evaluate
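The rules above can be sketched as a simple lookup. The consequence labels are assumed to resemble ANNOVAR-style terms, and the four outcome names ("keep", "evaluate", "discard", "review") are hypothetical labels chosen for this example.

```python
# Assumed ANNOVAR-style consequence labels; adjust to your annotation tool.
AUTOMATICALLY_INTERESTING = {
    "stopgain",
    "stoploss",
    "frameshift insertion",
    "frameshift deletion",
    "splicing",
}
NEEDS_EVALUATION = {"nonsynonymous SNV"}  # missense: may or may not be deleterious
SILENT = {"synonymous SNV"}


def triage(consequence):
    """Map a consequence label to a hypothetical triage decision."""
    if consequence in AUTOMATICALLY_INTERESTING:
        return "keep"
    if consequence in NEEDS_EVALUATION:
        return "evaluate"
    if consequence in SILENT:
        return "discard"
    # Everything else (e.g. UTR variants) is left for case-by-case review.
    return "review"


for label in ("stopgain", "nonsynonymous SNV", "synonymous SNV", "UTR3"):
    print(label, "->", triage(label))
```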

18 Filtering
MINOR ALLELE FREQUENCY:
- In Mendelian or rare diseases we are looking for rare variants!
- Filtering on minor allele frequency drastically reduces the data set
- Frequency cut-offs are study dependent
  - Rare diseases: 1% is good
  - Common multifactorial diseases: ???
Important concepts related to the reference genome used to call variants:
- Is this allele rare/uncommon in MY population?
- What if the alternate allele is present in the reference genome sequence (hg19!!)?
  - If your sample is homozygous for the alternate allele, it won't be called as a variant
- Check both sides of the frequency spectrum
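A minimal sketch of a frequency filter that looks at both ends of the spectrum is given below. It assumes the annotation supplies the population frequency of the alternate allele (or nothing when the variant is absent from the database), uses the 1% cut-off mentioned above as a default, and treats unseen variants as rare; all three are choices made for illustration.

```python
def is_rare(alt_freq, cutoff=0.01):
    """True if the alternate allele is rarer than cutoff.

    alt_freq is the population frequency of the alternate allele, or None
    when the variant is absent from the frequency database (assumed rare).
    """
    if alt_freq is None:
        return True
    return alt_freq < cutoff


def reference_is_minor(alt_freq, cutoff=0.01):
    """True if the alternate allele is so common that the REFERENCE allele
    is the rare one, i.e. the other side of the frequency spectrum."""
    return alt_freq is not None and alt_freq > 1.0 - cutoff


for freq in (None, 0.005, 0.2, 0.995):
    print(freq, is_rare(freq), reference_is_minor(freq))
```

Sites flagged by reference_is_minor are the ones where the reference genome itself carries the uncommon allele, so they deserve a second look rather than automatic removal.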

19 Filtering
FUNCTIONAL PREDICTION:
- In rare/Mendelian disease: nsSNVs are only kept if they're predicted deleterious by at least one algorithm (SIFT, PolyPhen, LRT, MutationTaster, FATHMM, MutationAssessor, MetaSVM, MetaLR etc.)
- Common multifactorial disease? A variant may be functional even if it is not predicted to be deleterious
- What if you come across highly deleterious, highly penetrant disease variants not related to your disease (e.g. in the BRCA genes)? Have a strategy to deal with this BEFORE you start
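The "at least one algorithm" rule can be sketched as follows. The tool keys and the single-letter codes treated as deleterious are assumptions about how the categorical predictions are encoded in the annotation; verify them against the documentation of the database you actually use.

```python
# Assumed encoding of categorical predictions (tool -> codes counted as
# deleterious). These are illustrative and must be checked for your data.
ASSUMED_DELETERIOUS_CODES = {
    "SIFT": {"D"},
    "PolyPhen2": {"D", "P"},
    "MutationTaster": {"A", "D"},
}


def predicted_deleterious(predictions, codes=ASSUMED_DELETERIOUS_CODES):
    """True if at least one known tool calls the variant deleterious.

    predictions maps a tool name to its categorical call; tools missing
    from `codes` are ignored.
    """
    return any(
        call in codes.get(tool, set())
        for tool, call in predictions.items()
    )


print(predicted_deleterious({"SIFT": "T", "PolyPhen2": "D"}))
print(predicted_deleterious({"SIFT": "T", "PolyPhen2": "B"}))
```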

20 Filtering
CONSERVATION:
- Constraint in a genomic region implies nonredundancy
- Variants in regions that are highly conserved across species are likely to be in genes that serve important biological functions
- PhyloP, SiPhy, GERP++ etc.
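A conservation filter can be sketched in the same way, assuming higher scores mean stronger conservation. The minimum scores below are placeholders chosen purely for illustration, not recommended cut-offs, and the score names are assumed dictionary keys rather than fixed column names.

```python
# Placeholder minimum scores, NOT recommended values; set these per study.
ASSUMED_MIN_SCORES = {"GERP++": 2.0, "PhyloP": 1.0}


def is_conserved(scores, min_scores=ASSUMED_MIN_SCORES):
    """True if any available score reaches its assumed minimum.

    scores maps a score name to a float, or to None when it is missing.
    """
    for name, minimum in min_scores.items():
        value = scores.get(name)
        if value is not None and value >= minimum:
            return True
    return False


print(is_conserved({"GERP++": 4.5, "PhyloP": None}))
print(is_conserved({"GERP++": -1.0, "PhyloP": 0.2}))
```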

21 VARIANT PRIORITIZATION
Variant level:
- Remove common variants
- Variants that change the amino acid
- Variants that have a functional effect
Gene level:
- SNPs in biologically plausible candidate genes