In silico variant analysis: Challenges and Pitfalls


1 In silico variant analysis: Challenges and Pitfalls. Fiona Cunningham, Variation annotation coordinator, EMBL-EBI

2 Sequencing -> Variants -> Interpretation. Variant types: structural variants, SNPs, in-dels. What is known about these variants? What can you say about unknown variants?

3 Deciphering gene-disease relationships: 3 billion bases -> 4 million variants -> 21,000 coding variants -> missense and protein truncating variants -> ??? -> possible disease-causing variants

5 New assembly, GRCh38 = new variants: 3.6 Mb of novel sequence; 153 genes that are only on alts

7 Discrepant variant calling, RefSeq vs GENCODE: BRCA2 transcript SNP rs c.7397. RefSeq transcript NM_: C>T (ancestral allele, non-reference). GENCODE / Ensembl transcript ENST: T>C (non-ancestral but in GRCh38).

8 Discrepant variant calling, RefSeq vs GENCODE: BRCA2 transcript SNP rs c.7397. ENST: SNP called in 91% of AFR. NM_: SNP called in 9% of AFR.
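
The effect of the reference choice on these two figures can be sketched numerically. This is a minimal illustration only: it assumes the 91% and 9% on the slide are allele frequencies in AFR (the slide does not say whether they count alleles or individuals), and the function name and frequency table are hypothetical.

```python
def non_reference_frequency(allele_freqs, reference_allele):
    """Sum the frequencies of every allele that differs from the reference."""
    if reference_allele not in allele_freqs:
        raise ValueError(f"unknown reference allele: {reference_allele}")
    return sum(f for a, f in allele_freqs.items() if a != reference_allele)


# Assumed AFR allele frequencies at the BRCA2 c.7397 site (illustrative only).
afr = {"C": 0.91, "T": 0.09}

# The GENCODE / Ensembl transcript carries T, so C is reported as the variant.
ensembl_view = non_reference_frequency(afr, "T")
# The RefSeq transcript carries C, so T is reported as the variant.
refseq_view = non_reference_frequency(afr, "C")
print(ensembl_view, refseq_view)
```

The same population data yields a common or a rare "variant" depending only on which transcript sequence is treated as the reference.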

10 Ensembl Variant Effect Predictor (VEP): in silico analysis of variants. VEP predicts consequences of all variants: SNPs, indels, and structural variants (insertion, deletion, duplication, tandem duplication); for effects on coding and non-coding regions; in any species. Flexible and extensible. Commitment to user support. McLaren et al (Bioinformatics), McCarthy et al (Genome Medicine)

11 [Diagram: Ensembl dbs (Core, Compara, Variants, Regulation) alongside ESP and OMIM]

12 VEP is built on Ensembl

14 VEP web - input form

15 VEP web - output

16 Information. For known variants: natural variation data (allele frequencies, ethnicity, MAF from 1000 Genomes, ESP populations, ExAC); clinical significance data (ClinVar); LOVD data. For all variants: gene and transcript identifiers, exon and intron numbers; consequence (SO terms), SIFT, PolyPhen; genomic, cDNA, CDS and protein coordinates; amino acid and codon change; TFBS: position within motif, whether it is a high-information position, motif score change; HGVS nomenclature.

17 VEP: instant, web, script, REST. Interfaces: Instant VEP; web interface; Perl script; REST API; XML. Maximum speed: up to 3,000,000 variants an hour. Perl script: most extensible and flexible, off-line for private data. REST is optimal for integration into other systems. 15,000 variants per second. End points for variants and SVs. McLaren et al (Bioinformatics), McCarthy et al (Genome Medicine)
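
As an illustration of the REST route, the sketch below queries the public Ensembl REST server through its `vep/:species/id/:id` endpoint using only the Python standard library. The helper names are hypothetical, and the response fields read here (`input`, `most_severe_consequence`) are assumed from the JSON the service returns; check the current REST documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

SERVER = "https://rest.ensembl.org"  # public Ensembl REST server


def vep_id_url(variant_id, species="human"):
    """Build the URL for the GET vep/:species/id/:id endpoint."""
    return (f"{SERVER}/vep/{urllib.parse.quote(species)}"
            f"/id/{urllib.parse.quote(variant_id)}")


def fetch_vep(variant_id, species="human", timeout=30):
    """Query the endpoint and return the decoded JSON (needs network access)."""
    request = urllib.request.Request(
        vep_id_url(variant_id, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def most_severe(records):
    """Collect (input, most_severe_consequence) pairs from a VEP response."""
    return [(r.get("input"), r.get("most_severe_consequence")) for r in records]


# Example call (requires network access), using a known variant identifier:
# print(most_severe(fetch_vep("rs12345")))
```

Keeping URL construction separate from the network call makes the request easy to inspect and to test off-line.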

18 VEP plugins

19 Deciphering gene-disease relationships: 3 billion bases -> 4 million variants -> 21,000 coding variants -> missense, loss of function and protein truncating variants -> G2P -> ??? -> possible disease-causing variants

20 Integrating curated data: Gene2Phenotype (G2P). Collaboration with David FitzPatrick and Helen Firth

21 Gene2phenotype: search

22 Gene2phenotype: data

23 G2P: Gene to Phenotype Database. Panels: DD G2P, Cardiac G2P, Ear G2P, Eye G2P, Skin G2P

24 Acknowledgements. Funding: European Commission Framework Programme 7

25 Acknowledgements. G2P: Anja Thormann, David FitzPatrick, Helen Firth

27 Future. Regulatory regions and eQTLs: nearest gene plugin; using LD to infer eQTLs; GTEx project data. Splicing for VEP: dbscSNV plugin. Indels by SIFT (e.g. PROVEAN). Protein structure, pathways (Reactome, PDBe). Integration with gene lists, e.g. DDG2P.

28 Deciphering gene-disease relationships: 3 billion bases -> 4 million variants -> 21,000 coding variants -> 10,000 non-synonymous variants -> loss of function variants
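
The narrowing funnel on this slide can be sketched as a series of successively stricter filters over annotated variants. The groupings of Sequence Ontology (SO) terms below are assumptions made for illustration (the slide does not define them, and splice-site variants, often counted as loss of function, are left out for simplicity); the function name and example data are hypothetical.

```python
from collections import Counter

# Assumed groupings of SO consequence terms; the slide does not define them.
LOSS_OF_FUNCTION = {"stop_gained", "frameshift_variant"}
NON_SYNONYMOUS = LOSS_OF_FUNCTION | {
    "missense_variant", "inframe_insertion", "inframe_deletion",
    "stop_lost", "start_lost",
}
CODING = NON_SYNONYMOUS | {"synonymous_variant"}


def funnel(variants):
    """Count how many variants survive each successively stricter filter."""
    counts = Counter()
    for _variant_id, consequence in variants:
        counts["all"] += 1
        if consequence in CODING:
            counts["coding"] += 1
        if consequence in NON_SYNONYMOUS:
            counts["non_synonymous"] += 1
        if consequence in LOSS_OF_FUNCTION:
            counts["loss_of_function"] += 1
    return dict(counts)


# Hypothetical annotated variants: (identifier, most severe consequence).
example = [
    ("v1", "intron_variant"),
    ("v2", "synonymous_variant"),
    ("v3", "missense_variant"),
    ("v4", "stop_gained"),
    ("v5", "intergenic_variant"),
]
print(funnel(example))
```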

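29 How VEP works [diagram]: input variants are compared against the Ensembl dbs (Core, Regulatory, Variants). Locations shown: 5' UTR, intronic, regulatory. Known variant: ID: rs12345, MAF: 0.05, PubMed: , . Coding change: Ref TTG AAC CAT (Leu Asn His), Alt TTG AAA CAT (Leu Lys His), missense.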
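
The coding example on this slide (AAC to AAA, Asn to Lys) can be reproduced with a small sketch that translates the reference and alternate codons with the standard genetic code and compares the amino acids. This is a simplification for illustration, not how VEP itself is implemented: it handles a single in-frame codon only, and the function name is hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code: codon -> one-letter amino acid ('*' marks a stop).
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_change(ref_codon, alt_codon):
    """Return (ref amino acid, alt amino acid, SO consequence term)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        consequence = "synonymous_variant"
    elif alt_aa == "*":
        consequence = "stop_gained"
    elif ref_aa == "*":
        consequence = "stop_lost"
    else:
        consequence = "missense_variant"
    return ref_aa, alt_aa, consequence


# The middle codon from the slide: AAC (Asn, N) becomes AAA (Lys, K).
print(classify_codon_change("AAC", "AAA"))  # ('N', 'K', 'missense_variant')
```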

30 VEP script - advantages: faster; off-line access; any species; your data is secure; additional datasets; extend functionality

31 Link to variants in Ensembl with disease

32 VEP Regulatory data

33 VEP cell types. Regulatory build: 17 cell types, segmentation analysis