Aligning GENCODE and RefSeq transcripts By EMBL-EBI and NCBI


1 Aligning GENCODE and RefSeq transcripts. By EMBL-EBI and NCBI. Joannella Morales, Ph.D., LRG Project Manager

2 Outline: I will present a recent effort between the EBI and the NCBI to select and align a joint set of transcripts across the genome. Background; Strategy; Progress; Next Steps

3 Background: The proper interpretation of clinically relevant variants depends on accurate genome annotation. Annotation: identification and characterisation of functional elements in a genome (protein-coding genes, non-coding RNAs, pseudogenes; introns/exons; CDS; UTRs; regulation). The clinical relevance of a variant only becomes clear when we understand the biology at a given locus.

4 Transcript Sources: RefSeq and Ensembl/GENCODE. NCBI's RefSeq: NMs: manually annotated; XM: automatically produced. Transcripts don't necessarily match the genome assembly: they represent a prevalent, 'standard' allele but not the reference. Independent of reference assembly changes. Clinical annotation predominantly done using RefSeq transcripts. Ensembl/GENCODE: ENSTs: more manually reviewed transcripts. Must match the reference genome. On average, more transcripts per gene. Reference set for gnomAD/ExAC, GTEx, DECIPHER, 100,000 Genomes Project, COSMIC, ICGC.

5 Challenges: Transcript choice. Each set has distinctive features and both are currently in use. Often there are numerous transcripts per locus, but one is required for consistent and unambiguous reporting. How do we pick one? Example: PAX6 (eye disorders): 11 NMs, 81 ENSTs. Traditionally, the longest (encoding the most exons) has been used. However, this may not be the most relevant.

6 Challenge: Transcript identity. The relationship between the two sets is unclear. The Consensus CDS project (CCDS) is an effort to harmonise the two. However, identity is often assumed: CCDS covers only the CDS and only notes whether exon/intron boundaries are the same. In fact, only ~15% of transcripts have an exact match in the other set. Translation between the two is not straightforward.

7 Strategy: Simplify transcript choice for the community by identifying one select transcript model per locus. 100% identity between RefSeq and GENCODE for the select transcript model, to facilitate bidirectional exchange of data. 50% by October, the rest 2018/2019. Selecting one is NOT easy! Community input via survey.

8 Community input Almost 800 responses For this preliminary analysis, ~ 20% Clinical

9 Do you want us to provide one primary transcript? Clinical: Yes 62%, I'm not sure 20%, No 18%. Non-clinical: Yes 48%, I'm not sure 25%, No 27%.

10 In the case of a gene WITHOUT any known clinically relevant variants, which transcript should be the primary? (Options: Longest, Abundant.) Clinical: 50% Abundant, 50% Longest. Non-clinical: 77% Abundant, 23% Longest.

11 In the case of a gene WITH clinically relevant variants, which transcript should be the primary? (Options: Longest, Most variants, Abundant.) Clinical: 15% Historical, 9% Longest, 65% Covers most variants. Non-clinical: 14% Historical, 9% Longest, 42% Abundant, 35% Covers most variants.

12 In the case of a gene WITH clinically relevant variants, which transcript should be the primary? (Options: Overall abundant, Abundant in tissue.) Clinical: 13% Historical, 70% Abundant in tissue, 17% Overall abundant. Non-clinical: 12% Historical, 44% Abundant in tissue, 44% Overall abundant.

13 Summary of preliminary survey results: One transcript. Take into account: coverage of clinical variants, abundance, tissue specificity. Green light!

14 What about LRGs? Isn't this what the LRG Project does? LRG: a manually curated record that contains stable reference sequences for reporting clinically relevant variants.

15 Evolution of the LRG project? Both are produced by the same two groups. In sync as much as possible.
Select: Reference based (GRCh38); Automated selection of transcripts, with; One transcript per gene; Genome-wide; NM and ENST 100% match, UTR to UTR; Reduced upgrading, but allowed when
LRG: Preferred, but non-standard alleles allowed; Manual; Minimal set, more than one allowed; Clinical focus, upon request; NM and ENST 100% match, UTR to UTR; Stable

16 How are we selecting the select?

17 Automation. Ensembl Pipeline: Length; Evidence; Conservation; Representation in UniProt and RefSeq; Coverage of pathogenic variants. RefSeq Select Pipeline: Expression; Conservation; Representation in UniProt and Ensembl; Length.
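
As a rough illustration of how criteria like these could be combined, here is a minimal sketch that ranks candidate transcripts lexicographically. The slide does not say how either pipeline weighs its criteria, so the `Candidate` fields, the `pick_select` helper, the transcript IDs and the priority orders below are all hypothetical, and the caller supplies the order explicitly.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical per-transcript summary; field names are illustrative only."""
    transcript_id: str
    cds_length: int                   # "Length"
    conservation: float               # "Conservation" score, higher is better
    in_uniprot: bool                  # "Representation in UniProt"
    pathogenic_variants_covered: int  # "Coverage of pathogenic variants"


def pick_select(candidates: Sequence[Candidate], priority: Sequence[str]) -> Candidate:
    """Return the candidate that ranks highest when the fields named in
    `priority` are compared in order (first field most important)."""
    if not candidates:
        raise ValueError("no candidate transcripts supplied")
    return max(candidates, key=lambda c: tuple(getattr(c, name) for name in priority))


# Two made-up transcripts of one gene: tx_A is longer, tx_B covers more variants.
tx_a = Candidate("tx_A", cds_length=1200, conservation=0.9,
                 in_uniprot=True, pathogenic_variants_covered=10)
tx_b = Candidate("tx_B", cds_length=900, conservation=0.9,
                 in_uniprot=True, pathogenic_variants_covered=12)

by_length = pick_select([tx_a, tx_b], priority=("cds_length",))
by_variants = pick_select([tx_a, tx_b],
                          priority=("pathogenic_variants_covered", "cds_length"))
```

With these made-up numbers the two priority orders pick different transcripts, which is the "longest may not be the most relevant" point in miniature.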

18 Process: Independent pipelines. Compare outputs. Identify differences, separate into bins by type. Manually review a small number from each bin. Make changes to the pipelines. Compare outputs again.

19 The comparison (Bin: #Genes, %)
Identical: 2811 genes, ~15%
Amber (differences in ends, UTRs):
  Same CDS, same splices: ~
  Same CDS, diff splices: ~ 10%
Red (differences in coding sequence):
  diff CDS, same splices: 67 genes, ~0.35%
  diff CDS, diff splices: 2068 genes, ~11%
Last 10% will be hard: improvements to pipelines and manual curation.
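
The binning step can be sketched as a small comparison function. This is only an illustration under stated assumptions: the `Transcript` structure, the coordinate convention (1-based, inclusive, same strand and assembly) and the `classify_pair` helper are hypothetical and are not taken from either group's pipeline.

```python
from dataclasses import dataclass
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive genomic coordinates


@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal transcript model for comparison purposes."""
    exons: Tuple[Interval, ...]  # sorted by start coordinate
    cds: Tuple[Interval, ...]    # coding segments, sorted by start coordinate


def introns(exons: Tuple[Interval, ...]) -> Tuple[Interval, ...]:
    """Derive intron intervals (the splice structure) from consecutive exons."""
    return tuple((prev_end + 1, next_start - 1)
                 for (_, prev_end), (next_start, _) in zip(exons, exons[1:]))


def classify_pair(a: Transcript, b: Transcript) -> str:
    """Assign a pair of transcripts to one of the five bins named on the slide."""
    if a.exons == b.exons and a.cds == b.cds:
        return "identical"
    cds_part = "same CDS" if a.cds == b.cds else "diff CDS"
    splice_part = "same splices" if introns(a.exons) == introns(b.exons) else "diff splices"
    return f"{cds_part}, {splice_part}"


# Made-up coordinates for illustration.
base = Transcript(exons=((100, 200), (300, 400)), cds=((150, 200), (300, 350)))
longer_ends = Transcript(exons=((90, 200), (300, 420)), cds=((150, 200), (300, 350)))
extra_exon = Transcript(exons=((100, 200), (250, 260), (300, 400)),
                        cds=((150, 200), (300, 350)))
other_start = Transcript(exons=((100, 200), (300, 400)), cds=((160, 200), (300, 350)))
```

Here `longer_ends` differs from `base` only in its transcript ends, so it lands in the "same CDS, same splices" bin, the case that aligning UTRs is meant to resolve.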

20 How are we achieving 100% identity?

21 Aligning the ends (Amber bin): Shared annotation. Agreed to use the same data sets. Jointly defined criteria for starts and ends: "longest strong". For the 5' UTR: built an algorithm to find a start that is strong (common) and that covers as many start positions as possible. Next step: change ends in an automated manner.
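
One possible reading of the "longest strong" rule is sketched below. The slide gives no thresholds or data format, so everything here is an assumption: start-site support is taken as a plain position-to-count mapping (for example, read counts per start position), "strong" is defined as at least `min_fraction` of the best-supported position, and both helper functions are hypothetical.

```python
from typing import Dict


def longest_strong_start(support: Dict[int, int], strand: str = "+",
                         min_fraction: float = 0.5) -> int:
    """Pick the most upstream start position whose support is at least
    `min_fraction` of the best-supported position (assumed definition of 'strong')."""
    if not support:
        raise ValueError("no start-site support supplied")
    peak = max(support.values())
    strong = [pos for pos, count in support.items() if count >= min_fraction * peak]
    # Upstream means a lower coordinate on '+' and a higher coordinate on '-'.
    return min(strong) if strand == "+" else max(strong)


def covered_fraction(support: Dict[int, int], start: int, strand: str = "+") -> float:
    """Fraction of all start-site support lying at or downstream of `start`."""
    total = sum(support.values())
    if strand == "+":
        covered = sum(count for pos, count in support.items() if pos >= start)
    else:
        covered = sum(count for pos, count in support.items() if pos <= start)
    return covered / total


# Made-up support counts at four candidate start positions.
counts = {1000: 2, 1010: 40, 1025: 60, 1040: 5}
chosen = longest_strong_start(counts, strand="+", min_fraction=0.5)
```

In this toy input the best-supported position is 1025, but 1010 also passes the threshold and lies further upstream, so it is chosen and covers more of the observed starts. The weakly supported position at 1000 is ignored.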

22 Next Steps: Goal is to achieve 50% by October (GA4GH and ASHG). Make improvements to the pipelines. Manual review (by both groups) of difficult cases. Making good progress; have been able to resolve discrepancies.

23 Underlying questions and caveats: What if we need more than one transcript? What if we don't have sufficient information to determine transcriptional specificity? What if abundant and clinically relevant are different? What about the fact that for some loci, we don't have good biological information? It's a place to start. Important caveat: use all for interpretation!

24 Case study: FGF2. Changes to pipelines: Ensembl updated the annotation for FGF2 so the transcript is built to select the strongest CAGE peak. RefSeq takes into account RNASeq data. Ensembl-HAVANA regularly review Intropolis data (intron reads). Manual review of bins suggests that RNASeq and Intropolis data correlate. Changes to CARS pipeline to take into account Intropolis data. Asked for RefSeq review of NM_ , with a view to trim it to the strongest peak? RefSeq: "we've decided to create a second RefSeq where the 5' end matches Ensembl and your translation from the well-conserved AUG is represented."

25 Vinita Joardar, Kelly McGarvey, Alex Astashyn, Terence Murphy, Kim Pruitt, Ray Tully, Donna Maglott

26 RPL19: Example of UTR matching. Longest strong. Strongest in p1 promoter.

27 Questions?

28 Example: Transcript 1: originally identified and used for many years. Variants here ignored; assumed to be UTR. Transcript 2: predominant in brain, most clinically relevant. Pathogenic variants associated with disease.