Bioinformatics and Computational Biology. KFC/BIN VI. Genes: prediction and ontology


1 Bioinformatics and Computational Biology, KFC/BIN VI. Genes: prediction and ontology. RNDr. Karel Berka, Ph.D., Univerzita Palackého v Olomouci

2 Gene prediction. A gene is "a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions". An allele is one variant of that gene (compare the colloquial "good genes", "hair color gene"). Gregor Mendel. Prediction: based on the different information content of coding (CDS) and non-coding (UTR) sequences in the genome.

3 Information content: i l-ve you hr-jlka ds

4 The value of genome sequences lies in their annotation. Annotation: characterizing genomic features using computational and experimental methods. Genes: four levels of annotation. Gene prediction: Where are genes? What do they look like? Domains: What do the proteins do? Role: What pathway(s) are they involved in?

5 How many genes does a human have? Consortium: 35,000 genes? Celera: 30,000 genes? Affymetrix: 60,000 human genes on GeneChips? Incyte and HGS: over 120,000 genes? GenBank: 49,000 unique gene coding sequences? UniGene: > 89,000 clusters of unique ESTs?

6 Current consensus (in flux): 20,000 known genes (2010) (similarity to previously isolated genes and expressed sequences from a large variety of different organisms); known in ,333 predicted (RefSeq); problems with prediction algorithms (low effectiveness) (Nature blog 2010)

7 How do we get from here ...

8 ... to here,

9 What are genes? - 1 Complete DNA segments responsible for making functional products. Products: proteins; functional RNA molecules: RNAi (interfering RNA), rRNA (ribosomal RNA), snRNA (small nuclear), snoRNA (small nucleolar), tRNA (transfer RNA)

10 What are genes? - 2 Definition vs. dynamic concept. Consider: prokaryotic vs. eukaryotic gene models; introns/exons; posttranscriptional modifications; alternative splicing; differential expression; genes-in-genes; genes-ad-genes; posttranslational modifications; multi-subunit proteins

11 Prokaryotic gene model: ORF-genes. Small genomes, high gene density: the Haemophilus influenzae genome is 85% genic. Operons: one transcript, many genes. No introns: one gene, one protein. Open reading frames (ORF): one ORF per gene; ORFs begin with a start codon and end with a stop codon. Stop codons (DNA / RNA): TAG / UAG ("amber"), TAA / UAA ("ochre"), TGA / UGA ("opal" or "umber"). Mnemonic: UGA "U Go Away", UAA "U Are Away", UAG "U Are Gone". TIGR:
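
The ORF rule on this slide (start codon, run of triplets, stop codon) can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular ORF detector: it assumes ATG is the only start codon, scans only the forward strand, and the function name `find_orfs` and the `min_codons` parameter are made up for the example.

```python
# Stop codons on the DNA coding strand, as listed on the slide.
STOP_CODONS = {"TAG", "TAA", "TGA"}
START_CODON = "ATG"  # assumption: only the canonical start codon is considered


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the 0-based index of the A in ATG; end is exclusive and
    includes the stop codon. Only the first ATG before each stop codon
    is reported, and ORFs that run off the end without a stop are skipped.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    # Toy sequence: ATG AAA TAG sits in reading frame 2.
    print(find_orfs("CCATGAAATAGGG"))
```

A real prokaryotic gene finder would also scan the reverse complement, allow alternative start codons, and apply a much larger minimum length.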

12 Eukaryotic gene model: spliced genes. Posttranscriptional modification: 5'-cap, poly(A) tail, splicing. Open reading frames: the mature mRNA contains the ORF; all internal exons contain open read-through; pre-start and post-stop sequences are UTRs. Multiple translates: one gene, many proteins via alternative splicing

13 Expansions and clarifications. ORFs: start, triplets, stop. Prokaryotes: gene = ORF. Eukaryotes: spliced genes or ORF genes. Exons: remain after introns have been removed; flanking parts contain non-coding sequence (5'- and 3'-UTRs)

14 Where do genes live? In genomes. Example: the human genome. 3,274,571,503 bp (Ensembl 2010). 25 chromosomes: 1-22, X, Y, mt. 22,333 genes (RefSeq estimate 2010). Gene size: 128 nucleotides (RNA gene) to 2,800 kb (DMD). Ca. 25% of the genome is genes (introns, exons). Ca. 1% of the genome codes for amino acids (CDS). 30 kb gene length (average). 1.4 kb ORF length (average). 3 transcripts per gene (average)

15 Genomic sequence features. Repeats: transposable elements, simple repeats; RepeatMasker. Genes: vary in density, length, structure; identification depends on evidence and methods and may require concerted application of bioinformatics methods and lab research. Pseudogenes: look-alikes of genes that obstruct gene-finding efforts. Non-coding RNAs (ncRNA): tRNA, rRNA, snRNA, snoRNA, miRNA; tRNAscan-SE, COVE

16 Gene identification. Homology-based gene prediction: similarity searches (e.g. BLAST, BLAT); genome browsers; RNA evidence (ESTs, expressed sequence tags from cDNA). Ab initio gene prediction: gene prediction programs. Prokaryotes: ORF identification. Eukaryotes: promoter prediction; poly(A)-signal prediction; splice site and start/stop-codon predictions

17 Gene prediction through comparative genomics. Highly similar (conserved) regions between two genomes are useful, or else they would have diverged. If the genomes are too closely related, all regions are similar, not just genes. If the genomes are too far apart, analogous regions may be too dissimilar to be found.

18 Genome browsers: Generic Genome Browser (CSHL); NCBI Map Viewer; Ensembl Genome Browser; UCSC Genome Browser genome.ucsc.edu/cgi-bin/hggateway?org=human; Apollo Genome Browser

19 Gene discovery using ESTs. Expressed Sequence Tags (ESTs) represent sequences from expressed genes. If a region matches an EST with high consensus, then the region is probably a gene or pseudogene. An EST overlapping an exon boundary gives an accurate prediction of that boundary.

20 Ab initio gene prediction. Prokaryotes: ORF detectors. Eukaryotes: position, extent and direction through promoter and poly(A)-signal predictors; structure through splice site predictors; exact location of coding sequences through determination of relationships between potential start codons, splice sites, ORFs, and stop codons

21 How it works I: ORF (SWF animation)

22 How it works II: motif identification. Exon-intron borders = splice sites; the conserved intron ends GT (start) and AG (end) are shown in uppercase.
  Exon         Intron                          Exon
~~gaggcatcag GTttgtagac~~~~~~~~~~~tgtgtttcAG tgcacccact~~
~~ccgccgctga GTgagccgtg~~~~~~~~~~~tctattctAG gacgcgcggg~~
~~tgtgaattag GTaagaggtt~~~~~~~~~~~atatctccAG atggagatca~~
~~ccatgaggag GTgagtgcca~~~~~~~~~~~ttatttccAG gtatgagacg~~
             Splice site                  Splice site
Motif Extraction Programs at
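
The GT...AG pattern in the alignment can be turned into a naive candidate scan. The sketch below is only an illustration of the motif idea, under the assumption that every GT is a possible donor site and every AG a possible acceptor site; the function names and the `min_len` threshold are hypothetical. It will heavily overpredict, which is why real splice site predictors score the surrounding sequence context as well.

```python
def candidate_splice_sites(seq):
    """Return (donors, acceptors): 0-based positions of every GT and AG."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


def candidate_introns(seq, min_len=4):
    """Pair each GT with each downstream AG.

    Returns (start, end) with end exclusive, so seq[start:end] begins
    with GT and ends with AG. min_len is an arbitrary toy threshold.
    """
    donors, acceptors = candidate_splice_sites(seq)
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]


if __name__ == "__main__":
    toy = "AAGTCCCCAGTT"
    for start, end in candidate_introns(toy):
        print(start, end, toy[start:end])
```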

23 How it works III: the (ugly) truth

24 Gene prediction programs. Homology-based programs: use BLAST-like searches. Examples: Exofish, CRITICA. Rule-based programs: use an explicit set of rules to make decisions. Example: GeneFinder. Neural network-based programs: use a data set to build rules. Examples: Grail, GrailEXP, Genemark. Hidden Markov Model-based programs: use probabilities of states and transitions between these states to predict features. Examples: Genscan, GenomeScan

25 Tools. ORF detectors: NCBI. Promoter predictors: CSHL; BDGP (fruitfly.org/seq_tools/promoter.html); ICG (TATA-Box predictor). PolyA signal predictors: CSHL (argon.cshl.org/tabaska/polyadq_form.html). Splice site predictors: BDGP. Start-/stop-codon identifiers: DNALC (Translator/ORF-Finder); BCM (Searchlauncher)

26 CRITICA: prediction of prokaryotic genes. Searches for the RBS (ribosomal binding site, Shine-Dalgarno sequence). Principle: run TBLASTP against a protein database and select the clearly coding parts (usually only parts of genes); calculate a statistical model; predict genes; then build a new statistical model from the predictions and predict again, and so on iteratively.

27 Genscan: prediction of eukaryotic genes. Different statistical models for the first and last exon. Searches for promoters, terminators, and the poly(A) signal. Different statistical parameters for different GC content. www:

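28 Genscan: exon prediction accuracy by probability score. Table columns: probability, exons, exactly, partially, overlaps, error. Only the last three columns survive for each row (partially / overlaps / error): 0.9% / 0.0% / 1.4%; 3.4% / 0.2% / 4.0%; 6.1% / 0.4% / 5.7%; 16.0% / 1.2% / 8.0%; 26.2% / 2.2% / 17.4%; 27.8% / 4.0% / 38.3%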

29 Genscan - example
GENSCAN 1.0 Date run: 31-Oct-100 Time: 15:54:20
Sequence HERV17_ : bp : 37.79% C+G : Isochore 1 ( C+G%)
Parameter matrix: HumanIso.smat
Predicted genes/exons:
Gn.Ex Type S.Begin...End.Len Fr Ph I/Ac Do/T CodRg P... Tscr
Feature types listed, in order: Init, Term, PlyA, Prom, Init, Term, PlyA, PlyA, Term, Intr

30 Genscan - example

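31 Prediction quality. Reality vs. prediction: TP = true positives, FP = false positives, TN = true negatives, FN = false negatives; RP = TP + FN (real positives), RN = TN + FP (real negatives), PP = TP + FP (predicted positives), PN = TN + FN (predicted negatives). Sensitivity = TP / (TP + FN): how many genes were found out of all present? Specificity = TP / (TP + FP): how many predicted genes are indeed genes? Correlation coefficient: $CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{PP \cdot PN \cdot RP \cdot RN}}$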
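
As a worked illustration of these measures, the sketch below computes sensitivity, specificity (in the gene-prediction sense used on this slide, TP / (TP + FP)), and the standard correlation coefficient (TP·TN − FP·FN) / sqrt(PP·PN·RP·RN) from the four counts. The function name and the example counts are made up for the illustration.

```python
import math


def prediction_quality(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient)."""
    rp, rn = tp + fn, tn + fp   # real positives / real negatives
    pp, pn = tp + fp, tn + fn   # predicted positives / predicted negatives
    nan = float("nan")
    sn = tp / rp if rp else nan
    sp = tp / pp if pp else nan  # slide's definition: TP / (TP + FP)
    denom = math.sqrt(pp * pn * rp * rn)
    cc = (tp * tn - fp * fn) / denom if denom else nan
    return sn, sp, cc


if __name__ == "__main__":
    # Hypothetical nucleotide-level counts, chosen only for the arithmetic.
    sn, sp, cc = prediction_quality(tp=8, fp=2, tn=88, fn=2)
    print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.3f}")
```

With these toy counts, RP = PP = 10 and RN = PN = 90, so the denominator is 900 and CC = (704 − 4) / 900 = 7/9.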

32 Gene prediction accuracies. Nucleotide level: 95% Sn, 90% Sp (lows less than 50%). Exon level: 75% Sn, 68% Sp (lows less than 30%). Gene level: 40% Sn, 30% Sp (lows less than 10%). Programs that combine statistical evaluations with similarity searches are the most powerful.

33 Common difficulties. First and last exons are difficult to annotate because they contain UTRs. Smaller genes are not statistically significant, so they are thrown out. Algorithms are trained with sequences from known genes, which biases them against genes about which nothing is known. Masking repeats frequently removes potentially indicative chunks from the untranslated regions of genes that contain repetitive elements.

34 The annotation pipeline. Mask repeats using RepeatMasker. Run the sequence through several programs. Take the predicted genes and do a similarity search against ESTs and genes from other organisms. Do a similarity search for non-coding sequences to find ncRNA.

35 Annotation nomenclature. Known gene: predicted gene matches the entire length of a known gene. Putative gene: predicted gene contains a region conserved with a known gene; also referred to as "like" or "similar to". Unknown gene: predicted gene matches a gene or EST of which the function is not known. Hypothetical gene: predicted gene that does not contain significant similarity to any known gene or EST.

36 Luis Tari Gene Ontology (GO) URL: Gene Ontology is a hierarchy of roles of genes and gene products, independent of any organism. It is composed of three independent ontologies: molecular function, biological process, cellular component. GO itself does not contain any information on genes or gene products.

37 Gene Ontology. Developed by an international consortium of about 50 members. Editorial office with 4 full-time editors (ish). Many other part-time editors at databases. Multiple changes are made each day and go live immediately.

38 Evolution of GO. GO development was traditionally annotation-driven: development directed by use. Terms were added as new species were annotated, on an as-needed basis. This resulted in an organic structure with little formality. Ontological formality, philosophical and logical, was added subsequently.

39 Growth of GO. [Figure: GO term history, Jan-01 to Jan-07; x-axis: date; series: obsolete terms, undefined terms, defined terms]

40 GO annotations nnotations.shtml Curators record their findings about genes as annotations, using GO terms, for various organisms (about 20 of them). Annotations carry different kinds of evidence codes. Annotations with the IEA evidence code (inferred from electronic annotation) are not manually verified and are the least reliable.

41 Structure of GO relationships

42 GO Molecular Function Ontology. Describes activities, such as catalytic or binding activities, that can be performed by individual gene products or assembled complexes of gene products at the molecular level. Example of an activity: transporter activity, genes that enable the directed movement of substances (such as macromolecules, small molecules, ions) into, out of, within or between cells. Example of binding: insulin receptor binding, genes that interact with insulin receptors.

43 GO Biological Process Ontology. Defined as a biological objective to which the gene or gene product contributes. Examples: cell proliferation, genes that are responsible for the multiplication or reproduction of cells, resulting in the rapid expansion of a cell population; learning/memory, genes involved in the acquisition and processing of information and/or the storage and retrieval of this information over time.

44 GO Cellular Component Ontology. Refers to the place in the cell where the gene product is active. Examples: bud, nucleus, cell membrane.

45 GO: an example showing a partial hierarchy of the Gene Ontology that involves the term apoptosis. Snapshot taken from the TGen GOBrowser.
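
A partial hierarchy like the one in the snapshot can be modeled as a directed acyclic graph in which each term points to its parent terms. The sketch below is a toy illustration only: the term names and parent links are a simplified, hypothetical fragment and not a copy of the real ontology, and the `ancestors` helper is made up for the example. It shows why a gene annotated with a specific term is implicitly associated with all of that term's more general ancestors.

```python
# Hypothetical, simplified fragment of a term hierarchy (child -> parents).
# A term may have more than one parent, so this is a DAG, not a tree.
PARENTS = {
    "apoptosis": ["programmed cell death"],
    "programmed cell death": ["cell death"],
    "cell death": ["cellular process", "death"],
    "cellular process": ["biological_process"],
    "death": ["biological_process"],
    "biological_process": [],
}


def ancestors(term, parents=PARENTS):
    """Return the set of all terms reachable by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("apoptosis")))
```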

46 Example of a gene product. A gene product has one or more molecular functions and is used in one or more biological processes; it might be associated with one or more cellular components. An example showing all occurrences of SODC in the Gene Ontology from the human annotation.

47 Common applications of GO. Analysis of microarray data: finding genes with similar functions; utilizes the biological process ontology. Evaluation of protein-protein interactions: proteins are likely to interact if they are in the same location; utilizes the cellular component ontology.
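
One simple way to make "genes with similar functions" or "proteins in the same location" concrete is to compare the sets of GO terms attached to two gene products. The sketch below uses the Jaccard index as the similarity measure; that choice is an assumption made for illustration, not something the slides prescribe, and the gene names and annotations are invented.

```python
# Hypothetical cellular component annotations (gene -> set of terms).
ANNOTATIONS = {
    "geneA": {"nucleus", "cytoplasm"},
    "geneB": {"nucleus"},
    "geneC": {"cell membrane"},
}


def shared_terms(annotations, gene_a, gene_b):
    """Terms annotated to both gene products."""
    return annotations[gene_a] & annotations[gene_b]


def jaccard(annotations, gene_a, gene_b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    a, b = annotations[gene_a], annotations[gene_b]
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    print(shared_terms(ANNOTATIONS, "geneA", "geneB"))
    print(jaccard(ANNOTATIONS, "geneA", "geneB"))
    print(jaccard(ANNOTATIONS, "geneA", "geneC"))
```

In this toy data geneA and geneB share a location and would be candidates for interaction, while geneA and geneC share none.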

48 Extension to ontology? We know that APOE is involved in Alzheimer's disease. Based on the Gene Ontology annotation, APOE is involved in the "learning and/or memory" biological process. If we ask "is the gene APOE related to Alzheimer's disease?", the answer is yes, because APOE is known to be involved in learning and/or memory. BUT there is NO ontology that says learning and/or memory can influence Alzheimer's disease. Degradation of the ubiquitin cycle can cause extra long/short half-life of genes. Extra long/short half-life of genes can cause cancer.

49 Credits: tations/hhmi_2003/2003_3.ppt; Paces and Vondrasek, Bioinformatics course, UK