Evolutionary Genetics. LV Lecture with exercises 6KP. Databases

1 Evolutionary Genetics LV Lecture with exercises 6KP Databases HS2018

2 Bioinformatics - R: R Assignment. The Minimalistic Approach

3 Bioinformatics - R
Possible Exam Questions for R:
Q1: The function below does not work when I call temp(10). Do you know why?
    Temp <- function(c) {
      K = C
      print(paste(c, "c ->", K, "K"))
    }
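
R is case-sensitive, which explains the failure in Q1: the function is defined as Temp but called as temp, and its body reads C although the argument is named c. The following is a minimal corrected sketch, not the official solution. The Celsius-to-Kelvin offset of 273.15 is an assumption, since the slide shows only K = C.

```r
# Corrected sketch: function name and argument are spelled consistently.
temp <- function(c) {
  K <- c + 273.15  # assumed Celsius-to-Kelvin conversion; not shown on the slide
  paste(c, "C ->", K, "K")
}

print(temp(10))
```

Returning the string instead of printing it inside the function makes the result easy to reuse or check.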

5 Bioinformatics - R
    f <- function(x) {
      a <- sqrt(x)
      b <- x^2
      print(c(a, b))
    }
    > f(-4)
    ?

6 Bioinformatics - R
    f <- function(x) {
      a <- sqrt(x)
      b <- x^2
      print(c(a, b))
    }
    > f(-4)
    [1] NaN  16
    Warning message:
    In sqrt(x) : NaNs produced
HS17 UniBas JCW

7 Bioinformatics - R
    f <- function(x) {
      a <- sqrt(ifelse(x >= 0, x, NA))
      b <- x^2
      print(c(a, b))
    }
    > f(-4)
    ?

8 Bioinformatics - R
    f <- function(x) {
      a <- sqrt(ifelse(x >= 0, x, NA))
      b <- x^2
      print(c(a, b))
    }
    > f(-4)
    [1] NA 16

9 Bioinformatics - R
    a <- 5:9
    if (a <= 6) a*2 else a-1

    a=5 -> 5 * 2 = 10
    a=6 -> 6 * 2 = 12
    a=7 -> 7 * 2 = 14
    a=8 -> 8 * 2 = 16
    a=9 -> 9 * 2 = 18

    [1] 10 12 14 16 18
    Warning message:
    In if (a <= 6) a * 2 else a - 1 :
      the condition has length > 1 and only the first element will be used
    Note: if() tests only the first element of a, so every element is doubled. From R 4.2.0 on, a condition of length > 1 is an error instead of a warning.

10 Bioinformatics - R
    for (i in 5:9) print(if (i <= 6) i*2 else i-1)

    i=5 -> 5 * 2 = 10
    i=6 -> 6 * 2 = 12
    i=7 -> 7 - 1 = 6
    i=8 -> 8 - 1 = 7
    i=9 -> 9 - 1 = 8

    [1] 10
    [1] 12
    [1] 6
    [1] 7
    [1] 8

11 Bioinformatics - R
    for (i in 5:9) cat((if (i <= 6) i*2 else i-1), " ")

    10  12  6  7  8
    Note: cat() writes all values on one line and does not print the [1] index.
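
Because if() accepts a single condition, element-wise decisions on a whole vector are usually written with the vectorized ifelse(). The sketch below reuses the values 5:9 from the slides; the variable name result is chosen for illustration only.

```r
# Vectorized alternative to looping over 5:9 with if/else.
a <- 5:9
result <- ifelse(a <= 6, a * 2, a - 1)  # one decision per element
print(result)
```

This gives the same five values as the for loop, in a single vector and without a warning.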

12 What is data? What is a database?

13 Size, Accessibility, Manageability, Redundancy, Accuracy, Security

14 A database is an organized collection of data. Databases are created to handle large quantities of information by inputting, storing, retrieving, and managing that information. A database management system (DBMS) is a suite of computer software providing the interface between users and a database or databases.
    DBMS types: navigational (e.g. hierarchical, network), relational (e.g. MySQL), object-oriented.

15 Example table
    ID  Name   Surname
    1   Ferdy  Kübler
    2   Eddy   Merckx
    3   Lance  Armstrong
    4   Andy   Schleck
    5   Frank  Schleck

16 The same table next to a second table with the columns Rank and ID. In the first table, ID is the primary key (unique).

17 In the second table, ID is a foreign key, which is not (necessarily) unique.

18 Table relationship(s): linking Rank and ID to the first table gives the ranking
    1. Ferdy Kübler (Sui)
    2. Andy Schleck (Lux)
    3. Eddy Merckx (Bel)
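
The primary-key and foreign-key relationship can be tried out in R with two data frames and merge(). This is a minimal sketch: the riders table is taken from the slides, while the Rank/ID pairs in ranking are an assumption inferred from the ranking list on slide 18, and the names riders, ranking and joined are hypothetical.

```r
# Table 1: ID is the primary key (each value occurs exactly once).
riders <- data.frame(
  ID = 1:5,
  Name = c("Ferdy", "Eddy", "Lance", "Andy", "Frank"),
  Surname = c("Kübler", "Merckx", "Armstrong", "Schleck", "Schleck"),
  stringsAsFactors = FALSE
)

# Table 2: ID is a foreign key pointing into riders (assumed values).
ranking <- data.frame(Rank = 1:3, ID = c(1L, 4L, 3L))

# A primary key must be unique; anyDuplicated() returns 0 if it is.
stopifnot(anyDuplicated(riders$ID) == 0)

# Join the two tables on the shared ID column and sort by rank.
joined <- merge(ranking, riders, by = "ID")
joined <- joined[order(joined$Rank), ]
print(joined)
```

Storing the riders once and referring to them by ID avoids repeating names in every table that mentions them, which addresses the redundancy point raised earlier.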

19 Biological Sequence Data
    >CG40218 Yeti (isoform A)
    ATGAACTCACAAAAAGAATACGTATCGGACTGCGAAACCGACGATGATTATTATGTCGATTTGTTAACTT
    CAGGCAAGGGCAGTGATAAGAGTGAAAGTGATGTGTCGGACAAGTCTGAAAATTATCCAGGCCTAAAATC
    AAAGCATACTGCGAAGGCATTGCGGAAAACAAGGCATTGTGACGGCGATAATAGGGAATACAGGTCTAAG
    GAGTGCGACGACCTTCATTCCGAAGAGGAGTCTGAAAAATCGCGGTCGGATGCTTTATGGGCCGATTTTC
    TTGGCGACATTGATACTAAAAGCGTAATCAACCAAAAAACAGATTATACGGAGGGAAACGCAGCAAGTGC
    TACCAATACCAATACGCATGAGACTTGTAATAAATATGATAAAAACGATACGGCAATAATAAAAACTGCA
    CAGCAATACGATTCCAAAAGAACCACGCTTTCAGTTTCCACACTCGGAAAAATTAAACGATCATCCGCTG
    AAAAGAGTATCGGTACCATGATAAATAAATTTGAAAAGAAGAAAAAATTGACAGTGCTTGAAAGGTCACA
    ATTGGATTGGAAAATATTTAAACAAGACGAAGGCATAGACGAACTTCTGTGCTCGCATAACAAAGGCAAG
    GACGGGTATTTGGACCGTCAAGACTTTTTGGAGAGAACCGATCTTAGGCAGTTTGAAATGGAAAAGAAGT
    TGCGGCTGTCTCGCAGGCCATACTAA
    *Yeti is a heterochromatin gene involved in e.g. chromatin organization in Drosophila melanogaster.
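
A sequence record in this layout (a header line starting with ">" followed by sequence lines) is in FASTA format and can be read with a few lines of base R. The sketch below uses a short made-up record rather than the Yeti sequence; the record and all variable names are hypothetical.

```r
# A toy FASTA record: one header line, then the sequence split over lines.
fasta <- c(">toy_record hypothetical example",
           "ATGAACTCAC",
           "AAAAAGTAA")

header   <- sub("^>", "", fasta[1])         # drop the leading ">"
sequence <- paste(fasta[-1], collapse = "") # join the sequence lines
bases    <- strsplit(sequence, "")[[1]]     # one character per base

seq_length <- nchar(sequence)
gc_content <- mean(bases %in% c("G", "C"))  # fraction of G and C

cat(header, "\n")
cat("length:", seq_length, " GC fraction:", round(gc_content, 3), "\n")
```

For real files, readLines() would supply the fasta vector; dedicated packages exist for this, but none is assumed here.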

21 [Figure: files of different types: txt, jpg, tre, xml]

22 nt/nr, PubMed, Books, SNP, CDD, ...

23 Primary Databases
    GenBank (National Center for Biotechnology Information, NCBI): the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
    ENA: the European Nucleotide Archive captures and presents information relating to experimental workflows that are based around nucleotide sequencing.
    DDBJ: the DNA Data Bank of Japan.
    INSDC: the International Nucleotide Sequence Databases (INSD) have been developed and maintained collaboratively between DDBJ, ENA, and GenBank for over 18 years. insdc.org/
HS18 HS16 UniBas JCW

24 Primary databases: users submit and retrieve data; the error rate is unknown.
    Secondary databases: users only retrieve data; the error rate is lower.

25 Secondary Databases
    Ensembl: a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute to develop a software system that produces and maintains automatic annotation on selected eukaryotic genomes.
    UCSC Genome Browser: this site contains the reference sequence and working draft assemblies for a large collection of genomes. It also provides portals to the ENCODE and Neanderthal projects.

26 Question Exercises

27 Improve the following search terms:
    a) Martin Feder heat shock proteins fruit fly
    b) Mouse and human trf1 DNA sequences
    c) Number of DNA polymerase (Pol) proteins in Firmicutes

28 Self-Study Guide

33 Database: {All Databases}. Search term / query: Daphnia

38 [Screenshot labels: Filter, Hits, Extensions]

42 Search Term: Daphnia magna
    Search Term: "Daphnia magna" [Organism]
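
The slides run these searches in the NCBI web interface. The same field-restricted term can also be sent to the NCBI E-utilities esearch service; the sketch below only builds the request URL and does not contact the server. Using E-utilities at all, the choice of the nuccore database, and the variable names are assumptions made for illustration.

```r
# Field-restricted search term from the slide.
term <- '"Daphnia magna"[Organism]'

# Percent-encode quotes, spaces and brackets so the term is URL-safe.
encoded <- utils::URLencode(term, reserved = TRUE)

# E-utilities esearch endpoint; db and term are its query parameters.
base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
url  <- paste0(base, "?db=nuccore&term=", encoded)

print(url)
```

The quotes keep the two words together as a phrase and [Organism] restricts the match to the organism field, so records that merely mention Daphnia magna elsewhere are excluded.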

46 An important step in the analysis of genome information is deciphering the complete coding potential, that is, the protein coding sequence (CDS) region of each gene. A CDS is a sequence of nucleotides that corresponds to the sequence of amino acids in a protein. A typical CDS starts with ATG and ends with a stop codon. A CDS can be a subset of an open reading frame (ORF). In eukaryotes, prediction of CDS regions in genomic sequence is complicated by the low percentage of the genome devoted to CDS and by the interruption of CDS regions by introns. It is not possible at present to predict from genomic sequence the correct distribution of CDS regions that appear in the proteins expressed from a genome. To obtain information about the portion of the mammalian genome that is translated into protein, the mature messengers of the genome's coding potential (full-length mRNAs) must be sampled.
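
The description of a typical CDS (starts with ATG, ends with a stop codon) translates directly into a simple check. The sketch below is illustrative only: it assumes the three stop codons of the standard genetic code, additionally requires a length divisible by three and no internal stop codon, and the function name is hypothetical. It does not attempt gene prediction.

```r
# Returns TRUE if a DNA string looks like a typical CDS.
is_typical_cds <- function(dna) {
  dna <- toupper(dna)
  n <- nchar(dna)
  if (n < 6 || n %% 3 != 0) return(FALSE)  # need whole codons, at least two

  # Cut the sequence into consecutive codons of three bases.
  codons <- substring(dna, seq(1, n - 2, by = 3), seq(3, n, by = 3))
  stops  <- c("TAA", "TAG", "TGA")         # standard genetic code (assumed)
  last   <- length(codons)

  codons[1] == "ATG" &&                    # start codon
    codons[last] %in% stops &&             # terminal stop codon
    !any(codons[-last] %in% stops)         # no premature stop
}

print(is_typical_cds("ATGAAATAA"))     # start, one codon, stop
print(is_typical_cds("ATGTAAAAATAA"))  # contains an internal stop
```

A check like this works on spliced mRNA or CDS records; on eukaryotic genomic sequence it fails as soon as introns interrupt the reading frame, which is the difficulty the paragraph describes.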

54 Search Term: NC_0000*[Accession] AND Human[Organism]

56 Query Translation: the wildcard is expanded into one [Accession] term per matching accession, joined with OR:
    (nc…[Accession] OR nc…[Accession] OR … OR nc…[Accession]) AND "Homo sapiens"[organism]

57 A wildcard is a character that may be substituted for any of a defined subset of all possible characters.
    Wildcard                 Example    Matches            Meaning
    *   (star, asterisk)     col*r      color, colour      matches zero or more characters
    ?   (question mark)      123?       1231, 123A, ...    matches exactly one character
    [ ] (square brackets)    123[A-B]   123A, 123B         matches a single character within the range
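
The three wildcard types can be tried out in R. The sketch below uses glob2rx() from the utils package to turn the * and ? patterns into regular expressions, and writes the bracket range directly as a regular expression. It illustrates wildcard matching in general; which wildcards a given database search field accepts is not shown here, and the test words are made up.

```r
# * matches zero or more characters.
words <- c("color", "colour", "cold")
star  <- grepl(utils::glob2rx("col*r"), words)
print(setNames(star, words))

# ? matches exactly one character.
ids      <- c("1231", "123A", "123", "12345")
question <- grepl(utils::glob2rx("123?"), ids)
print(setNames(question, ids))

# [A-B] matches a single character within the range.
codes    <- c("123A", "123B", "123C")
brackets <- grepl("^123[A-B]$", codes)
print(setNames(brackets, codes))
```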

65 Search Term X01714: Organism? Size (bp)? Get nucleotide and protein sequences.
    Search Term U90223: Organism? Molecule type? Where was this sequence published?
    Search Term AF: Organism? Molecule type? Find the entire gene.
    Search Term NC_: Organism? Molecule type? Find COX3 protein region?

66 Search Term: human [organism] AND dUTPase [Protein name]
    Search Term: human [organism] AND dUTP pyrophosphatase [Title]
    What is the difference between the two searches? Can you extend the search to find all possible copies in the human genome?