Databases in genomics

1 Databases in genomics

2 Search in biological databases
The most common task of a molecular biology researcher is to answer the following questions:
- Have new sequences from my favorite species been deposited in biological databases?
- Do these sequences contain coding genes?
- Do these genes belong to known gene families? What kind of proteins do they encode?
- Are there homologous genes in other species?
- Do these sequences contain conserved non-coding sequences or repeats?
Search by keyword or by similarity.
Known software: Smith-Waterman, FASTA and BLAST.

3 Databases: the biological sequence databases
A biological sequence database is a collection of data that is:
- Structured
- Indexed (table of contents)
- Updated daily
- Cross-referenced with other databases
Two big categories of sequence databases:
- Non-specialized (universal): GenBank, EMBL, DDBJ, SwissProt, PIR
- Specialized: PDB, ProSite, BLOCKS, Pfam, Swiss-3Dimage, ...

4 Databases: the main sequence databases
Universal databases
- NCBI, National Center for Biotechnology Information (http://)
  - GenBank: annotated collection of all publicly available DNA sequences. Works with EMBL and DDBJ. Stores sequences from 100,000 species.
  - Official site of BLAST.
  - PubMed: search engine accessing primarily the MEDLINE database of references
  - COGs/KOGs: clusters of orthologous genes
- EMBL, the European Molecular Biology Laboratory (http://)
- ExPASy, Expert Protein Analysis System, proteomics (http://)
  - UniProtKB/Swiss-Prot: protein sequences manually annotated by experts
  - UniProtKB/TrEMBL: unreviewed, automatically annotated protein sequences
  - PROSITE: protein domains and families
Specialized databases (specific organisms)
- FlyBase: http://flybase.org/
- SGD: http://
- TAIR: http://
- CuGenDB, Cucurbit Genome Database: http://
- Coffee genome database: http://coffee-genome.org

5 Databases: exponential growth of GenBank (http://)
- As of Aug 2015, GenBank release: ...,823,644,287 bases and 187,066,846 reported sequences (doubling approximately every 18 months)
- In addition, WGS data in release 209: 1,163,275,601,001 bases and 302,955,543 reported sequences
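The "doubling approximately every 18 months" rule can be turned into simple arithmetic. The sketch below is only an illustration of that rule, assuming the doubling time stays constant; the function names are hypothetical and the result is not a forecast of actual GenBank content.

```python
# Assumption taken from the slide: content doubles roughly every 18 months.
DOUBLING_MONTHS = 18


def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Multiplicative growth after `months`, for a constant doubling time."""
    return 2 ** (months / doubling_months)


def projected_count(current, months, doubling_months=DOUBLING_MONTHS):
    """Scale a current count by the growth factor and round to an integer."""
    return round(current * growth_factor(months, doubling_months))


if __name__ == "__main__":
    sequences_aug_2015 = 187_066_846  # sequence count quoted on the slide
    for months in (18, 36):
        print(months, "months:", projected_count(sequences_aug_2015, months))
```

Under this assumption, three years correspond to two doublings, so the count is multiplied by four.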

6 Databases: NCBI (http://)

7 Databases: GenBank (http://)

8 Databases: GenBank database divisions (http:// http:// #GenBank_ASM)

9 Data access
Easy-to-use query interface. Description of the search keys: http:// samplerecord.html#locusb
- Go to GenBank (http://)
- Use the GenBank search fields
- Enter search keys such as Coffea canephora [ORGN] and a genomic sequence length range of 100 bp to 1000 bp: 100:1000 [SLEN]

GenBank record field    Search field
Locus name              [ACCN]
Sequence length         [SLEN]
Molecule type           [PROP]
GenBank division        [PROP]
Modification date       [MDAT]
Definition              [TITL]
Accession               [ACCN]
Version                 All fields
GI                      All fields
Keywords                [KYWD]
Source                  [ORGN]
Organism                [ORGN]
Reference               [TITL] [AUTH] [JOUR]
Features                [FKEY]
CDS                     [FKEY]
gene                    [FKEY]
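The search keys above can be combined into a single query string. The following is a minimal sketch of how such a string could be assembled; the helper name `build_genbank_query` is hypothetical, and only the [ORGN] and [SLEN] tags and the `min:max` range syntax come from the slide. It builds text only and does not contact NCBI.

```python
def build_genbank_query(organism, min_len=None, max_len=None):
    """Compose a keyword query from an organism name and a length range.

    Hypothetical helper: it only concatenates the field tags shown on the
    slide ([ORGN], [SLEN]) with the boolean operator AND.
    """
    terms = [f"{organism}[ORGN]"]
    if min_len is not None and max_len is not None:
        if min_len > max_len:
            raise ValueError("min_len must not exceed max_len")
        terms.append(f"{min_len}:{max_len}[SLEN]")
    return " AND ".join(terms)


if __name__ == "__main__":
    # Example from the slide: Coffea canephora, 100 bp to 1000 bp
    print(build_genbank_query("Coffea canephora", 100, 1000))
```

The resulting string can be pasted into the GenBank search box.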

10 By keyword
Easy-to-use query interface, e.g.:
- A search for all nucleotide sequences from Coffea canephora (organism) with a sequence length between 1000 bp and ... gave 16 results
- Search for all nucleotide sequences from Coffea arabica (organism) with a sequence length between 1000 bp and ...: how many results?

11 By keyword
Easy-to-use query interface, e.g. a search for all nucleotide sequences from Coffea canephora (organism) with a sequence length between 1000 bp and ... gave 16 results. From the results page:
- Download sequences in various formats
- Taxonomy
- Publication

12 Make your own with GenBank and SRS!
- Go to EMBL SRS (http://) -> Query builder
- Select EMBL (nucleotide database)
- Search all genomic DNA > 1 kb for Coffea canephora. How many sequences?
- Download the sequences to create a database of Coffea sequences

13 Databases Type: Coffea canephora. Nucleotide, GSS, BioProject, BioSamples

14 Databases: structure of a file in the GenBank database (EU), FASTA format
- Unique ID
- Diverse data
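A FASTA file is simple enough to read with a few lines of code: each record starts with a ">" header line carrying the unique ID and free-text data, followed by the sequence. The sketch below is a minimal parser written for illustration; the function names are hypothetical and the record in the usage example is made up, not a real GenBank entry.

```python
def _make_record(header, chunks):
    """Split a header into ID and description, and join the sequence lines."""
    seq_id, _, description = header.partition(" ")
    return {"id": seq_id, "description": description, "sequence": "".join(chunks)}


def parse_fasta(text):
    """Return a list of records (dicts) from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


if __name__ == "__main__":
    # Made-up record for illustration only
    example = ">EX000001.1 hypothetical example record\nACGTAC\nGTTGCA\n"
    for record in parse_fasta(example):
        print(record["id"], len(record["sequence"]), "bp")
```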

15 Databases http://
Structure of one file in the GenBank database (EU):
- Identification of the sequence
- Unique ID number
- Taxonomic data
- References
- Annotations
- Cross-references with other databases
- Keywords

16 Databases Type: Coffea canephora. Nucleotide, GSS, BioProject, BioSamples
In BioSamples: download 2 BAC-end libraries for C. canephora

17 Databases Type: rice
SRA: Sequence Read Archive, with reads from:
- Roche 454
- Illumina
- SOLiD
- Pacific Biosciences
Example: SRX510191: Low-coverage sequencing of ASWINA 330

18 Databases: RefSeq (http://) Stable information!

19 By similarity
Why?
- To know whether my favorite sequences are similar to others already known
- To find all sequences from the same family
- To search for all sequences with a specific pattern
Tools
- Large datasets have to be analyzed
- Regular sequence alignment software is unusable at this scale
- Use the BLAST algorithm, a heuristic algorithm: it produces a quick solution when regular/classic programs are too slow, but the results are approximate

20 By similarity: alignment of 2 sequences
Common approach to compare 2 sequences: counting the number of differences (insertions, deletions, substitutions).
- Global alignment (Needleman & Wunsch, 1970): Protein A and Protein B aligned over their whole length
- Local alignment (Smith & Waterman, 1981; FASTA, 1988; BLAST, 1990): a domain shared by Protein A and Protein B, or an mRNA aligned to a gene
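Counting differences between two sequences can be illustrated with the classic edit distance, computed by dynamic programming. This is a simplified sketch with unit costs for insertion, deletion and substitution; it is not the Needleman-Wunsch or Smith-Waterman scoring scheme itself, which use similarity scores and gap penalties, but it rests on the same table-filling idea. The function name is hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1       # drop char_a
            insertion = current[j - 1] + 1   # add char_b
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))  # one deletion separates the two
```

Only two rows of the table are kept in memory, which is enough when the distance alone is needed and not the alignment itself.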

21 BLAST: Basic Local Alignment Search Tool
- Searches for regions rich in similarity, without insertions/deletions.
- A word length is chosen: w = 2 or 3 amino acids for proteins.
- The query sequence is hashed into words of size w.
- For each query word m, a list of neighboring words of length w is built, keeping those whose score against m exceeds a fixed threshold T.
- Each word similar to m is compared with every word of size w taken from each sequence B_i of the database.
- When a word of a sequence B_i is identical to a word in the neighbor list, a hit is recorded.
- For each hit, the program extends the alignment in both directions without gaps.
- The extension stops when the score of the extended word drops by more than a fixed threshold X.
- Segments whose similarity score exceeds a fixed threshold score S are retained (High-scoring Segment Pairs = HSPs).
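The seed-and-extend idea on this slide can be sketched in a few functions. This is a deliberately simplified toy, not the BLAST implementation: it works on nucleotide strings, uses exact word matches as seeds instead of a neighbor list filtered by a threshold T, and scores +1 per match and -1 per mismatch (an assumed scheme). The names `find_hsps`, `x_drop` and `s_min` are hypothetical stand-ins for the thresholds X and S.

```python
from collections import defaultdict

MATCH, MISMATCH = 1, -1  # assumed toy scores, not BLAST's scoring matrices


def index_words(seq, w):
    """Map each word of length w in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def extend(query, subject, qi, si, step, x_drop):
    """Ungapped extension in one direction (step is +1 or -1).

    Returns (best score gain, length giving that gain). Stops when the
    running score falls more than x_drop below the best score seen.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += MATCH if query[qi] == subject[si] else MISMATCH
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
        qi += step
        si += step
    return best, best_len


def find_hsps(query, subject, w=4, x_drop=2, s_min=6):
    """Return sorted (query_start, subject_start, length, score) tuples."""
    hsps = set()
    index = index_words(query, w)
    for si in range(len(subject) - w + 1):
        for qi in index.get(subject[si:si + w], []):  # exact-word hit
            right, rlen = extend(query, subject, qi + w, si + w, +1, x_drop)
            left, llen = extend(query, subject, qi - 1, si - 1, -1, x_drop)
            score = w * MATCH + left + right
            if score >= s_min:  # keep only segments reaching the threshold S
                hsps.add((qi - llen, si - llen, llen + w + rlen, score))
    return sorted(hsps)


if __name__ == "__main__":
    print(find_hsps("ACGTACGT", "GGACGTACGTGG"))
```

Several seeds inside the same matching region extend to the same segment, which is why the results are collected in a set before being returned.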

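22 BLAST: selection of program (T = translation)

Query sequence          Library                 Program
Protein                 Protein                 BLASTP
Nucleic                 Nucleic                 BLASTN
Nucleic (T)             Protein                 BLASTX
Protein                 Nucleic (T)             TBLASTN
Nucleic (T)             Nucleic (T)             TBLASTX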
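The choice of program depends only on the type of the query and of the library, so it can be written as a small lookup. The sketch below is illustrative: the function name is hypothetical, and BLASTX and TBLASTN are the standard NCBI program names for the two mixed cases, which the slide indicates only through the "T" (translation) arrows.

```python
# (query type, library type) -> program, when at most one side is translated
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",   # query is translated
    ("protein", "nucleotide"): "tblastn",  # library is translated
}


def select_blast_program(query_type, library_type, translate_both=False):
    """Pick a BLAST program name from the query and library types."""
    if translate_both:
        # tblastx compares translated nucleotides on both sides
        if (query_type, library_type) != ("nucleotide", "nucleotide"):
            raise ValueError("translate_both requires nucleotide query and library")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, library_type)]
    except KeyError:
        raise ValueError(f"unknown combination: {query_type}, {library_type}") from None


if __name__ == "__main__":
    print(select_blast_program("nucleotide", "protein"))
```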

23 Program selection: nucleotide (blastn) http://blast.ncbi.nlm.nih.gov/blast.cgi

24 http://blast.ncbi.nlm.nih.gov/blast.cgi
- Enter sequences or ID: DV
- Enter database
- Optimize program

25 http://blast.ncbi.nlm.nih.gov/blast.cgi
- Number of hits
- Distribution of hits according to score

26 http://blast.ncbi.nlm.nih.gov/blast.cgi

27 http://blast.ncbi.nlm.nih.gov/blast.cgi

28 http://blast.ncbi.nlm.nih.gov/blast.cgi

29 A database of databases http://