Databases in genomics

Search in biological databases
Searching databases is the most common task of a molecular biology researcher. It answers questions such as:
- Are there new sequences deposited in biological databases for my favorite species?
- Do these sequences contain coding genes?
- Do these genes belong to known gene families? What kind of proteins do they encode?
- Are there homologous genes in other species?
- Do these sequences contain conserved non-coding sequences or repeats?
Search by keyword or by similarity.
Known software: Smith-Waterman, FASTA and BLAST.

Databases: The biological sequence databases
A biological sequence database is a collection of data that is:
- Structured
- Indexed (table of contents)
- Updated daily
- Cross-referenced with other databases
Two big categories of sequence databases:
- Non-specialized (universal): GenBank, EMBL, DDBJ, SwissProt, PIR
- Specialized: PDB, ProSite, BLOCKS, Pfam, Swiss-3DImage, ...

Databases: The main sequence databases
Universal DB
- NCBI (National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/
  - GenBank: annotated collection of all publicly available DNA sequences. Works with EMBL and DDBJ. Stores sequences from 100,000 species. Official site of BLAST.
  - PubMed: search engine primarily accessing the MEDLINE database of references
  - COGs/KOGs: clusters of orthologous genes
- EMBL (The European Molecular Biology Laboratory): http://www.ebi.ac.uk/embl
- DDBJ: http://www.ddbj.nig.ac.jp/
- Expasy (Expert Protein Analysis System), proteomics
  - UniProtKB/Swiss-Prot: protein sequences manually annotated by experts
  - UniProtKB/TrEMBL: unreviewed, automatically annotated
  - PROSITE: protein domains and families
Specialized DB (specific organisms)
- FlyBase: http://flybase.org/
- SGD: http://www.yeastgenome.org/
- TAIR: http://www.arabidopsis.org
- CuGenDB (Cucurbit Genome Database): http://www.icugi.org/
- Coffee genome database: http://coffee-genome.org

Databases: Exponential growth of GenBank (http://www.ncbi.nlm.nih.gov/genbank/statistics)
As of Aug 2015, GenBank release 209 contained 199,823,644,287 bases in 187,066,846 reported sequences (doubling approximately every 18 months).
The WGS data in release 209 contained 1,163,275,601,001 bases in 302,955,543 reported sequences.
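The doubling claim can be turned into a rough projection. The sketch below assumes perfectly regular exponential growth with an 18-month doubling time, which real releases only approximate. The function name `projected_bases` is illustrative and does not come from the slides.

```python
def projected_bases(current_bases: int, months_ahead: float,
                    doubling_months: float = 18.0) -> float:
    """Project a database size, assuming it doubles every `doubling_months`."""
    return current_bases * 2 ** (months_ahead / doubling_months)


# GenBank release 209 (Aug 2015), as quoted on the slide.
release_209 = 199_823_644_287

# One doubling period later the size has doubled, two periods later quadrupled.
print(projected_bases(release_209, 18))
print(projected_bases(release_209, 36))
```

This is only an order-of-magnitude estimate: the 18-month figure is itself approximate.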

Databases: NCBI (http://www.ncbi.nlm.nih.gov/)

Databases: GenBank (http://www.ncbi.nlm.nih.gov/genbank/)

Databases: GenBank database divisions (http://www.ncbi.nlm.nih.gov/genbank/)
http://www.ncbi.nlm.nih.gov/books/nbk21105/#GenBank_ASM

Data access
Easy interface to use for queries.
Description of the search keys: http://www.ncbi.nlm.nih.gov/sitemap/samplerecord.html#locusb
- Go to GenBank: http://www.ncbi.nlm.nih.gov/
- Use the GenBank search fields.
- Enter search keys, such as Coffea canephora [ORGN] for the organism and 100:1000 [SLEN] for a genomic sequence length between 100 bp and 1000 bp.
Record fields and their search qualifiers:
- Locus name: [ACCN]
- Sequence length: [SLEN]
- Molecule type: [PROP]
- GenBank division: [PROP]
- Modification date: [MDAT]
- Definition: [TITL]
- Accession: [ACCN]
- Version: All fields
- GI: All fields
- Keywords: [KYWD]
- Source: [ORGN]
- Organism: [ORGN]
- Reference: [TITL] [AUTH] [JOUR]
- Features: [FKEY]
- CDS: [FKEY]
- gene: [FKEY]
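The same kind of query can be assembled programmatically. The sketch below builds the search term from the [ORGN] and [SLEN] qualifiers and wraps it in an NCBI E-utilities ESearch URL. It only constructs the URL and does not contact NCBI. The helper names `build_term` and `build_esearch_url`, and the choice of the `nuccore` database, are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(organism: str, min_len: int, max_len: int) -> str:
    """Combine an organism filter and a sequence-length range into one term."""
    return f"{organism}[ORGN] AND {min_len}:{max_len}[SLEN]"


def build_esearch_url(term: str, db: str = "nuccore") -> str:
    """Return the ESearch URL for `term` (db choice is an assumption)."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term})}"


term = build_term("Coffea canephora", 100, 1000)
print(term)
print(build_esearch_url(term))
```

The printed term can also be pasted directly into the GenBank search box.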

By keyword
Easy interface to use for queries.
Example: a search for all nucleotide sequences from Coffea canephora (organism) with a sequence length between 1000 bp and 10000 bp gave 16 results.
Exercise: search for all nucleotide sequences from Coffea arabica (organism) with a sequence length between 1000 bp and 10000 bp. How many results?

By keyword
The results page for the Coffea canephora search (16 results) gives access to:
- Download of the sequences in various formats
- Taxonomy
- Publications

Make your own database with GenBank and SRS!
- Go to EMBL SRS: http://www.embnet.sk:8080/srs81/ -> Query builder
- Select EMBL (nucleotide database)
- Search for all genomic DNA > 1 kb for Coffea canephora. How many sequences?
- Download the sequences to create a database of Coffea sequences

Databases
Type: Coffea canephora
- Nucleotide
- GSS
- BioProject
- BioSamples

Databases: Structure of a file in the GenBank database
EU164537 in FASTA format:
- Unique ID
- Diverse data
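A FASTA file is simple enough to read with a few lines of code: each record starts with a ">" header line carrying the unique ID and free-text data, followed by the sequence lines. The sketch below is a minimal parser. The two records are hypothetical placeholders, not the real content of EU164537.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID (first token after '>') to its joined sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line  # sequences may span several lines
    return records


# Hypothetical records, for illustration only.
example = """>EXAMPLE1 hypothetical description
ACGTACGT
TTGA
>EXAMPLE2 another hypothetical record
GGCC
""".splitlines()

records = parse_fasta(example)
print(records)
```

For real work, an established library parser is usually a safer choice than a hand-written one.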

Databases: Structure of one file in the GenBank database
http://www.ncbi.nlm.nih.gov/sitemap/samplerecord.html#locusb
EU164537:
- Identification of the sequence
- Unique ID number
- Taxonomic data
- References
- Annotations
- Cross-references with other DBs
- Keywords

Databases
Type: Coffea canephora
- Nucleotide
- GSS
- BioProject
- BioSamples
In BioSamples: download 2 BAC-end libraries for C. canephora

Databases
Type: rice
SRA: Sequence Read Archive, with reads from:
- Roche 454
- Illumina
- SOLiD
- Pacific Biosciences
Example: SRX510191, low-coverage sequencing of ASWINA 330

Databases: RefSeq (http://www.ncbi.nlm.nih.gov/projects/refseq/)
Stable information!

By similarity
Why?
- To know whether my favorite sequences are similar to others already known
- To find all sequences from the same family
- To search for all sequences with a specific pattern
Tools
- Large datasets have to be handled and analyzed
- Regular sequence alignment software is unusable at this scale
- Use the BLAST algorithm
- BLAST is a heuristic algorithm: it produces a quick solution when regular/classic programs are too slow, but the results are approximate

By similarity
Alignment of 2 sequences: the common approach to comparing 2 sequences, by counting the number of differences (insertions, deletions, substitutions).
- Global alignment (Needleman & Wunsch, 1970): Protein A and Protein B are aligned over their whole length.
- Local alignment (Smith & Waterman, 1981; FASTA, 1988; BLAST, 1990): only the similar regions are aligned, for example a domain shared by Protein A and Protein B, or an mRNA against its gene.
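The difference between global and local alignment shows up directly in the dynamic-programming recurrence. The sketch below computes only the optimal score, not the alignment itself, with a single function that switches between the Needleman-Wunsch (global) and Smith-Waterman (local) variants. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for readability. Real tools use substitution matrices and affine gap penalties.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -1, local: bool = False) -> int:
    """Optimal global (default) or local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalized along the first row and column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # local: restart instead of going negative
                best = max(best, cell)
            score[i][j] = cell
    # Global score sits in the last cell; local score is the best cell anywhere.
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))                           # global
print(alignment_score("AAAACGTAAA", "CCCACGTCCC", local=True))  # local
```

In the second call the two sequences share only the core "ACGT", so the local score reflects that region alone, whereas a global alignment would also pay for the mismatching flanks.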

BLAST: Basic Local Alignment Search Tool
- Searches for regions rich in similarity, without insertions or deletions.
- A word length w is chosen: w = 2 or 3 amino acids for proteins.
- The query sequence is hashed into words m of length w.
- For each word m, a list of neighboring words of length w is built, containing the words whose score against m is above a fixed threshold T.
- Each word similar to m is compared with every word of length w taken from each sequence B_i of the database.
- When a word of a sequence B_i is identical to a word in the list of neighboring words, a hit is recorded.
- For each hit, the program extends the alignment without gaps in both directions.
- The extension stops when the score of the extended word drops by more than a fixed threshold X.
- Segments whose similarity score is above a fixed threshold score S are retained (High-scoring Segment Pairs = HSPs).
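The seed-and-extend steps can be sketched in a few lines. This is a deliberately simplified illustration, not the BLAST implementation: it only accepts exact word matches as hits (the neighborhood list with threshold T is omitted), it scores with match +1 / mismatch -1 instead of a substitution matrix, and the names `find_hsps`, `x_drop` and `s_min` are illustrative stand-ins for the thresholds X and S.

```python
def _extend(query, subject, qi, sj, step, x_drop, match, mismatch):
    """Best ungapped extension from (qi, sj) in direction `step`: (gain, length)."""
    best_gain = gain = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += match if query[qi] == subject[sj] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # score dropped more than X below the best seen: stop
        qi += step
        sj += step
    return best_gain, best_len


def find_hsps(query, subject, w=3, x_drop=2, s_min=4, match=1, mismatch=-1):
    """Return sorted (query_start, subject_start, length, score) tuples."""
    # Hash the query into words of length w (word -> start positions).
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)

    hsps = set()
    # Scan the subject; an identical word is a hit.
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            # Extend each hit without gaps in both directions.
            r_gain, r_len = _extend(query, subject, i + w, j + w, 1,
                                    x_drop, match, mismatch)
            l_gain, l_len = _extend(query, subject, i - 1, j - 1, -1,
                                    x_drop, match, mismatch)
            score = w * match + l_gain + r_gain
            # Keep only segments scoring at least S.
            if score >= s_min:
                hsps.add((i - l_len, j - l_len, w + l_len + r_len, score))
    return sorted(hsps)


print(find_hsps("TTACGTACGG", "CCACGTACAA"))
```

With these toy sequences, several seeds fall on the same diagonal and all extend to the same six-letter segment "ACGTAC", so a single HSP is reported. The two off-diagonal seeds cannot be extended and stay below `s_min`.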

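BLAST: Selection of program
(T = translation of a nucleic sequence into protein before comparison)
- BLASTP: protein query sequence against a protein library
- BLASTN: nucleic query sequence against a nucleic library
- BLASTX: translated (T) nucleic query sequence against a protein library
- TBLASTN: protein query sequence against a translated (T) nucleic library
- TBLASTX: translated (T) nucleic query sequence against a translated (T) nucleic library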
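The choice of program follows mechanically from the type of the query and the type of the library. The sketch below encodes that rule as a lookup table. It includes blastx and tblastn, two standard BLAST programs, alongside the three named on the slide. The function name `choose_blast_program` and its `translate_both` flag are illustrative assumptions.

```python
# (query type, library type) -> program, for the non-tblastx cases.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",    # query is translated
    ("protein", "nucleotide"): "tblastn",   # library is translated
}


def choose_blast_program(query_type: str, library_type: str,
                         translate_both: bool = False) -> str:
    """Pick a BLAST program; `translate_both` selects tblastx."""
    if translate_both:
        if (query_type, library_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and library")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, library_type)]


print(choose_blast_program("protein", "protein"))
print(choose_blast_program("nucleotide", "nucleotide", translate_both=True))
```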

Program selection: nucleotide (blastn)
http://blast.ncbi.nlm.nih.gov/blast.cgi

http://blast.ncbi.nlm.nih.gov/blast.cgi
- Enter sequences or ID: DV713130
- Enter database
- Optimize program

http://blast.ncbi.nlm.nih.gov/blast.cgi
- Hit number
- Hit distribution according to score

A database of databases: http://www.biodbs.info/