Bioinformatics overview

1 Bioinformatics overview

Biomedical applications in high-performance computing platforms (Aplicações biomédicas em plataformas computacionais de alto desempenho; Aplicaciones biomédicas sobre plataformas gráficas de altas prestaciones)

Oswaldo Trelles, PhD, University of Malaga

In this section we survey the bioinformatics application domain and the typical sources of data in the field.

2 Definition

Bioinformatics: the application of computational techniques to the management and analysis of biological data.

- Computational techniques: computer science, statistics, physics, chemistry, ... (information technologies)
- Biological data: molecular, clinical, population, environmental, ...
- Management and analysis: acquisition, storage, retrieval, transmission, processing, ...

The early definition of bioinformatics must be updated to better describe its application domain. Several disciplines have allied with computer science, not only to support the management of digital information representing biological data, but also to provide unprecedented opportunities to increase the understanding of the functions and dynamics of individual cells at one end, and of populations at the other. For biomedicine, the promise is personalized genetics, with implications for human health and medicine. It is well accepted by the scientific community that 'bioinformatics is interested in the computational management of all types of biological information, whether it refers to genes or their products, to whole organisms, or even to complete ecological systems'.

3 The domain of the data

There is a huge diversity of types of data obtained from biological experiments. They range from sequence information about genomes, or simply about some interesting stretch of DNA, to data describing the interrelationships between different biological entities.

The initial flow of information takes the form of biological sequences, which are sequenced, assembled, stored, distributed and made accessible via specialized web sites worldwide. Using sequence composition, the next step is to group similar stretches into longer ones (contigs) and to associate a biological function with these contigs. Sequence analysis involves the study of this kind of sequential information.

3D structure prediction needs products obtained by sequence analysis (such as conserved zones identified by multiple sequence analysis). 3D information on proteins also comes from direct experiments aimed at identifying such structural information.

Transcriptomics makes it possible to study the behavior of the organism under different experimental conditions. This information is used to identify the genes involved in changes of protein levels in the cell. Since genes do not usually work alone, establishing the interactions between their products (proteins) is another important step. Pathway analysis and systems biology work on this type of data.

4 Featuring biological data

Although it is a commonplace that the volume of data in molecular biology is growing at exponential rates, the key characteristic of biological data is not so much its volume as its diversity, heterogeneity and dispersion.

The accumulated biological knowledge needed to obtain a more complete view of any biological process (e.g. sequences, structures, gene-expression data, pathways) is scattered around the world in the form of biological sequence and structure databases, frequently as flat files, as well as image- or scheme-based libraries, web-based information, particular and specific query systems, etc. The heterogeneity of formats and storage media, together with the diversity and dispersion of the data, makes it difficult to use this plethora of interrelated information. Knowledge of the internals of these information sources, to enable unified access, is a clear and important technological priority.

5 Data production

The huge amount of information generated collectively by new parallel data acquisition technologies, such as Next Generation Sequencing (NGS), High Throughput (HT) proteomics and gene expression microarrays, imposes new challenges on computational data processing and analysis. These technologies offer biologists unprecedented opportunities to increase the understanding of the functions and dynamics of individual cells at one end, and of populations at the other. For biomedicine, the promise is personalized genetics, with implications for human health and medicine.

Figure: levels of biological organization, from atoms, proteins, interactions and metabolic pathways to cells, organs, organisms and populations.

6 Data Volume

The latest technological advances in biological and biomedical data acquisition are creating mountains of data. New technologies such as nanopore sequencing, single-molecule sequencing and the better-known NGS technologies are revolutionizing data analysis. NGS in particular was highlighted in the Nature special edition 'Big Data: welcome to the petacentre, science in the petabyte era' (Editorial 2008/11). The arrival of DNA sequencing technologies that produce vast amounts of sequence information has triggered a paradigm shift in genomics, enabling direct genome comparison and parallel surveying of populations.

While sequencing is the champion of recent advances in 'big data', metabolomics, proteomics and microarrays also generate large amounts of data. Ultrahigh-density microarrays carry more than 5 million probes on a single slide, against one hundred thousand probes per array a few years ago. Not only has the number of data points per sample increased; the number of samples per experiment has also been growing rapidly. For example, the analysis of expression Quantitative Trait Loci (eQTL) using tiling or SNP arrays is already problematic on a single computer for 100 samples. With recent projects genotyping over 15,000 patients, and growing, the analysis problem is imminent.

Figure: growth in the volume of data stored in UniProt and in the number of UniProt + TrEMBL entries (left); number of protein structures in PDB (right).

7 Diversity of types of data

>E01306 229 bp DNA linear
gaattctaac ggtcccgaaa ctctgtgcgg tgctgaactg gttgacgctc tgcagtttgt
ttgcggtgac cgtggttttt attttaacaa acccactggt tatggttctt cttctcgtcg
tgctccccag actggtattg ttgacgaatg ctgctttcgt tcttgcgacc tgcgtcgtct
ggaaatgtat tgcgctcccc tgaaacccgc taaatctgct tagaagctt

Beyond its capacity for data production, modern molecular biology research spans different (related) domains and data types: not only sequence data but also full genomes, protein folding or 3D conformation, expression levels of genes associated with protein levels in the cell, metabolic pathways, protein interactions, protein domains, etc. Moreover, new data types are continuously being developed as new technologies emerge for creating and analyzing data.

Figure: a DNA sequence in FASTA format (top); a 3D protein model and the light intensities of a DNA microarray (middle); a mass spectrogram and an agarose electrophoresis gel used to separate proteins (bottom).

8 Format heterogeneity

Unfortunately, bioinformatics has grown in a somewhat chaotic way. The same sequence can be found stored in different formats, even on very well-known servers such as EBI or NCBI.

GenBank record:

LOCUS       E01306 229 bp DNA linear PAT 04-NOV-2005
DEFINITION  DNA encoding human insulin-like growth factor I(IGF-I).
ACCESSION   E01306
VERSION     E GI:
KEYWORDS    JP A/1.
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1 (bases 1 to 229)
  AUTHORS   Raasu,A., Toomasu,M., Berun,N. and Majiasu,U.
  TITLE     METHOD FOR TRANSPORTING GENE PRODUCT TO MEDIUM PROPAGATING GRAM
            NEGATIVE BACTERIA
  JOURNAL   Patent: JP A 1 20-AUG-1987;
            KABIGEN AB
COMMENT     OS Artificial gene
            OC Artificial sequence; Genes.
            OS Homo sapiens
            PN JP A/1
            PD 20-AUG-1987
            CC strandedness: Single;
            CC topology: Linear;
            CC hypothetical: No;
            CC anti-sense: No;
            FH Key Location/Qualifiers
            FT /product='human insuline-like growth factor I
            FT CDS >
FEATURES             Location/Qualifiers
     source
                     /organism="synthetic construct"
                     /mol_type="unassigned DNA"
                     /db_xref="taxon:32630"
ORIGIN
        1 gaattctaac ggtcccgaaa ctctgtgcgg tgctgaactg gttgacgctc tgcagtttgt
          ttgcggtgac cgtggttttt attttaacaa acccactggt tatggttctt cttctcgtcg
          tgctccccag actggtattg ttgacgaatg ctgctttcgt tcttgcgacc tgcgtcgtct
          ggaaatgtat tgcgctcccc tgaaacccgc taaatctgct tagaagctt
//

EMBL record:

ID   E01306; SV 1; linear; unassigned DNA; PAT; SYN; 229 BP.
AC   E01306;
DT   07-OCT-1997 (Rel. 52, Created)
DT   09-NOV-2005 (Rel. 85, Last updated, Version 3)
DE   DNA encoding human insulin-like growth factor I(IGF-I).
KW   JP A/1.
OS   synthetic construct
OC   other sequences; artificial sequences.
RA   Raasu A., Toomasu M., Berun N., Majiasu U.;
RT   "METHOD FOR TRANSPORTING GENE PRODUCT TO MEDIUM PROPAGATING GRAM
RT   NEGATIVE BACTERIA";
RL   Patent number JP A/1, 20-AUG
RL   KABIGEN AB.
CC   OS Artificial gene
CC   OC Artificial sequence; Genes.
CC   OS Homo sapiens
CC
CC   CC strandedness: Single;
CC   CC topology: Linear;
CC   CC hypothetical: No;
CC   CC anti-sense: No;
CC   FH Key Location/Qualifiers
CC   FT mat_peptide
CC   FT CDS >
CC   FT /product="human insulin-like growth factor I"
FH   Key             Location/Qualifiers
FT   source
FT                   /organism="synthetic construct"
FT                   /mol_type="unassigned DNA"
FT                   /db_xref="taxon:32630"
SQ   Sequence 229 BP; 40 A; 57 C; 55 G; 77 T; 0 other;
     gaattctaac ggtcccgaaa ctctgtgcgg tgctgaactg gttgacgctc tgcagtttgt
     ttgcggtgac cgtggttttt attttaacaa acccactggt tatggttctt cttctcgtcg
     tgctccccag actggtattg ttgacgaatg ctgctttcgt tcttgcgacc tgcgtcgtct
     ggaaatgtat tgcgctcccc tgaaacccgc taaatctgct tagaagctt               229
//

First record: the DNA encoding human insulin-like growth factor I (IGF-I) as available at GenBank, accession E01306 (search for insulin in All databases). Second record: the same (E01306) sequence as an EMBL-format record. In both text boxes some lines have been removed.

9 Dispersion of data sources

Currently, more than 1000 biological data collections are publicly available on the Internet, with different levels of quality (e.g. curated databases), in different formats, and including a variety of links to related information available in other collections. Bioinformatics tasks usually require the application of different services to combine or process the data to different degrees. In a distributed and diverse environment, data and service integration becomes crucial for efficient data analysis in bioinformatics.

See: [1] Infobiogen, Catalog of Databases.

10 Bioinformatics servers

The Internet browser has become the in-silico counterpart of the pipette in traditional wet labs, where compounds are combined and recombined to produce the desired result. In a similar way, bioinformatics relies strongly on the universal availability of web resources, which often need to be combined to produce a useful outcome. This set of computational compounds resides in more than one thousand databases (Galperin, 2009) and more than 130 servers linking around 1200 services (Fox J. et al. 2008).

11 Types of data and applications (overview)

12 Sequencing data

The long DNA chain is split into small fragments that are read using sequencing technology. Bioinformatics tasks over the final data include:

- Getting statistical summaries of the base-call quality scores to study data quality.
- Calculating a coverage vector and exporting it for visualization in a genome browser.
- Reading in annotation data from a GFF file.
- Assembling fragments into longer chunks called contigs.
- Assigning aligned reads to exons and genes.
- Biologically intelligent interpretation of genomic data.

>000014_1863_0292 length=76 uaccno=fgsmdpn08etuie
AATACTCAGGAATCGAACGGACTCGGGTATAGTATATGATCGGCAGCCAGCCG
AACATAACAGCGGCATGAAAACC
>000016_1821_0619 length=120 uaccno=fgsmdpn08ep50t
GGCAAGTTTTCGGTGTCGCTAAGCCCGAGATATCGCAGCTCACCCGTGTCGGC
GATTGCTGCTGTGACCGTCCCCAGTCGGTCACCCTCCGGCTGATTCTATCCTT
ACATCGGTCGTTTC
>000021_1845_1786 length=69 uaccno=fgsmdpn08esarw
ATCCGCGCGGCCGCATTGTCGACACTGCCTGCCGGCAGTGAAGGCGAGGCGCA
GGTGGCCGATGCGCTG
>000030_1849_0863 length=69 uaccno=fgsmdpn08esmpd
ATCCGCGCGGCCGCATTGTCGACACTGCCTGCCGGCAGTGAAGGCGAGGCGCA
GGTGGCCGATGCGCTG
>000035_1856_0283 length=148 uaccno=fgsmdpn08es8dp
GACGCCCTTTATGCACGTTTCGCTCACAGTATCCCTTAATAGCAAGATTAATA
CCCTCAGTGGCCCCACTAGTAAAAACGATCTCTCGAGAACGACAGTTCAGTTC
ATTGGCAATCAATTTTCGGGCCGTTTCTTACCGCCTCCTCAG

FASTQ has emerged as a common file format for sharing sequencing read data, combining both the sequence and an associated per-base quality score. PHRED introduced the concept of base-calling quality in terms of the estimated probability of error $p_e$:

$Q_{PHRED} = -10 \log_{10}(p_e)$

This information can be stored as a plain-text, space-separated set of positive numbers in a format similar to FASTA. It can also be stored as a string of ASCII characters (adding 33 to each score to avoid unprintable characters).

Note: complete the information about the FASTQ format from the Web.
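The relation between error probability, Phred score and the ASCII encoding with offset 33 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not part of any tool mentioned in these slides.

```python
import math

def phred_from_error(p_error: float) -> float:
    """Q = -10 * log10(p_e), as defined on the slide."""
    return -10.0 * math.log10(p_error)

def error_from_phred(q: float) -> float:
    """Inverse relation: p_e = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

def decode_quality(qual: str, offset: int = 33) -> list[int]:
    """Turn an ASCII quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in qual]

def encode_quality(scores: list[int], offset: int = 33) -> str:
    """Turn integer Phred scores into an ASCII quality string."""
    return "".join(chr(q + offset) for q in scores)

if __name__ == "__main__":
    for q in decode_quality("I5!"):
        print(q, error_from_phred(q))
```

With offset 33, the character `!` encodes score 0 and `I` encodes score 40, so every score maps to a printable character.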

13 Assembling the puzzle

In a first step, the software (in general provided by the supplier of the sequencing device) interprets the spectrograms and translates them into a sequence of letters. An exhaustive and resource-consuming procedure is then needed to assemble the fragments into longer contigs... the sequence is coming up.

It is important to mention the necessary quality control of the sequencer output, to remove sequences belonging to the cloning vectors used, linkers, or low-quality data.

14 Biological sequence data

>ref NT_ : Drosophila melanogaster chromosome 2L
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGG
GAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAA
TGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTT
AAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCG
TTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT
TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGAC
TGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGAT
ATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGC
CAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTC
AAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA
GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAA
GACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCG
AGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAAC
CCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATAT
TATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAAC
CCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGC
AACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGC
CTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGA
GATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATAT
AATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGA
ACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAG
TAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAAT
ATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGT
ATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACC
CAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCG
TATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACC
GCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTC
CCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTT
GGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATT
CATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAAT
GAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGC
AGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCT
CTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGA
AGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATG
CGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
GATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGA
TCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG
CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTAT
TTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTT
TCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGA
TTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA
TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGT
GCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGA
GAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTG
TGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGAC
AGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCG

The final product of the assembly process is a string of [A,C,G,T] characters, preceded by a first line with some information identifying the sequence (see the text box above for a 3290 nt sequence whose first line gives information on the sequence location). FASTA is the favorite format for this type of data.

In our simile of the book of life, we now have the text, but the meaning of the message is still unknown: its punctuation symbols, its final product, the biological process in which it is involved, etc.
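Because a FASTA record is just a header line starting with `>` followed by sequence lines, reading it takes very little code. The following is a minimal sketch, assuming well-formed input; `read_fasta` is a hypothetical helper name, not a function from any library used in these slides.

```python
import io
from typing import Iterable, Iterator

def read_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)

if __name__ == "__main__":
    demo = io.StringIO(">seq1 demo\nACGT\nAC\n\n>seq2\nGG\n")
    for name, seq in read_fasta(demo):
        print(name, len(seq), seq)
```

The function accepts any iterable of lines, so an open file handle works the same way as the in-memory example.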

15 Sequence databases

ID   100K_RAT STANDARD; PRT; 889 AA.
AC   Q62671;
DT   01-NOV-1997 (Rel. 35, Created)
DT   01-NOV-1997 (Rel. 35, Last sequence update)
DT   15-JUL-1999 (Rel. 38, Last annotation update)
DE   100 KD PROTEIN (EC ).
OS   Rattus norvegicus (Rat).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
OC   Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=WISTAR; TISSUE=TESTIS;
RX   MEDLINE;
RA   MUELLER D., REHBEIN M., BAUMEISTER H., RICHTER D.;
RT   "Molecular characterization of a novel rat protein structurally
RT   related to poly(A) binding proteins and the 70K protein of the U1
RT   small nuclear ribonucleoprotein particle (snRNP).";
RL   Nucleic Acids Res. 20: (1992).
RN   [2]
RP   ERRATUM.
RA   MUELLER D., REHBEIN M., BAUMEISTER H., RICHTER D.;
RL   Nucleic Acids Res. 20: (1992).
CC   -!- FUNCTION: E3 UBIQUITIN-PROTEIN LIGASE WHICH ACCEPTS UBIQUITIN FROM
CC       AN E2 UBIQUITIN-CONJUGATING ENZYME IN THE FORM OF A THIOESTER AND
CC       THEN DIRECTLY TRANSFERS THE UBIQUITIN TO TARGETED SUBSTRATES (BY
CC       SIMILARITY). THIS PROTEIN MAY BE INVOLVED IN MATURATION AND/OR
CC       POST-TRANSCRIPTIONAL REGULATION OF MRNA.
CC
CC   This SWISS-PROT entry is copyright. It is produced through...
CC
DR   EMBL; X64411; CAA ; -.
DR   PFAM; PF00632; HECT; 1.
DR   PFAM; PF00658; PABP; 1.
KW   Ubiquitin conjugation; Ligase.
FT   DOMAIN ASP/GLU-RICH (ACIDIC).
FT   DOMAIN PRO-RICH.
FT   DOMAIN ASP/GLU-RICH (ACIDIC).
FT   BINDING UBIQUITIN (BY SIMILARITY).
SQ   SEQUENCE 889 AA; MW; DD7E6C7A CRC32;
     MMSARGDFLN YALSLMRSHN DEHSDVLPVL DVCSLKHVAY VFQALIYWIK AMNQQTTLDT
     PQLERKRTRE LLELGIDNED SEHENDDDTS QSATLNDKDD ESLPAETGQN HPFFRRSDSM
     VYEYVRKYAE HRMLVVAEQP LHAMRKGLLD VLPKNSLEDL TAEDFRLLVN GCGEVNVQML
     ISFTSFNDES GENAEKLLQF KRWFWSIVER MSMTERQDLV YFWTSSPSLP ASEEGFQPMP
     SITIRPPDDQ HLPTANTCIS RLYVPLYSSK QILKQKLLLA IKTKNFGFV
//

One of the first data processing tasks in genomic projects is to provide support for sequence data management. Although most sequence data types are stored as plain-text, unformatted files, these collections have historically been called databases. There is a high diversity of records, from the simplest, which contain only a string of characters representing the nucleotide sequence of some stretch of DNA, or protein data represented by its sequence of amino acids, to more complete records in which functional annotations are provided for the corresponding sequence.

Data management and public accessibility of these sources of information is one of the most active areas in bioinformatics. Retrieval of sequence data and functional annotations, by-content database searching, sequence alignment, etc. are frequent services offered to exploit such sources of data.
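Flat-file records such as the Swiss-Prot entry above are line oriented: a two-letter code, some padding, then the content, with `//` closing the entry. A minimal sketch of how such an entry could be grouped by line code follows. It assumes the classic layout in which content starts at column 6 and sequence lines begin with blanks; `parse_flat_entry` is a hypothetical name, and real parsers handle many more cases.

```python
from collections import defaultdict

def parse_flat_entry(text: str) -> dict[str, list[str]]:
    """Group the lines of one flat-file entry by their two-letter code.

    Sequence lines (which start with blanks) are stored under the
    illustrative key "sequence", with the spacing removed.
    """
    fields: dict[str, list[str]] = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-entry marker
        if not line.strip():
            continue
        code, content = line[:2], line[5:].rstrip()
        if code.strip():
            fields[code].append(content)
        else:
            fields["sequence"].append(content.replace(" ", ""))
    return dict(fields)

if __name__ == "__main__":
    entry = (
        "ID   100K_RAT STANDARD; PRT; 889 AA.\n"
        "AC   Q62671;\n"
        "OS   Rattus norvegicus (Rat).\n"
        "SQ   SEQUENCE 889 AA;\n"
        "     MMSARGDFLN YALSLMRSHN\n"
        "//\n"
    )
    parsed = parse_flat_entry(entry)
    print(parsed["AC"], "".join(parsed["sequence"]))
```

Keeping repeated codes (DT, OC, RT, CC) as lists preserves multi-line fields without deciding yet how they should be joined.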

16 Data and software

(Text box: the same Drosophila melanogaster chromosome 2L FASTA sequence shown on slide 14.)

In the end, as we should already know, what we obtain are the linear strings of the sequences, which may form a complete genome, a particular gene, or a protein. These sequences are stored, together with other data describing them, in public databases accessible over the Internet.

Besides maintaining the databases, these web sites provide computational services over these data sets, for example comparing a new sequence against all those stored in the database to find out its function, or comparing several sequences with each other to see what they have in common. Some web sites: the European Bioinformatics Institute (EBI), the American NCBI, and the protein data (Swiss-Prot and PDB) at the Swiss Institute (SIB).

In general we will consider software as a technology, without going into the details of its classification, but showing several examples of the types of software required nowadays.

17 Use Case: HSPs dealing with Big Data

This section is part of the Supplementary Material of 'Out-of-core computation of HSPs for large sequences', by Andrés Rodriguez, Óscar Torreño and Oswaldo Trelles. More info and programs at:

18 Dotplots

In more formal terms, let $S_n = \{x_1, x_2, \ldots, x_n\}$ be a genomic sequence composed of a linear string of symbols belonging to the DNA alphabet ($x_i \in A = \{A, C, G, T\}$). The number of symbols in the chain is the length of the sequence, $|S| = n$.

Given two genomic sequences $X_n$ and $Y_m$, a dotplot $D$ is an $(n \times m)$ matrix such that

$D_{i,j} = 1$ when $x_i = y_j$, otherwise $D_{i,j} = 0$, for $i = 1 \ldots n$; $j = 1 \ldots m$.

The algorithmic complexity of the dotplot calculation is $O(nm)$. When an averaging window of size $W$ is used to reduce the noise of random matches, the complexity grows to $O(nmW)$. In this case

$D_{i+W/2,\,j+W/2} = 1$ when $T < S$, with $S = \sum_{k=1}^{W} \mathrm{TRUE}(x_{i+k} = y_{j+k})$,

where $T$ is a given noise threshold and $S$ is the identity level.

Figure: basic dot-plot procedure. On the left, two short sequences are compared by setting a dot at the intersection of identity matches. Shaded zones show diagonals, the typical signal that reflects sequence similarities. In this case the GNLEREC sub-sequence is present in both the horizontal and the vertical sequence, together with other small fragments such as CSF and REV. On the right-hand side, a more realistic plot for longer sequences; here it is also possible to observe small inverted diagonals representing palindromic sub-sequences.
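Both definitions translate almost directly into code. The sketch below uses plain Python lists and 0-based indices, so the window starting at (i, j) marks the cell at (i + W//2, j + W//2); the function names and the toy inputs are illustrative assumptions, and a real implementation for genome-sized inputs would not materialize the full matrix.

```python
def dotplot(x: str, y: str) -> list[list[int]]:
    """Plain dotplot: D[i][j] = 1 when x[i] == y[j], else 0. Cost O(n*m)."""
    return [[1 if a == b else 0 for b in y] for a in x]

def windowed_dotplot(x: str, y: str, w: int, t: int) -> list[list[int]]:
    """Windowed dotplot: count identities S along a diagonal window of
    size w and mark the window centre when S > t. Cost O(n*m*w)."""
    n, m = len(x), len(y)
    d = [[0] * m for _ in range(n)]
    for i in range(n - w + 1):
        for j in range(m - w + 1):
            s = sum(x[i + k] == y[j + k] for k in range(w))
            if s > t:
                d[i + w // 2][j + w // 2] = 1
    return d

if __name__ == "__main__":
    for row in windowed_dotplot("ACGTTGCA", "TACGTTGC", w=3, t=2):
        print("".join("*" if cell else "." for cell in row))
```

Raising the threshold `t` towards `w` keeps only windows of near-perfect identity, which is what removes the isolated random matches.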

19 Dotplots: speed-up the process

To reduce the computational space and accelerate data processing, most of the proposed strategies use some kind of pre-processing step. K-mers are used as prefixes for fast identification of matching words, which serve as seed points from which to extend the local alignment.

Figure: a hash table using different prefix lengths (K = 1, 2, 3). The header contains the word and the table contains the positions at which that word appears in the sequence.

The longer the prefix, the smaller the number of occurrences. In this case the hash is built on full identity, so all the words under a given header entry are the same (this is a tradeoff between sensitivity and memory requirements). The number of possible hash headers is $4^K$, where K is the word length, although not all combinations need be present in the sequence. The exact number of words is L - K + 1, where L is the sequence length.
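A small in-memory version of such a table makes the counting argument concrete. The sketch below is an assumption-laden illustration: it uses a Python dictionary instead of the on-disk structure of the actual programs, a hypothetical function name, and a short made-up sequence.

```python
from collections import defaultdict

def build_kmer_index(seq: str, k: int) -> dict[str, list[int]]:
    """Map every word of length k to the 1-based positions where it starts."""
    index: dict[str, list[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + 1)
    return dict(index)

if __name__ == "__main__":
    seq = "TCAGACGATTG"  # short illustrative sequence, L = 11
    for k in (1, 2, 3):
        index = build_kmer_index(seq, k)
        n_words = sum(len(pos) for pos in index.values())
        # n_words is always L - K + 1; the headers present never exceed 4**K
        print(k, n_words, len(index), 4 ** k)
```

As K grows, the number of distinct headers approaches the number of words, so each header holds fewer positions.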

20 HITS: k-mers are used as seed points

Hash table for seq X (positions of the symbols, 1-based)
seqx: TCAGACGATTG (n=11)

Hash table (seqx for K=1)
A: 3, 5, 8
C: 2, 6
G: 4, 7, 11
T: 1, 9, 10

Hash table for seq Y (positions of the symbols, 1-based)
seqy: ATCGGAGCTG (n=10)

Hash table (seqy for K=1)
A: 1, 6
C: 3, 8
G: 4, 5, 7, 10
T: 2, 9

Identical words produce hits at the coordinates where they appear (diag = h - v, when $x_h$ matches $y_v$):

- (A) produces hits at (3, 1), (3, 6), (5, 1), (5, 6), (8, 1) and (8, 6)
- (C) produces hits at (2, 3), (2, 8), (6, 3) and (6, 8)

The number of hits depends on the number of repetitions of the matching words, which in turn depends on the size of K.
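The hits listed above can be reproduced by crossing the two position tables word by word. This is a minimal in-memory sketch with hypothetical function names; the actual programs work on sorted dictionaries stored on disk.

```python
from collections import defaultdict

def kmer_positions(seq: str, k: int) -> dict[str, list[int]]:
    """Word -> 1-based start positions, as in the hash tables above."""
    table: dict[str, list[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return table

def find_hits(seq_x: str, seq_y: str, k: int) -> list[tuple[int, int, int, str]]:
    """Return (diag, h, v, word) for every pair of positions sharing a word,
    with diag = h - v."""
    tx, ty = kmer_positions(seq_x, k), kmer_positions(seq_y, k)
    hits = []
    for word in sorted(tx.keys() & ty.keys()):
        for h in tx[word]:
            for v in ty[word]:
                hits.append((h - v, h, v, word))
    return hits

if __name__ == "__main__":
    x, y = "TCAGACGATTG", "ATCGGAGCTG"
    for k in (1, 2):
        print(k, len(find_hits(x, y, k)))
```

On these two sequences the number of hits drops sharply when moving from K=1 to K=2, which illustrates the dependence on K stated on the slide.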

21 Big-Hits: joining hits by proximity

Hits on the same diagonal at a distance shorter than a parameter D are joined to form a big-hit. In the image, the three hits in the first upper diagonal (at distances X1, X2 < D) and the two hits in the second diagonal (X3 < D) are joined to form a 3BigHit and a 2BigHit.

Figure: behavior of the seed-point computational space as a function of K (word length) and of the inter-hit distance parameter used to group neighboring hits. Real data (chromosome X from several species) were used in the simulation.
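The joining rule can be sketched as a single pass over hits sorted by diagonal and position. Two points are assumptions made here for illustration and are not stated on the slide: the distance is measured as the gap between the end of the current big-hit and the start of the next hit, and each hit is represented as a (diag, start position in X) pair of length K.

```python
def join_big_hits(hits, k, max_dist):
    """Join hits lying on the same diagonal when the gap to the previous
    (big-)hit is shorter than max_dist.

    hits: iterable of (diag, pos_x) pairs, each hit covering k symbols.
    Returns a list of (diag, start_x, length, n_hits) tuples.
    """
    big_hits = []
    for diag, pos in sorted(hits):
        if big_hits:
            b_diag, b_start, b_len, b_count = big_hits[-1]
            gap = pos - (b_start + b_len)  # assumed definition of distance
            if b_diag == diag and gap < max_dist:
                new_len = max(b_len, pos + k - b_start)
                big_hits[-1] = (b_diag, b_start, new_len, b_count + 1)
                continue
        big_hits.append((diag, pos, k, 1))
    return big_hits

if __name__ == "__main__":
    demo = [(0, 1), (0, 5), (0, 9), (3, 20), (3, 24), (7, 40)]
    for big_hit in join_big_hits(demo, k=2, max_dist=4):
        print(big_hit)
```

In the demo, three hits on diagonal 0 and two hits on diagonal 3 are merged, mirroring the 3BigHit and 2BigHit of the figure, while the isolated hit on diagonal 7 stays alone.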

22 HSPs out-of-core: the global idea

Inputs: Sequence 1, Sequence 2

- Pre-processing (masking LCR, W size)
- Hashing (including sub-processes such as sorting, grouping, etc.)
- HITS by diagonal, including BigHits detection and pruning (see slide 8 for hits positions)
- HITS extension: search for similarities using hits as seed points
- Post-processing: visualization, frequencies, word distributions, etc.

23 1st Step: Building the dictionaries (hash tables on disk)

- words: scans the sequence, storing on disk the collection of words and their positions
- sort (words): orders the collection by word (similar words end up in consecutive positions)
- w2hd: creates the hash table (named the dictionary) on disk

Pipeline: Sequence -> words -> words & positions -> sort -> ordered words -> W2HD -> dictionary

It is worth noting that this process needs to be done only once for each sequence (or twice if the reverse complement is also going to be analysed). The dictionary can be computed for K=32, which contains all the prefixes for k < K.

24 2nd Step: alignments from hits (ungapped fragments detection)

- Starting: dictionaries of the sequences to compare
- Hits: hits production based on identical words (the size of K can be redefined to increase sensitivity)
- Sort: by diagonal and offset in the diagonal
- Big-Hits: join hits by proximity
- FragsFromHits: the main procedure. Starting from a seed point, extend the local ungapped alignment
- Other: several tools can be used for post-processing

Pipeline: dictionary seq. X + dictionary seq. Y -> HITS (K value) -> hits -> SORT (by diagonal/offset) -> ordered hits -> Big-Hits (D) -> big hits -> FragsFromHits -> fragments -> ViewFrags
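The extension step can be illustrated with a generic ungapped extension around a seed. The slides do not specify the scoring scheme or the stopping rule of FragsFromHits, so the +1/-1 scores and the drop-off threshold below are assumptions chosen only to show the idea, and `extend_ungapped` is a hypothetical name.

```python
def extend_ungapped(x, y, i, j, k, match=1, mismatch=-1, x_drop=3):
    """Extend an exact seed of length k (0-based starts i in x, j in y)
    in both directions without gaps.

    Extension in each direction stops at the sequence ends or when the
    running score falls x_drop or more below the best score seen.
    Returns (start_x, start_y, length, score) of the best fragment.
    """
    def walk(step, pi, pj):
        best = score = 0
        best_len = length = 0
        while 0 <= pi < len(x) and 0 <= pj < len(y):
            score += match if x[pi] == y[pj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break
            pi += step
            pj += step
        return best, best_len

    right_score, right_len = walk(+1, i + k, j + k)
    left_score, left_len = walk(-1, i - 1, j - 1)
    score = k * match + left_score + right_score
    return (i - left_len, j - left_len, left_len + k + right_len, score)

if __name__ == "__main__":
    print(extend_ungapped("TTACGTACGTAA", "GGACGTACGTCC", 4, 4, 3))
```

Because no gaps are allowed, the fragment stays on the diagonal of its seed, which is why hits are sorted by diagonal before this step.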

25 The full workflow modules

Module: Words
Description: building a k-words dictionary
- masking Low Complexity Regions
- words production
- sort of words (sort program)
- merging partial sets of words
Input / output: sequences / ordered set of words

Module: W2hd
Description: organizes the words in a hash table (on disk). For each word, this table contains the number of repeats and the positions of each repeat.
Input / output: ordered set of words / 2 levels hash table (Pfix

Module: Hits
Description: the same word in both sequences produces a hit for each pair of position combinations. The diagonal number of the hit is also computed.
Input / output: hash tables for each sequence / hits (diag, X, Y)

Module: BigHits
Description: seeds identification
- order the collection of hits (sorthits)
- identify consecutive hits as big-hits
- [filtering of isolated hits]
Input / output: hits collection (diag, posx, posy) / reduced big-hits collection (diag, posx, posy)

Module: Fragments
Description: un-gapped fragment detection by extension of seed points
Input / output: big-hits collection / un-gapped fragments

Module: Post-Process
Description: post-processing (available)
- words frequencies
- fragments distribution (length, score)
- dotplot visualization
- detailed fragment composition
Input / output: several intermediate files / several outputs

26 The full workflow vision

Workflow:
- Pre-computed dictionaries
- Simple modules: extreme frequencies analysis, annotation for functional genomics, visualization tools, etc.
- Include new features: scoring schemes (remote homologs)
- Stop, review, resume: interactive analysis

27 Benchmarking

28 E. coli (K12 vs O157) HSPs (5 Mbp), in 20 seconds

Reference: Krumsiek, Jan, et al. (2007), "Gepard: a rapid and sensitive tool for creating dotplots on genome scale", Bioinformatics, Vol. 23, no. 8.

29 Human ChrX HSPs (150 Mbp)

Comparative genomics is the study of the relationship of genome structure and function across different biological species or strains.

Species: Pan troglodytes, Macaca mulatta, Canis familiaris, Mus musculus, Rattus norvegicus, Bos taurus.

- Massive comparison of full genomes
- New insights and experimental data to contrast the evolutionary models for populations and species
- Complex evolutionary studies
- Identify evolution events: inversions, translocations, gene duplication
- Homology / genomic mutation
- Comparative maps: gene order and content (distances)
- Phylogenetic analysis (models)
- Whole genome alignment: finding conserved blocks
- Distances, models, ...
