Bioinformatics overview
- Cathleen Booker
- 6 years ago
Transcription
1 Bioinformatics overview Aplicações biomédicas em plataformas computacionais de alto desempenho Aplicaciones biomédicas sobre plataformas gráficas de altas prestaciones Biomedical applications in High performance computing platforms Oswaldo Trelles, PhD University of Malaga In this section we survey the bioinformatics application domain and the typical sources of data in the field.
2 Definition Bioinformatics: the application of computational techniques to the management and analysis of biological data. Disciplines: computer sciences, statistics, physics, chemistry, ...; information technologies. Data: molecular, clinical, population, environmental, ... Management: acquisition, storage, retrieval, transmission, processing, ... The early definition of bioinformatics must be updated to better describe its application domain. Several disciplines have allied with computer science, not only to support the management of digital information representing biological data, but also to provide unprecedented opportunities to increase the understanding of the functions and dynamics of individual cells at one end and of populations at the other. For biomedicine, the promise is personalized genetics, with implications for human health and medicine. It is well accepted by the scientific community that 'bioinformatics is interested in the computational management of all types of biological information, regardless of whether it refers to genes or their products, to whole organisms, or even to complete ecological systems'.
3 The domain of the data There is a huge diversity of types of data obtained from biological experiments. They range from sequence information about genomes, or simply about some interesting stretch of DNA, to data describing the interrelationships between different biological entities. The initial flow of information takes the form of biological sequences, which are sequenced, assembled, stored, distributed and made accessible via specialized web sites worldwide. Using sequence composition, the next step is to group similar stretches into longer ones (contigs) and to associate a biological function with these contigs. Sequence analysis involves the study of this kind of sequential information. 3D structural prediction needs products obtained from sequence analysis (such as conserved zones identified by multiple sequence analysis). 3D information about proteins also comes from direct experiments aimed at identifying such structural information. Transcriptomics makes it possible to study the behavior of the organism under different experimental conditions. This information is used to identify the genes involved in changes of protein levels in the cell. Since genes do not usually work alone, establishing the interactions between their products (proteins) is another important step. Pathway analysis and systems biology work on this type of data.
4 Featuring biological data Although it is commonplace to state that the volume of data in molecular biology is growing at exponential rates, the key characteristic of biological data is not so much its volume as its diversity, heterogeneity and dispersion. The accumulated biological knowledge needed to obtain a more complete view of any biological process (e.g. sequences, structures, gene-expression data, pathways) is disseminated around the world in the form of biological sequence and structure databases, frequently as flat files, as well as image/scheme-based libraries, web-based information, particular and specific query systems, etc. The heterogeneity in formats and storage media, together with the diversity and dispersion of the data, makes it difficult to use this plethora of interrelated information. Knowing the internals of these information sources, so that they can be accessed in a unified way, is a clear and important technological priority.
5 Data production The huge amount of information generated collectively by new parallel data acquisition technologies, such as Next Generation Sequencing (NGS), High Throughput (HT) proteomics and gene expression microarrays, imposes new challenges on computational data processing and analysis. These technologies offer biologists unprecedented opportunities to increase the understanding of the functions and dynamics of individual cells at one end and of populations at the other. For biomedicine, the promise is personalized genetics, with implications for human health and medicine. Figure (levels of biological organization): Atoms, Proteins, Interactions, Metabolic pathways, Cells, Organs, Organisms, Populations.
References
[1] Collins, F.S., et al. (1998). New goals for the U.S. Human Genome Project. Science 282, 5389,
[2] Houle et al. (2000). Database mining in the human genome initiative (white paper),
[3] Venter, J. Craig, et al. (2001). The sequence of the human genome. Science, vol 291, Issue 5507,
[4] Schena M, Shalon D, Davis RW, Brown PO (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995 Oct 20;270(5235):
[5] Zimdahl, H., et al. (2004). A SNP Map of the Rat Genome generated from cDNA sequences. Science Vol 303, Feb 2004
[6] NCBI, National Center for Biotechnology Information (1999). GenBank statistics.
[7] Expasy server: Swiss-Prot protein knowledgebase statistics:
[8] EBI, European Bioinformatics Institute. Statistics:
[9] Genome databases:
6 Data Volume The latest technological advances in biological and biomedical data acquisition are creating mountains of data. New technologies such as nanopore sequencing, single molecule sequencing or the better-known NGS technologies are revolutionizing the age of data analysis. In particular, NGS has been highlighted, for example, in the special edition on 'Big Data: welcome to the petacentre, science in the petabyte era' of Nature (Editorial 2008/11). The arrival of DNA sequencing technologies that produce vast amounts of sequence information has triggered a paradigm shift in genomics, enabling direct genome comparison and parallel surveying of populations. While sequencing is the champion of recent advances in 'big data', metabolomics, proteomics and microarrays also generate large amounts of data. Ultrahigh density microarrays produce more than 5 million probes on a single microarray slide, against one hundred thousand probes per array a few years ago. Not only has the number of data points per sample increased; the number of samples per experiment has also been growing at a rapid rate. For example, the analysis of expression Quantitative Trait Loci (eQTL) using tiling or SNP arrays is already problematic on a single computer for 100 samples. With recent projects genotyping over 15,000 patients, and growing, the analysis problem is imminent. Figure: growth in the volume of data stored in UniProt and in the number of entries in UniProt + TrEMBL; on the right, the number of protein structures in PDB.
7 Diversity of types of data > E bp DNA linear gaattctaac ggtcccgaaa ctctgtgcgg tgctgaactg gttgacgctc tgcagtttgt ttgcggtgac cgtggttttt attttaacaa acccactggt tatggttctt cttctcgtcg tgctccccag actggtattg ttgacgaatg ctgctttcgt tcttgcgacc tgcgtcgtct ggaaatgtat tgcgctcccc tgaaacccgc taaatctgct tagaagctt Apart from the capacity for data production, modern molecular biology research spans different (related) domains and data types: not only sequence data but also full genomes, protein folding or 3D conformation, expression levels of genes associated with the protein levels in the cells, metabolic pathways, protein interactions, protein domains, etc. Moreover, new data types are continuously being developed as new technologies emerge for creating and analyzing data. Figure: a DNA sequence in FASTA format is shown in the upper part of the image, followed below by a 3D protein model and the light intensities of a DNA microarray; finally, at the bottom, a mass spectrogram and an electrophoresis agarose gel to separate proteins.
8 Format heterogeneity Unfortunately, bioinformatics has grown in a somewhat chaotic way in some respects. The same sequence can be found stored in different formats, even on very well-known servers such as EBI or NCBI.
Text box 1: the DNA encoding human insulin-like growth factor I (IGF-I) as available at GenBank: E (search for insulin in 'All databases').
LOCUS E bp DNA linear PAT 04-NOV-2005
DEFINITION DNA encoding human insulin-like growth factor I(IGF-I).
ACCESSION E01306
VERSION E GI:
KEYWORDS JP A/1.
SOURCE synthetic construct
ORGANISM synthetic construct
other sequences; artificial sequences.
REFERENCE 1 (bases 1 to 229)
AUTHORS Raasu,A., Toomasu,M., Berun,N. and Majiasu,U.
TITLE METHOD FOR TRANSPORTING GENE PRODUCT TO MEDIUM PROPAGATING GRAM NEGATIVE BACTERIA
JOURNAL Patent: JP A 1 20-AUG-1987; KABIGEN AB
COMMENT OS Artificial gene
OC Artificial sequence; Genes.
OS Homo sapiens
PN JP A/1
PD 20-AUG-1987
CC strandedness: Single;
CC topology: Linear;
CC hypothetical: No;
CC anti-sense: No;
FH Key Location/Qualifiers
FT mat_peptide
FT /product='human insuline-like growth factor I
FT CDS >
FEATURES Location/Qualifiers
source
/organism="synthetic construct"
/mol_type="unassigned DNA"
/db_xref="taxon:32630"
ORIGIN
1 gaattctaac ggtcccgaaa ctctgtgcgg tgctgaactg gttgacgctc tgcagtttgt
ttgcggtgac cgtggttttt attttaacaa acccactggt tatggttctt cttctcgtcg
tgctccccag actggtattg ttgacgaatg ctgctttcgt tcttgcgacc tgcgtcgtct
ggaaatgtat tgcgctcccc tgaaacccgc taaatctgct tagaagctt
//
Text box 2: the same insulin (E01306) sequence in EMBL-style flat-file format (in both text boxes some lines have been removed).
ID E01306; SV 1; linear; unassigned DNA; PAT; SYN; 229 BP.
AC E01306;
DT 07-OCT-1997 (Rel. 52, Created)
DT 09-NOV-2005 (Rel. 85, Last updated, Version 3)
DE DNA encoding human insulin-like growth factor I(IGF-I).
KW JP A/1.
OS synthetic construct
OC other sequences; artificial sequences.
RA Raasu A., Toomasu M., Berun N., Majiasu U.;
RT "METHOD FOR TRANSPORTING GENE PRODUCT TO MEDIUM PROPAGATING GRAM
RT NEGATIVE BACTERIA";
RL Patent number JP A/1, 20-AUG
RL KABIGEN AB.
CC OS Artificial gene
CC OC Artificial sequence; Genes.
CC OS Homo sapiens
CC CC strandedness: Single;
CC CC topology: Linear;
CC CC hypothetical: No;
CC CC anti-sense: No;
CC FH Key Location/Qualifiers
CC FT CDS >
CC FT /product="human insulin-like growth factor I"
FH Key Location/Qualifiers
FT source
FT /organism="synthetic construct"
FT /mol_type="unassigned DNA"
FT /db_xref="taxon:32630"
SQ Sequence 229 BP; 40 A; 57 C; 55 G; 77 T; 0 other;
gaattctaac ggtcccgaaa ctctgtgcgg tgctgaactg gttgacgctc tgcagtttgt
ttgcggtgac cgtggttttt attttaacaa acccactggt tatggttctt cttctcgtcg
tgctccccag actggtattg ttgacgaatg ctgctttcgt tcttgcgacc tgcgtcgtct
ggaaatgtat tgcgctcccc tgaaacccgc taaatctgct tagaagctt 229
//
9 Dispersion of data sources Currently, more than 1000 biological data collections are publicly available on the Internet, with different levels of quality (e.g. curated databases), in different formats, and including a variety of links to related information available in other collections. Bioinformatics tasks usually require the application of different services to combine or process the data to different degrees. In a distributed and diverse environment, data and service integration become crucial for efficient data analysis in bioinformatics. See: [1] Infobiogen: Catalog of Databases:
10 Bioinformatics servers The Internet browser has become the in-silico counterpart of the pipette in traditional wet labs, where compounds are combined and recombined to produce the desired result. In a similar way, bioinformatics relies strongly on the universal availability of web resources, which often need to be combined to produce a useful outcome. This set of computational compounds resides in more than one thousand databases (Galperin, 2009) and more than 130 servers linking around 1200 services (Fox J. et al. 2008).
11 Types of data and applications (overview)
12 Sequencing data The long DNA chain is split into small fragments that are read using sequencing technology. Bioinformatics tasks over the final data include:
- Getting statistical summaries of the base-call quality scores to study the data quality.
- Calculating a coverage vector and exporting it for visualization in a genome browser.
- Reading in annotation data from a GFF file.
- Assembling fragments into longer chunks called contigs.
- Assigning aligned reads to exons and genes.
- Biologically intelligent interpretation of genomic data.
>000014_1863_0292 length=76 uaccno=fgsmdpn08etuie
AATACTCAGGAATCGAACGGACTCGGGTATAGTATATGATCGGCAGCCAGCCG
AACATAACAGCGGCATGAAAACC
>000016_1821_0619 length=120 uaccno=fgsmdpn08ep50t
GGCAAGTTTTCGGTGTCGCTAAGCCCGAGATATCGCAGCTCACCCGTGTCGGC
GATTGCTGCTGTGACCGTCCCCAGTCGGTCACCCTCCGGCTGATTCTATCCTT
ACATCGGTCGTTTC
>000021_1845_1786 length=69 uaccno=fgsmdpn08esarw
ATCCGCGCGGCCGCATTGTCGACACTGCCTGCCGGCAGTGAAGGCGAGGCGCA
GGTGGCCGATGCGCTG
>000030_1849_0863 length=69 uaccno=fgsmdpn08esmpd
ATCCGCGCGGCCGCATTGTCGACACTGCCTGCCGGCAGTGAAGGCGAGGCGCA
GGTGGCCGATGCGCTG
>000035_1856_0283 length=148 uaccno=fgsmdpn08es8dp
GACGCCCTTTATGCACGTTTCGCTCACAGTATCCCTTAATAGCAAGATTAATA
CCCTCAGTGGCCCCACTAGTAAAAACGATCTCTCGAGAACGACAGTTCAGTTC
ATTGGCAATCAATTTTCGGGCCGTTTCTTACCGCCTCCTCAG
FASTQ has emerged as a common file format for sharing sequencing read data, combining both the sequence and an associated per-base quality score. PHRED introduced the concept of base-calling quality in terms of the estimated probability of error $p_e$: $Q_{\mathrm{PHRED}} = -10 \log_{10}(p_e)$. This information can be stored as a plain-text (space-separated) set of positive numbers using a format similar to FASTA. It can also be stored as a string of ASCII characters (adding 33 to each score to avoid unprintable characters). Note: complete the information about the FASTQ format from the Web.
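The PHRED relation and the ASCII (+33) encoding mentioned above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the offset of 33 is the one stated on the slide.

```python
import math

def phred_from_error(p_error):
    """Q = -10 * log10(p_e): quality score from an error probability."""
    return -10.0 * math.log10(p_error)

def error_from_phred(q):
    """Inverse relation: p_e = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

def decode_quality_string(qual, offset=33):
    """Turn an ASCII quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in qual]

def encode_quality_scores(scores, offset=33):
    """Turn integer scores into a printable ASCII quality string."""
    return "".join(chr(q + offset) for q in scores)

scores = decode_quality_string("!5I")
print(scores)                                   # integer scores per base
print([error_from_phred(q) for q in scores])    # error probability per base
```

A score of 30 therefore corresponds to an estimated error probability of one in a thousand, and each extra 10 points divides that probability by ten.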
13 Assembling the puzzle In a first step, the software (in general provided by the supplier of the sequencing device) interprets the spectrograms and translates them into a sequence of letters. An exhaustive and resource-consuming procedure is then needed to assemble the fragments into longer contigs... the sequence is coming up. It is important to mention the necessary quality control of the sequencer output, which removes sequences belonging to the cloning vectors used, linkers, and low-quality data.
14 Biological sequence data biológicas >ref NT_ : Drosophila melanogaster chromosome 2L CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGG GAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAA TGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTT AAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCG TTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGAC TGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGAT ATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGC CAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTC AAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAA GACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCG AGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAAC CCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATAT TATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAAC CCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGC AACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGC CTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGA GATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATAT AATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGA ACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAG TAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAAT ATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGT ATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACC CAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCG TATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACC 
GCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTC CCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTT GGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATT CATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAAT GAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGC AGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCT CTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGA AGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATG CGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT GATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGA TCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATG ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTAT TTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTT TCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGA TTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGT GCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGA GAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTG TGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGAC AGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCG
The final product of the assembly process is a string of [A,C,G,T] characters, preceded by a first line with some information that identifies the sequence (see the text box on the left for a 3290-nucleotide sequence whose first line gives some information about the sequence location). FASTA is the favorite format for this type of data. In our simile of the book of life, we now have the text, but the meaning of the message is still unknown: its punctuation symbols, its final product, the biological process in which it is involved, etc.
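As a small illustration of how simple the FASTA layout is (one header line starting with ">", followed by the sequence split over any number of lines), here is a minimal reader sketch in Python. The function name `parse_fasta` and the toy records are assumptions made for the example; they are not part of the course material.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>seq1 toy record
CGACAATGCA
CGACAGAGG
>seq2
ACGT
"""
for name, seq in parse_fasta(example):
    print(name, len(seq), seq)
```

The header is kept as free text because, as the text box shows, its content varies from one data source to another.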
15 Sequence databases
ID 100K_RAT STANDARD; PRT; 889 AA.
AC Q62671;
DT 01-NOV-1997 (Rel. 35, Created)
DT 01-NOV-1997 (Rel. 35, Last sequence update)
DT 15-JUL-1999 (Rel. 38, Last annotation update)
DE 100 KD PROTEIN (EC ).
OS Rattus norvegicus (Rat).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
OC Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
RN [1]
RP SEQUENCE FROM N.A.
RC STRAIN=WISTAR; TISSUE=TESTIS;
RX MEDLINE;
RA MUELLER D., REHBEIN M., BAUMEISTER H., RICHTER D.;
RT "Molecular characterization of a novel rat protein structurally
RT related to poly(A) binding proteins and the 70K protein of the U1
RT small nuclear ribonucleoprotein particle (snRNP).";
RL Nucleic Acids Res. 20: (1992).
RN [2]
RP ERRATUM.
RA MUELLER D., REHBEIN M., BAUMEISTER H., RICHTER D.;
RL Nucleic Acids Res. 20: (1992).
CC -!- FUNCTION: E3 UBIQUITIN-PROTEIN LIGASE WHICH ACCEPTS UBIQUITIN FROM
CC AN E2 UBIQUITIN-CONJUGATING ENZYME IN THE FORM OF A THIOESTER AND
CC THEN DIRECTLY TRANSFERS THE UBIQUITIN TO TARGETED SUBSTRATES (BY
CC SIMILARITY). THIS PROTEIN MAY BE INVOLVED IN MATURATION AND/OR
CC POST-TRANSCRIPTIONAL REGULATION OF MRNA.
CC
CC This SWISS-PROT entry is copyright. It is produced through...
CC
DR EMBL; X64411; CAA ; -.
DR PFAM; PF00632; HECT; 1.
DR PFAM; PF00658; PABP; 1.
KW Ubiquitin conjugation; Ligase.
FT DOMAIN ASP/GLU-RICH (ACIDIC).
FT DOMAIN PRO-RICH.
FT DOMAIN ASP/GLU-RICH (ACIDIC).
FT BINDING UBIQUITIN (BY SIMILARITY).
SQ SEQUENCE 889 AA; MW; DD7E6C7A CRC32;
MMSARGDFLN YALSLMRSHN DEHSDVLPVL DVCSLKHVAY VFQALIYWIK AMNQQTTLDT
PQLERKRTRE LLELGIDNED SEHENDDDTS QSATLNDKDD ESLPAETGQN HPFFRRSDSM
VYEYVRKYAE HRMLVVAEQP LHAMRKGLLD VLPKNSLEDL TAEDFRLLVN GCGEVNVQML
ISFTSFNDES GENAEKLLQF KRWFWSIVER MSMTERQDLV YFWTSSPSLP ASEEGFQPMP
SITIRPPDDQ HLPTANTCIS RLYVPLYSSK QILKQKLLLA IKTKNFGFV
//
One of the first data processing tasks in genomic projects is to provide support for sequence data management. Although most sequence data are stored as plain-text flat files, these collections have historically been called databases. There is a high diversity of records, from the simplest ones, which contain only a string of characters representing the nucleotide sequence of some stretch of DNA, or a protein represented by its sequence of amino acids, to more complete ones that also provide functional annotations for the corresponding sequence. Data management and public accessibility of these sources of information is one of the most active areas in bioinformatics. Retrieval of sequence data and functional annotations, by-content database searching, sequence alignment, etc. are frequent services offered to exploit such sources of data.
16 Data and software (Text box: the same Drosophila melanogaster chromosome 2L FASTA sequence shown in slide 14.) In the end, as we should already know, what we obtain are the linear strings of the sequences, which may make up a complete genome, a particular gene, or a protein. These sequences are stored, together with other data that describe them, in public databases accessible via the Internet. Besides maintaining the databases, these web sites provide computational services over these data sets, for example comparing a new sequence against all those stored in the database to find out its function, or comparing several sequences with one another to see what they have in common, etc. Some web sites: the European Bioinformatics Institute (EBI), the American NCBI, and the protein data resources (Swiss-Prot and PDB) at the Swiss Institute (SIB). In general we will treat software as a technology, without going into the details of its classification, but showing various examples of the kinds of software required today.
17 Use Case: HSPs dealing with Big-Data This section is part of the Supplementary Material of "Out-of-core computation of HSPs for large sequences", by Andrés Rodriguez, Óscar Torreño and Oswaldo Trelles. More info and programs at:
18 Dotplots In more formal terms, let $S_n = \{x_1, x_2, \ldots, x_n\}$ be a genomic sequence composed of a linear string of symbols belonging to the DNA alphabet ($x_i \in A = \{A, C, G, T\}$). The number of symbols in the chain is the length of the sequence, $|S| = n$. Given two genomic sequences $X_n$ and $Y_m$, a dotplot $D$ is an $(n \times m)$ matrix such that $D_{i,j} = 1$ when $x_i = y_j$ and $D_{i,j} = 0$ otherwise, for $i = 1, \ldots, n$ and $j = 1, \ldots, m$. The algorithmic complexity of the dotplot calculation is $O(nm)$. When an averaging window of size $W$ is used to reduce the noise of random matches, the complexity grows to $O(nmW)$. In this case $D_{i+W/2,\,j+W/2} = 1$ when $T < S = \sum_{k=1}^{W} \mathrm{TRUE}(x_{i+k} = y_{j+k})$, where $T$ is a given noise threshold and $S$ is the identity level. Figure: basic dot-plot procedure. On the left, two short sequences are compared by setting a dot at the intersection of identity matches. Shaded zones show diagonals, the typical signal that reflects sequence similarities. In this case, the GNLEREC sub-sequence is present in both the horizontal and the vertical sequence, together with other small fragments, such as CSF and REV. On the right-hand side, a more realistic plot for longer sequences. In this case it is also possible to observe small inverted diagonals representing palindromic sub-sequences.
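The two definitions above translate almost directly into code. The following Python sketch is a naive in-memory illustration, not the out-of-core method discussed later: function names are hypothetical, indices are 0-based (so the window sum runs over k = 0..W-1 instead of 1..W), and the window result is stored at the window centre as in the formula.

```python
def dotplot(x, y):
    """D[i][j] = 1 when x[i] == y[j], else 0. Cost: O(n*m)."""
    return [[1 if a == b else 0 for b in y] for a in x]

def windowed_dotplot(x, y, w, t):
    """Set a dot at the window centre when more than t of the w
    aligned symbols are identical. Cost: O(n*m*w)."""
    n, m = len(x), len(y)
    d = [[0] * m for _ in range(n)]
    for i in range(n - w + 1):
        for j in range(m - w + 1):
            identity = sum(1 for k in range(w) if x[i + k] == y[j + k])
            if identity > t:
                d[i + w // 2][j + w // 2] = 1
    return d

for row in dotplot("GNLEREC", "GNLEREV"):
    print("".join("*" if cell else "." for cell in row))
```

Printing the matrix row by row makes the main diagonal of the shared sub-sequence visible, together with the isolated dots that the averaging window is meant to suppress.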
19 Dotplots: speed-up the process To reduce the computational space and accelerate data processing, most of the proposed strategies use some kind of pre-processing step. K-mers are used as prefixes for the fast identification of matching words, which serve as seed points from which to extend the local alignment. The image shows a hash table built with different prefix lengths (K=1, 2, 3). The header contains the word and the table contains the positions at which that word appears in the sequence. The longer the prefix, the fewer its occurrences. In this case the hash is built on full identity, so all the words under a given header entry are the same (this is a tradeoff between sensitivity and memory requirements). The number of possible hash headers is $4^K$, where K is the word length, although not all the combinations are necessarily present in the sequence. The exact number of words is L-K+1, where L is the sequence length.
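A minimal in-memory sketch of such a word table, written in Python, is shown below. It assumes 1-based positions (as in the example of slide 20) and a hypothetical function name `kmer_index`; the real programs keep these tables on disk.

```python
from collections import defaultdict

def kmer_index(seq, k):
    """Map every word of length k to the 1-based positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + 1)
    return dict(index)

seq = "TCAGACGATTG"
for k in (1, 2, 3):
    index = kmer_index(seq, k)
    n_words = sum(len(pos) for pos in index.values())
    # n_words equals L - K + 1; the number of distinct headers is at most 4**K
    print(k, n_words, len(index), 4 ** k)
```

As K grows, each header lists fewer positions, which is the effect described above: fewer candidate seeds, at the price of missing shorter matches.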
20 HITS: k-mers are used as seed points
Hash table for seq X. Positions of the symbols: seqx: TCAGACGATTG, n=11
Hash table (seqx for K=1):
A 3, 5, 8
C 2, 6
G 4, 7, 11
T 1, 9, 10
Hash table for seq Y. Positions of the symbols: seqy: ATCGGAGCTG, n=10
Hash table (seqy for K=1):
A 1, 6
C 3, 8
G 4, 5, 7, 10
T 2, 9
Identical words produce hits at the coordinates where they appear (diag = h - v when $x_h$ matches $y_v$).
(A) produces hits at: (3, 1), (3, 6), (5, 1), (5, 6), (8, 1) and (8, 6)
(C) produces hits at: (2, 3), (2, 8), (6, 3) and (6, 8)
The number of hits depends on the number of repetitions of the matching words, which in turn depends on the size of K.
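The hit generation in this example can be reproduced with a short Python sketch. Function names are hypothetical and everything is held in memory; each hit is reported as (diag, h, v) with diag = h - v, following the convention on the slide.

```python
from collections import defaultdict

def kmer_index(seq, k):
    """Word -> list of 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + 1)
    return dict(index)

def find_hits(index_x, index_y):
    """Every pair of positions sharing the same word yields one hit."""
    hits = []
    for word in sorted(index_x):
        for h in index_x[word]:
            for v in index_y.get(word, []):
                hits.append((h - v, h, v))   # (diagonal, pos in X, pos in Y)
    return hits

seq_x, seq_y = "TCAGACGATTG", "ATCGGAGCTG"
hits = find_hits(kmer_index(seq_x, 1), kmer_index(seq_y, 1))
print(len(hits))
print([(h, v) for d, h, v in hits if seq_x[h - 1] == "C"])
```

With K=1 almost every position pair becomes a hit; rerunning the same code with a larger K shows how quickly the hit list shrinks.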
21 Big-Hits: joining hits by proximity Hits in the same diagonal at a distance shorter than a parameter D are joined to form a big-hit. In the image, the three hits in the first upper diagonal (at distances X1, X2 < D) and the two hits in the second diagonal (X3 < D) are joined to form a 3BigHit and a 2BigHit. Figure: behavior of the seed-point computational space as a function of K (word length) and of the inter-hit distance parameter used to group neighboring hits. Real data (chromosome X from several species) have been used in the simulation.
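The grouping rule can be sketched as a single pass over hits sorted by diagonal and position. In this Python illustration the hit layout (diag, h, v), the function name, and the choice of measuring distance between consecutive hits along the X coordinate are assumptions made for the example.

```python
def join_big_hits(hits, max_dist):
    """Group consecutive hits on the same diagonal whose distance is
    shorter than max_dist. Returns (diag, first_h, last_h, n_hits)."""
    groups = []
    for diag, h, _v in sorted(hits):
        if groups and groups[-1][0] == diag and h - groups[-1][2] < max_dist:
            groups[-1][2] = h          # extend the current big-hit
            groups[-1][3] += 1
        else:
            groups.append([diag, h, h, 1])
    return [tuple(g) for g in groups]

hits = [(0, 1, 1), (0, 3, 3), (0, 10, 10), (2, 5, 3), (2, 6, 4)]
for big_hit in join_big_hits(hits, max_dist=4):
    print(big_hit)
```

Because the hits are already ordered by diagonal and offset, the grouping needs only the previous group to be kept in memory, which suits a sequential, disk-based implementation.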
22 HSPs out-of-core: the global idea
Input: Sequence 1 (W size), Sequence 2
- Pre-processing: masking LCR
- Hashing (including sub-processes such as sorting, grouping, etc.)
- HITS by diagonal, including BigHits detection and pruning (see slide 8 for hits positions)
- HITS extension: search for similarities using hits as seed points
- Post-processing: visualization, frequencies, word distributions, etc.
23 1st Step: Building the dictionaries (hash tables on disk)
- words: scans the sequence, storing on disk the collection of words and their positions.
- sort words: orders the collection by word (identical words end up in consecutive positions).
- w2hd: creates the hash table (named the dictionary) on disk.
Pipeline: Sequence -> words -> Words & pos -> Sort -> Ordered words -> W2HD -> Dictionary
It is worth noting that this process needs to be done only once for each sequence (or twice if the reverse complement is also going to be analysed). The dictionary can be computed for K=32, which contains all the prefixes for k < K.
24 2nd Step: alignments from hits (ungapped fragments detection)
Starting point: the dictionaries of the sequences to compare.
- Hits: hits production based on identical words (the size of K can be redefined to increase sensitivity).
- Sort: by diagonal and offset in the diagonal.
- Big-Hits: join hits by proximity.
- FragsFromHits: the main procedure. Starting from the seed points, extend the local ungapped alignment.
- Other: several tools can be used for post-processing.
Pipeline: Dictionary seq. X + Dictionary seq. Y -> HITS (K value) -> Hits -> SORT (by diagonal/offset) -> Ordered hits -> Big-Hits (D) -> Big hits -> FragsFromHits -> Fragments -> ViewFrags
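The slides do not detail how a seed is extended, so the following Python sketch should be read only as one common way of doing an ungapped extension, not as the FragsFromHits procedure itself. The scoring values (+1 match, -1 mismatch), the drop-off threshold `xdrop`, the 0-based coordinates and the function name are all assumptions made for this illustration.

```python
def extend_ungapped(x, y, h, v, k, match=1, mismatch=-1, xdrop=2):
    """Extend a seed x[h:h+k] == y[v:v+k] in both directions without gaps.
    Stops when the score falls xdrop below the best score seen so far.
    Returns (start_x, start_y, length, score) of the best fragment."""
    # Extension to the right of the seed
    score = best = k * match
    best_end, i = k, k
    while h + i < len(x) and v + i < len(y):
        score += match if x[h + i] == y[v + i] else mismatch
        i += 1
        if score > best:
            best, best_end = score, i
        elif best - score >= xdrop:
            break
    # Extension to the left of the seed, starting from the best score
    score, best_start, i = best, 0, 1
    while h - i >= 0 and v - i >= 0:
        score += match if x[h - i] == y[v - i] else mismatch
        if score > best:
            best, best_start = score, i
        elif best - score >= xdrop:
            break
        i += 1
    return h - best_start, v - best_start, best_start + best_end, best

x, y = "TTACGTACGG", "CCACGTACAA"
start_x, start_y, length, score = extend_ungapped(x, y, 4, 4, 3)
print(x[start_x:start_x + length], y[start_y:start_y + length], score)
```

Since the extension never leaves the diagonal of the seed, a fragment is fully described by its two start coordinates, its length and its score, which matches the kind of output listed for the Fragments module.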
25 The full workflow modules
Module / Description / Input / Output
- Words. Building a k-words dictionary: masking Low Complexity Regions; words production; sort of words (sort program); merging partial sets of words. Input: sequences. Output: ordered set of words.
- W2hd. Organizes the words in a hash table (on disk). This table contains, for each word, the number of repeats and the position of each repeat. Input: ordered set of words. Output: 2 levels hash table (Pfix
- Hits. The same word in both sequences produces a hit for each pair of position combinations. The diagonal number of the hit is also computed. Input: hash tables for each sequence. Output: hits (diag, X, Y).
- BigHits. Seeds identification: order the collection of hits (sorthits); identify consecutive hits as big-hits; [filtering of isolated hits]. Input: hits collection (diag, posx, posy). Output: reduced Big-Hits collection (diag, posx, posy).
- Fragments. Un-gapped fragment detection by extension of seed points. Input: Big-Hits collection. Output: un-gapped fragments.
- Post-Process. Post-processing (available): word frequencies; fragment distribution (length, score); dotplot visualization; detailed fragment composition. Input: several intermediate files. Output: several outputs.
26 The full workflow vision
Workflow:
- Pre-computed dictionaries
- Simple modules: extreme frequencies analysis, annotation for functional genomics, visualization tools, etc.
- Include new features: scoring schemes (remote homologues)
- Stop / Review / Resume interactive analysis
27 Benchmarking
28 E.coli (K12 vs O157) hsp ( 5 Mbp) Reference: Krumsiek, Jan, et al. (2007); "Gepard: a rapid and sensitive tool for creating dotplots on genome scale"; Bioinformatics Vol. 23 no. 8, in 20 seconds
29 Human ChrX HSPs (150 Mbp) Comparative genomics is the study of the relationship of genome structure and function across different biological species or strains: Pan troglodytes, Macaca mulatta, Canis familiaris, Mus musculus, Rattus norvegicus, Bos taurus.
- Massive comparison of full genomes
- New insights and experimental data to contrast the evolutionary models for populations and species
- Complex evolutionary studies
- Identify evolution events: inversions, translocations, gene duplication
- Homology / genomic mutation
- Comparative maps
- Gene order and content (distances)
- Phylogenetic analysis (models)
- Whole genome alignment
- Finding conserved blocks
- Distances, models,