Introduction to Bioinformatics and Gene Expression Technologies

Utah State University, Fall 2017
Statistical Bioinformatics (Biomedical Big Data), Notes 1

Vocabulary
- Gene: hereditary DNA sequence at a specific location on a chromosome (that "does something")
- Genetics: study of heredity & variation in organisms
- Genome: an organism's total genetic content (full DNA sequence)
- Genomics: study of organisms in terms of their genome
- Protein: sequence of amino acids that "does something"
- Proteomics: study of all of the proteins that can come from an organism's genome
- Phylogeny: the evolutionary or historical development of an organism (or its DNA sequence)
- Phylogenetics: the study of an organism's phylogeny
- Phenotype: the physical characteristic of interest in each individual (for example, plant height, disease status, or embryo type)
- Bioinformatics: the collection, organization, & analysis of large-scale, complex biological data
- Statistical Bioinformatics: the application of statistical approaches to bioinformatics, especially in identifying significant changes (in sequences, expression patterns, etc.) that are biologically relevant (especially in affecting the phenotype)

Central Dogma of Molecular Biology

A road map to bioinformatics
(From introductory lecture by RW Doerge at 2013 Joint Statistical Meetings)

Central Dogma                    Technology                                    Type of Study or Analysis
Gene (Genome)                    Sequencing                                    Genomic Hypothesis; Genotype; QTL
mRNA transcript (Transcriptome)  Microarrays or Next-Gen Sequencing            Transcript Profiling
                                 (Epigenetics / methylation)
Protein (Proteome)               Protein Microarrays or Proteomics             Protein quantification and function
Phenotype

Alphabets
- DNA sequences defined by nucleotides (4)
- Protein sequences defined by amino acids (20)
- DNA sequence -> mRNA sequence -> Protein sequence

General assumption of gene expression technology
- Use mRNA transcript abundance level as a measure of the level of expression for the corresponding gene
- Proportional to degree of gene expression
- Side note: a methylated gene is silenced (no expression)
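The two alphabets and the DNA -> mRNA -> protein chain can be illustrated with a small Python sketch. This is an illustration only, not part of the original slides: the function names and the input sequence are hypothetical, the input is assumed to be the coding strand read 5' to 3', and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA (alphabet A, C, G, T) to mRNA (alphabet A, C, G, U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical 9-base example sequence.
mrna = transcribe("ATGGCCTGA")
print(mrna, translate(mrna))
```

The 4-letter nucleotide alphabet yields 4^3 = 64 possible codons, which is why the table above has 64 entries mapping onto 20 amino acids plus stop signals.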

How to measure mRNA abundance?
Several different approaches with similar themes:
- Affymetrix GeneChip oligonucleotide arrays
- Nimblegen array
- Two-color cDNA array
- More modern: next-generation sequencing (NGS)
Representation of genes on slide:
- Small portion of gene ("oligo")
- Larger sequence of gene
- Blank slate (NGS)

General DNA sequencing
- Sanger: 1970's to today; most reliable, but expensive
- Next-generation [high-throughput] (NGS):
  - Genome Sequencer FLX (GS FLX, by 454 Sequencing)
  - Illumina's Solexa Genome Analyzer
  - Applied Biosystems SOLiD platform
  - others
- Key aspect: sequence (and identify) all sequences present

Common features of NGS technologies (1)
- fragment prepared genomic material
  - biological system's RNA molecules: RNA-Seq
  - DNA or RNA interaction regions: ChIP-Seq, HITS-CLIP
  - others
- sequence these fragments (at least partially)
  - produces HUGE data files (~10 million fragments sequenced)

Common features of NGS technologies (2)
- align sequenced fragments with reference sequence
  - usually, a known target genome (GIGO: garbage in, garbage out)
  - alignment tools: ELAND, MAQ, SOAP, Bowtie, others
  - often done with command-line tools
  - still a major computational challenge
- count number of fragments mapping to certain regions
  - usually, genes
  - these read counts linearly approximate target transcript abundance
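The final step above (counting fragments that map to gene regions) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not how any particular alignment or counting tool works: the gene coordinates, the read tuples, and the function name count_reads are all hypothetical, and a read is counted for a gene whenever it overlaps the gene's interval by at least one base.

```python
# Hypothetical annotation: gene name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
    "geneC": ("chr2", 50, 300),
}

# Hypothetical aligned reads: (chromosome, leftmost mapped position, read length).
READS = [
    ("chr1", 120, 50),   # inside geneA
    ("chr1", 480, 50),   # overlaps the end of geneA
    ("chr1", 600, 50),   # between genes, not counted
    ("chr1", 790, 50),   # overlaps the start of geneB
    ("chr2", 100, 50),   # inside geneC
    ("chr1", 1201, 50),  # starts just past geneB, not counted
]


def count_reads(reads, genes):
    """Count reads overlapping each gene interval by at least one base."""
    counts = {name: 0 for name in genes}
    for chrom, pos, length in reads:
        read_end = pos + length - 1
        for name, (g_chrom, g_start, g_end) in genes.items():
            if chrom == g_chrom and pos <= g_end and read_end >= g_start:
                counts[name] += 1
    return counts


print(count_reads(READS, GENES))
```

The nested loop is quadratic in the worst case; with millions of reads a real pipeline would rely on sorted or indexed intervals, but the resulting table has the same shape: one count per gene per sample.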

Here, RNA-Seq:
- recall central dogma: DNA -> mRNA -> protein -> action
- quantify [mRNA] transcript abundance
- Isolate RNA from cells, fragment at random positions, and copy into cDNA
- Attach adapters to ends of cDNA fragments, and bind to flow cell (Illumina has a glass slide with 8 such lanes, so can process 8 samples on one slide)
- Amplify cDNA fragments in a certain size range (e.g., 200-300 bases) using PCR -> clusters of same fragment
- Sequence base-by-base for all clusters in parallel

https://www.youtube.com/watch?v=-7gk1hxwcte

(Figures originally illumina.com download)

Then align and map
- For sequence at each cluster, compare to [align with] reference genome
  - file format: millions of clusters per lane
  - approx. 1 GB file size per lane
- For regions of interest in reference genome (genes, here), count number of clusters mapping there
  - requires well-studied and well-documented genome

RNA-Seq Example: 8 patients, 56,621 genes
- 8 heart tissue samples
  - 4 control (no heart disease)
  - 4 cardiomyopathy (heart disease)
    - 2 restrictive (contracts okay, relaxes abnormally)
    - 2 dilated (enlarged left ventricle)
- These "Naples" data made public Nov 2015 by Institute of Genetics and Biophysics (Naples, Italy)

                 Ctrl_3  RCM_3  Ctrl_4  DCM_4  Ctrl_5  RCM_5  Ctrl_6  DCM_6
ENSG00000000003     308    498     362    554     351    353     220    309
ENSG00000000005       3    164       2     43      13     83      22     16
ENSG00000000419    1187   1249    1096   1303     970    863     637    684
ENSG00000000457     163    239     168    195     153    194      44    117
ENSG00000000460      63    108      83    109      87     43      54     51
ENSG00000000938     369    328     272    669    1216    193     861    292
...

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse71613

Common statistical research objectives
- Test each gene (row) for differential expression between conditions
  - Ctrl vs. non-ctrl
  - Dilated vs. Restrictive
  - Restrictive vs. Ctrl
  - etc.
- Test specific groups of genes (with a known common function) for overall expression differences between conditions
  - Which functions are differentially active between Ctrl and non-ctrl, for example?
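To make the "Ctrl vs. non-ctrl" comparison concrete, the Python sketch below computes a per-gene log2 fold change from the six rows shown above. It is a descriptive summary only, not a test of differential expression: it ignores differences in sequencing depth between samples and the variability within each group, both of which a proper analysis must model. The function name and the pseudocount of 1 (added to avoid taking the log of zero) are assumptions made for illustration.

```python
import math

SAMPLES = ["Ctrl_3", "RCM_3", "Ctrl_4", "DCM_4", "Ctrl_5", "RCM_5", "Ctrl_6", "DCM_6"]

# The six rows of read counts shown in the example table.
COUNTS = {
    "ENSG00000000003": [308, 498, 362, 554, 351, 353, 220, 309],
    "ENSG00000000005": [3, 164, 2, 43, 13, 83, 22, 16],
    "ENSG00000000419": [1187, 1249, 1096, 1303, 970, 863, 637, 684],
    "ENSG00000000457": [163, 239, 168, 195, 153, 194, 44, 117],
    "ENSG00000000460": [63, 108, 83, 109, 87, 43, 54, 51],
    "ENSG00000000938": [369, 328, 272, 669, 1216, 193, 861, 292],
}


def log2_fold_change(row, samples=SAMPLES, pseudocount=1.0):
    """log2 of (mean non-control count / mean control count), with a pseudocount."""
    ctrl = [c for s, c in zip(samples, row) if s.startswith("Ctrl")]
    other = [c for s, c in zip(samples, row) if not s.startswith("Ctrl")]
    mean_ctrl = sum(ctrl) / len(ctrl)
    mean_other = sum(other) / len(other)
    return math.log2((mean_other + pseudocount) / (mean_ctrl + pseudocount))


for gene, row in COUNTS.items():
    print(gene, round(log2_fold_change(row), 2))
```

A positive value means higher average counts in the cardiomyopathy samples, a negative value means higher average counts in the controls. Whether any such difference is statistically significant is exactly the question the research objectives above pose.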

A short word on bioinformatic technologies
- "Never marry a technology, because it will always leave you." (Scott Tingey, Director of Genetic Discovery at DuPont; shared in RW Doerge's introductory overview lecture at 2013 JSM)
- In this class, we will discuss only a couple of technologies, emphasizing their recurring statistical issues
  - These are perpetual (and compounding)

A Rough Timeline of Technologies
- (1995+) Microarrays
  - require probes fixed in advance
  - only set up to detect those
- (2005+) Next-Generation Sequencing (NGS)
  - typically involves amplification of genomic material (PCR)
- (2010+) Third-Generation Sequencing ("next-next-generation")
  - Pac Bio, Ion Torrent
  - no amplification needed; can sequence single molecule
  - longer reads possible; still (2013; 2016) showing high errors
- (2012+) Nanopore-Based Sequencing [very promising]
  - Oxford Nanopore, Genia, others
  - bases identified as whole molecule slips through nanoscale hole (like threading a needle); coupled with disposable cartridges; still (2013; 2016) under development
- (?+) more
- Differ in how sequencing is done; subsequent post-alignment statistical analysis basically the same
(see 2016 Goodwin et al. paper on Canvas course page, in Files)

Affymetrix Technology: GeneChip
- Each gene is represented by a unique set of probe pairs (usually 12-20 probe pairs per probe set)
- Each spot on array represents a single probe (with millions of copies)
- These probes are fixed to the array

Affymetrix Technology: Expression
- A tissue sample is prepared so that its mRNA has fluorescent tags; wait for hybridization; scan to light tag
(Images courtesy Affymetrix, www.affymetrix.com)

Affymetrix GeneChip Cartoon Representations (originally from Affymetrix outreach)
- Animation 1: GeneChip structure (1 min.)
- Animation 2: Measuring gene expression (2.5 min.)

Images: Affymetrix data is probe intensity
- Full Array Image
- Close-up of Array Image

How to analyze data meaningfully?
Consider (for any technology):
- Data quality
- Data distribution
- Data format & organization
- Appropriateness of measurement methods (& variance)
- Sources of variability (and their types)
- Appropriate models to account for sources of variability and address question of interest
- Meaning of P-values and appropriate tests of significance
- Statistical significance vs. biological relevance
- Appropriate and useful representation of results
Many useful tools available from Bioconductor

The Bioconductor Project
- Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data
  - Not just for RNA-Seq or microarray data
  - Like a living family of software packages, changing with needs
- Core team mainly at Fred Hutchinson Cancer Research Center, plus many other U.S. and international institutions
(Source: www.bioconductor.org)

Main Features of the Bioconductor Project
- Use of R
- Documentation and reproducible research
- Statistical and graphical methods
- Annotation
- Short courses
- Open source
- Open development

What will we do in this class?
- Learn basics of a few major Bioconductor tools
- Focus on statistical issues
- Discuss recent developments
- Learn to discuss all of this