Introduction to Bioinformatics and Gene Expression Technologies

Utah State University, Fall 2017
Statistical Bioinformatics (Biomedical Big Data), Notes 1

Vocabulary
Gene: hereditary DNA sequence at a specific location on a chromosome (that "does something")
Genetics: study of heredity & variation in organisms
Genome: an organism's total genetic content (full DNA sequence)
Genomics: study of organisms in terms of their genome

Vocabulary
Protein: sequence of amino acids that "does something"
Proteomics: study of all of the proteins that can come from an organism's genome
Phylogeny: the evolutionary or historical development of an organism (or its DNA sequence)
Phylogenetics: the study of an organism's phylogeny
Phenotype: the physical characteristic of interest in each individual (for example, plant height, disease status, or embryo type)

Vocabulary
Bioinformatics: the collection, organization, & analysis of large-scale, complex biological data
Statistical Bioinformatics: the application of statistical approaches to bioinformatics, especially in identifying significant changes (in sequences, expression patterns, etc.) that are biologically relevant (especially in affecting the phenotype)

Central Dogma of Molecular Biology

A road map to bioinformatics
Columns: central dogma level; technology; type of study or analysis
Gene (genome); sequencing; genomic hypothesis, genotype, QTL
mRNA transcript (transcriptome); microarrays or next-gen sequencing (epigenetics / methylation); transcript profiling
Protein (proteome); protein microarrays or proteomics; protein quantification and function
Phenotype
(From introductory lecture by RW Doerge at 2013 Joint Statistical Meetings)

Alphabets
DNA sequences defined by nucleotides (4)
DNA sequence → mRNA sequence → protein sequence
Protein sequences defined by amino acids (20)
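The change of alphabet along the central dogma can be sketched in a few lines of Python. This is a toy illustration only: the DNA sequence is hypothetical, the function names are made up for this example, and the codon table lists just the five codons the example needs (the full standard table has 64).

```python
# Toy illustration of the alphabets: DNA (4 letters) -> mRNA (4 letters) -> protein (20 letters).
# Only the codons needed for the example sequence are listed here.
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "GCU": "A",  # alanine
    "AAA": "K",  # lysine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop
}

def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T is replaced by U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGGCTAAATGGTAA"  # hypothetical coding sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(dna, "->", mrna, "->", protein)
```

Three nucleotides map to one amino acid, which is why a 4-letter alphabet is enough to encode a 20-letter one.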

General assumption of gene expression technology
Use mRNA transcript abundance level as a measure of the level of expression for the corresponding gene
Transcript abundance is assumed proportional to degree of gene expression
Side note: a gene with a methylated promoter is typically silenced (little or no expression)

How to measure mRNA abundance?
Several different approaches with similar themes:
  Affymetrix GeneChip (oligonucleotide array)
  Nimblegen array (oligonucleotide array)
  Two-color cDNA array
  More modern: next-generation sequencing (NGS)
Representation of genes on slide:
  Small portion of gene ("oligo")
  Larger sequence of gene
  "Blank slate" (NGS)

General DNA sequencing
Sanger: 1970s to today; most reliable, but expensive
Next-generation [high-throughput] (NGS):
  Genome Sequencer FLX (GS FLX, by 454 Sequencing)
  Illumina's Solexa Genome Analyzer
  Applied Biosystems SOLiD platform
  others
Key aspect: sequence (and identify) all sequences present

Common features of NGS technologies (1)
Fragment prepared genomic material:
  biological system's RNA molecules: RNA-Seq
  DNA or RNA interaction regions: ChIP-Seq, HITS-CLIP
  others
Sequence these fragments (at least partially)
  produces HUGE data files (~10 million fragments sequenced)

Common features of NGS technologies (2)
Align sequenced fragments with reference sequence
  usually, a known target genome ("GIGO": garbage in, garbage out)
  alignment tools: ELAND, MAQ, SOAP, Bowtie, others
  often done with command-line tools
  still a major computational challenge
Count number of fragments mapping to certain regions
  usually, genes
  these "read counts" linearly approximate target transcript abundance
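To make the alignment step concrete, here is a deliberately naive sketch that looks for exact matches of each read in a reference string. The reference, the reads, and the function name are all hypothetical. The tools named above are far more sophisticated (they index the genome and tolerate mismatches), so this only illustrates the idea of mapping a fragment to a position.

```python
def align_exact(read: str, reference: str) -> list[int]:
    """Return every 0-based start position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue the search one base further on, so overlapping hits are found too.
        start = reference.find(read, start + 1)
    return positions

reference = "ACGTACGTTTGACCA"              # hypothetical reference sequence
reads = ["CGTT", "GACC", "ACGT", "GGGG"]   # hypothetical sequenced fragments

alignments = {read: align_exact(read, reference) for read in reads}
for read, positions in alignments.items():
    print(read, positions)
```

Even this toy case shows two practical complications: a read can map to more than one location ("ACGT"), and a read can fail to map at all ("GGGG").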

Here, RNA-Seq: recall central dogma: DNA → mRNA → protein → action
Quantify [mRNA] transcript abundance:
  Isolate RNA from cells, fragment at random positions, and copy into cDNA
  Attach adapters to ends of cDNA fragments, and bind to flow cell (Illumina has glass slide with 8 such lanes, so can process 8 samples on one slide)
  Amplify cDNA fragments in certain size range (e.g., 200-300 bases) using PCR, giving clusters of same fragment
  Sequence base-by-base for all clusters in parallel
https://www.youtube.com/watch?v=-7gk1hxwcte

(Four figure slides, originally illumina.com downloads)

Then align and map
For sequence at each cluster, compare to [align with] reference genome
  file format:
  millions of clusters per lane
  approx. 1 GB file size per lane
For regions of interest in reference genome (genes, here), count number of clusters mapping there
  requires well-studied and well-documented genome
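The counting step can be sketched as follows, assuming alignment has already produced one start position per mapped read. The gene coordinates, read positions, and function name are hypothetical, and assigning a read by its start position alone is a simplification made for this sketch.

```python
# Hypothetical gene annotation: name -> (start, end), 0-based, end exclusive.
genes = {"geneA": (0, 100), "geneB": (250, 400)}

# Hypothetical alignment start positions, one per mapped read (cluster).
read_starts = [5, 17, 99, 100, 260, 300, 399, 450]

def count_reads(read_starts, genes):
    """Count the reads whose start position falls inside each gene region."""
    counts = {name: 0 for name in genes}
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts

gene_counts = count_reads(read_starts, genes)
print(gene_counts)
```

Reads starting at 100 and 450 fall outside both annotated regions and are not counted, which is one reason the approach depends on a well-documented genome.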

RNA-Seq Example: 8 patients, 56,621 genes
8 heart tissue samples:
  4 control (no heart disease)
  4 cardiomyopathy (heart disease):
    2 restrictive (contracts okay, relaxes abnormally)
    2 dilated (enlarged left ventricle)
These "Naples" data made public Nov 2015 by Institute of Genetics and Biophysics (Naples, Italy)
                 Ctrl_3  RCM_3  Ctrl_4  DCM_4  Ctrl_5  RCM_5  Ctrl_6  DCM_6
ENSG00000000003     308    498     362    554     351    353     220    309
ENSG00000000005       3    164       2     43      13     83      22     16
ENSG00000000419    1187   1249    1096   1303     970    863     637    684
ENSG00000000457     163    239     168    195     153    194      44    117
ENSG00000000460      63    108      83    109      87     43      54     51
ENSG00000000938     369    328     272    669    1216    193     861    292
...
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse71613
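As a sketch of how such a count table might be held in code, the six rows shown above can be parsed into a dictionary and summarized by group. The counts are copied from the slide. The "gene" header label and the function names are additions made for this example, and the snippet uses only the six displayed genes, not the full 56,621.

```python
# Counts copied from the slide; the "gene" header label is added here for parsing.
TABLE = """\
gene Ctrl_3 RCM_3 Ctrl_4 DCM_4 Ctrl_5 RCM_5 Ctrl_6 DCM_6
ENSG00000000003 308 498 362 554 351 353 220 309
ENSG00000000005 3 164 2 43 13 83 22 16
ENSG00000000419 1187 1249 1096 1303 970 863 637 684
ENSG00000000457 163 239 168 195 153 194 44 117
ENSG00000000460 63 108 83 109 87 43 54 51
ENSG00000000938 369 328 272 669 1216 193 861 292
"""

def parse_counts(text):
    """Return {gene: {sample: count}} from a whitespace-delimited table."""
    lines = text.strip().splitlines()
    samples = lines[0].split()[1:]
    counts = {}
    for line in lines[1:]:
        fields = line.split()
        counts[fields[0]] = dict(zip(samples, map(int, fields[1:])))
    return counts

def group_means(row):
    """Mean count in control samples and in cardiomyopathy (RCM or DCM) samples."""
    ctrl = [v for s, v in row.items() if s.startswith("Ctrl")]
    disease = [v for s, v in row.items() if not s.startswith("Ctrl")]
    return sum(ctrl) / len(ctrl), sum(disease) / len(disease)

counts = parse_counts(TABLE)
for gene, row in counts.items():
    ctrl_mean, disease_mean = group_means(row)
    print(f"{gene}  control mean = {ctrl_mean:8.2f}  cardiomyopathy mean = {disease_mean:8.2f}")
```

Raw group means like these are only a first look; they ignore differences in sequencing depth between samples and the variability within each group.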

Common statistical research objectives
Test each gene (row) for differential expression between conditions
  Ctrl vs. non-Ctrl
  Dilated vs. Restrictive
  Restrictive vs. Ctrl
  etc.
Test specific groups of genes (with a known common function) for overall expression differences between conditions
  Which functions are differentially active between Ctrl and non-Ctrl, for example?
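A minimal sketch of a per-gene test for the Ctrl vs. non-Ctrl comparison, using the counts for ENSG00000000005 from the example table. The choice of an exact permutation test on log2(count + 1) values is an assumption made for illustration, not the method used in this course. Analyses of real RNA-Seq data normally rely on dedicated count-based models and adjust for testing many genes at once.

```python
from itertools import combinations
from math import log2
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in mean log2(count + 1)."""
    a = [log2(x + 1) for x in group_a]
    b = [log2(x + 1) for x in group_b]
    pooled = a + b
    observed = abs(mean(b) - mean(a))
    extreme = 0
    total = 0
    # Enumerate every way to relabel len(a) of the pooled samples as "group a".
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        perm_a = [pooled[i] for i in chosen]
        perm_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so relabelings equal to the observed split are counted.
        if abs(mean(perm_b) - mean(perm_a)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# ENSG00000000005 counts from the example table.
ctrl = [3, 2, 13, 22]        # Ctrl_3, Ctrl_4, Ctrl_5, Ctrl_6
non_ctrl = [164, 43, 83, 16] # RCM_3, DCM_4, RCM_5, DCM_6

p_value = permutation_p_value(ctrl, non_ctrl)
print(f"permutation p-value: {p_value:.4f}")
```

With 4 samples per group there are only 70 possible relabelings, so the smallest attainable two-sided p-value is 2/70 (about 0.029). This illustrates how small sample sizes limit what any per-gene test can show.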

A short word on bioinformatic technologies
"Never marry a technology, because it will always leave you." Scott Tingey, Director of Genetic Discovery at DuPont (shared in RW Doerge's introductory overview lecture at 2013 JSM)
In this class, we will discuss only a couple of technologies, emphasizing their recurring statistical issues
  These are perpetual (and compounding)

A Rough Timeline of Technologies
(1995+) Microarrays
  require probes fixed in advance; only set up to detect those
(2005+) Next-Generation Sequencing (NGS)
  typically involves amplification of genomic material (PCR)
(2010+) Third-Generation Sequencing ("next-next-generation")
  Pac Bio, Ion Torrent
  no amplification needed; can sequence single molecule
  longer reads possible; still (2013; 2016) showing high errors
(2012+) Nanopore-Based Sequencing [very promising]
  Oxford Nanopore, Genia, others
  bases identified as whole molecule slips through nanoscale hole (like threading a needle)
  coupled with disposable cartridges; still (2013; 2016) under development
(?+) more
Differ in how sequencing done; subsequent post-alignment statistical analysis basically same
(see 2016 Goodwin et al. paper on Canvas course page, in Files)

Affymetrix Technology: GeneChip
Each gene is represented by a unique set of probe pairs (usually 12-20 probe pairs per probe set)
Each spot on array represents a single probe (with millions of copies)
These probes are fixed to the array
(Image courtesy Affymetrix, www.affymetrix.com)

Affymetrix Technology: Expression
A tissue sample is prepared so that its mRNA has fluorescent tags; wait for hybridization; scan to light tag
(Images courtesy Affymetrix, www.affymetrix.com)

Affymetrix GeneChip
(Image courtesy Affymetrix, www.affymetrix.com)

Cartoon Representations (originally from Affymetrix outreach)
Animation 1: GeneChip structure (1 min.)
Animation 2: Measuring gene expression (2.5 min.)

Images; Affymetrix data is probe intensity
Full Array Image
Close-up of Array Image
(Images courtesy Affymetrix, www.affymetrix.com)

How to analyze data meaningfully?
Consider (for any technology):
  Data quality
  Data distribution
  Data format & organization
  Appropriateness of measurement methods (& variance)
  Sources of variability (and their types)
  Appropriate models to account for sources of variability and address question of interest
  Meaning of P-values and appropriate tests of significance
  Statistical significance vs. biological relevance
  Appropriate and useful representation of results
Many useful tools available from Bioconductor

The Bioconductor Project
Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data
  Not just for RNA-Seq or microarray data
  Like a living family of software packages, changing with needs
Core team mainly at Fred Hutchinson Cancer Research Center, plus many other U.S. and international institutions
Source: www.bioconductor.org

Main Features of the Bioconductor Project
  Use of R
  Documentation and reproducible research
  Statistical and graphical methods
  Annotation
  Short courses
  Open source
  Open development
Source: www.bioconductor.org

What will we do in this class?
  Learn basics of a few major Bioconductor tools
  Focus on statistical issues
  Discuss recent developments
  Learn to discuss all of this