Bioinformatics and Genomics: A New SP Frontier?

Similar documents
Data Mining For Genomics

Data Mining For Genomics

Microarray Technique. Some background. M. Nath

DNA Arrays Affymetrix GeneChip System

3.1.4 DNA Microarray Technology

Recent technology allow production of microarrays composed of 70-mers (essentially a hybrid of the two techniques)

Introduction to BioMEMS & Medical Microdevices DNA Microarrays and Lab-on-a-Chip Methods

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

Lecture #1. Introduction to microarray technology

Gene Expression Technology

Bioinformatics III Structural Bioinformatics and Genome Analysis. PART II: Genome Analysis. Chapter 7. DNA Microarrays

SIMS2003. Instructors:Rus Yukhananov, Alex Loguinov BWH, Harvard Medical School. Introduction to Microarray Technology.

6. GENE EXPRESSION ANALYSIS MICROARRAYS

Methods of Biomaterials Testing Lesson 3-5. Biochemical Methods - Molecular Biology -

Enhancers mutations that make the original mutant phenotype more extreme. Suppressors mutations that make the original mutant phenotype less extreme

DNA Microarray Technology

Introduction to gene expression microarray data analysis

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

Gene Discovery Using Pareto Depth Sampling Distributions

DNA Microarray Data Oligonucleotide Arrays

Soybean Microarrays. An Introduction. By Steve Clough. November Common Microarray platforms

Analysis of Microarray Data

Outline. Array platform considerations: Comparison between the technologies available in microarrays

Introduction to Bioinformatics and Gene Expression Technologies

Please purchase PDFcamp Printer on to remove this watermark. DNA microarray

Introduction to Bioinformatics. Fabian Hoti 6.10.

Exploration and Analysis of DNA Microarray Data

Data Mining and Applications in Genomics

Data Sheet. GeneChip Human Genome U133 Arrays

Technical Review. Real time PCR

Gene Regulation Solutions. Microarrays and Next-Generation Sequencing

Using Low Input of Poly (A) + RNA and Total RNA for Oligonucleotide Microarrays Application

Microarrays: since we use probes we obviously must know the sequences we are looking at!

Microarray Gene Expression Analysis at CNIO

Reliable classification of two-class cancer data using evolutionary algorithms

Multi-objective optimization

Calculation of Spot Reliability Evaluation Scores (SRED) for DNA Microarray Data

Identifying Candidate Informative Genes for Biomarker Prediction of Liver Cancer

measuring gene expression December 5, 2017

DNA Microarrays Introduction Part 2. Todd Lowe BME/BIO 210 April 11, 2007

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

less sensitive than RNA-seq but more robust analysis pipelines expensive but quantitiatve standard but typically not high throughput

Goals of pharmacogenomics

Engineering Genetic Circuits

CS 5984: Application of Basic Clustering Algorithms to Find Expression Modules in Cancer

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

Gene Signal Estimates from Exon Arrays

Outline. Analysis of Microarray Data. Most important design question. General experimental issues

3. human genomics clone genes associated with genetic disorders. 4. many projects generate ordered clones that cover genome

Analysis of Microarray Data

CHAPTER 21 LECTURE SLIDES

Lecture 14 - PCR Applications and Lab Practicum (AMG text pp ) October 9, 2001

Bacterial DNA replication

Bioinformatics Support of Genome Sequencing Projects. Seminar in biology

Preprocessing Affymetrix GeneChip Data. Affymetrix GeneChip Design. Terminology TGTGATGGTGGGGAATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT

Predicting Microarray Signals by Physical Modeling. Josh Deutsch. University of California. Santa Cruz

Introduction to Molecular Biology

Classification and Learning Using Genetic Algorithms

Baum-Welch and HMM applications. November 16, 2017

Typical probes. Slides per pack Aminosilane. Long oligo- Slide AStar None D surface. nucleotides

EECS730: Introduction to Bioinformatics

Class Information. Introduction to Genome Biology and Microarray Technology. Biostatistics Rafael A. Irizarry. Lecture 1

Bio Rad PCR Song Lyrics

American Society of Cytopathology Core Curriculum in Molecular Biology

Chapter 15 Gene Technologies and Human Applications

Multiple-Criteria Gene Screening in Microarray Experiments

Quality Measures for CytoChip Microarrays

Universal Gene Expression Analysis with Combinatorial Arrays

Humboldt Universität zu Berlin. Grundlagen der Bioinformatik SS Microarrays. Lecture

Feature Selection of Gene Expression Data for Cancer Classification: A Review

COMPUTATIONAL ANALYSIS OF MICROARRAY DATA

Applications and Uses. (adapted from Roche RealTime PCR Application Manual)

Quantitative Real Time PCR USING SYBR GREEN

Designing Complex Omics Experiments

Discovery of Transcription Factor Binding Sites with Deep Convolutional Neural Networks

Random Forests. Parametrization and Dynamic Induction

Fatchiyah

Introduction To Real-Time Quantitative PCR (qpcr)

Péter Antal Ádám Arany Bence Bolgár András Gézsi Gergely Hajós Gábor Hullám Péter Marx András Millinghoffer László Poppe Péter Sárközy BIOINFORMATICS

Spotted DNA Array Design. Todd Lowe Bio 210 Jan 13 & 15, 2003

INTRODUCTION TO REVERSE TRANSCRIPTION PCR (RT-PCR) ABCF 2016 BecA-ILRI Hub, Nairobi 21 st September 2016 Roger Pelle Principal Scientist

Agilent Genomic Workbench 7.0

Polymerase Chain Reaction (PCR) and Its Applications

Midterm 1 Results. Midterm 1 Akey/ Fields Median Number of Students. Exam Score

Analysis of a Tiling Regulation Study in Partek Genomics Suite 6.6

Genome-wide association studies (GWAS) Part 1

Measuring transcriptomes with RNA-Seq. BMI/CS 776 Spring 2016 Anthony Gitter

Expression Array System

Molecular Cell Biology - Problem Drill 11: Recombinant DNA

Multiple choice questions (numbers in brackets indicate the number of correct answers)

Functional Bioinformatics of Microarray Data: From Expression to Regulation

Introduction to Real-Time PCR: Basic Principles and Chemistries

BENG 183 Trey Ideker. Genome Assembly and Physical Mapping

7.06 Cell Biology QUIZ #3

European Union Reference Laboratory for Genetically Modified Food and Feed (EURL GMFF)

The Polymerase Chain Reaction. Chapter 6: Background

Measuring transcriptomes with RNA-Seq

Molecular Biology Primer. CptS 580, Computational Genomics, Spring 09

A Greedy Algorithm for Minimizing the Number of Primers in Multiple PCR Experiments

qpcr Quantitative PCR or Real-time PCR Gives a measurement of PCR product at end of each cycle real time

Transcription:

Bioinformatics and Genomics: A New SP Frontier? A. O. Hero University of Michigan - Ann Arbor http://www.eecs.umich.edu/ hero Collaborators: G. Fleury, ESE - Paris S. Yoshida, A. Swaroop UM - Ann Arbor T. Carter, C. Barlow Salk - San Diego 1. Bioinformatics background 2. Gene microarrays Outline 3. Gene clustering and filtering for gene pattern extraction 4. Application: development and aging in retina 1

Figure 1: http://www.accessexcellence.org 2

I. Bioinformatics background Every human cell contains 6 feet of double stranded (ds) DNA This DNA has 3,000,000,000 basepairs representing 50,000-100,000 genes This DNA contains our complete genetic code or genome DNA regulates all cell functions including response to disease, aging and development Gene expression pattern: snapshot of DNA in a cell Gene expression profile: DNA mutation or polymorphism over time Genetic pathways: changes in genetic code accompanying metabolic and functional changes, e.g. disease or aging. Genomics: study of gene expression patterns in a cell or organism 3

Possible Impact Understanding role of genetics in cell function and metabolism Discovering genetic markers and pathways for different diseases Understanding pathogen mechanisms and toxicology studies Development of genotype-specific drugs Development of genetic computing machines In situ genetic monitoring and drug delivery 4

Kellog Sensory Gene Microarray Node: Objectives Establish genetic basis for development, aging, and disease in the retina time 0 months 6 months 12 months 18 months Figure 2: Sample gene trajectories over time. 5

II. Gene Microarrays Two kinds of Shotgun sequencing: 1. GeneChip Oligonucleotide Microarrays (Affymetrix) 2. cdna Microarrays (Stanford) cell cultures microarray transcription image processing gene filtering validation Figure 3: Microarray experiment cycle. 6

Microarray Image Formation Target cdna Microarray Laser scanner PMT CRT Probe cdna Figure 4: Image formation process. 7

Figure 5: Oligonucleotide PM/MM layout (pathbox.wustl.edu). 8

200 400 600 800 1000 1200 1400 1600 200 400 600 800 1000 1200 1400 1600 Figure 6: Affymetrix GeneChip microarray. 9

50 100 150 200 250 300 350 20 40 60 80 100 120 140 160 180 200 220 Figure 7: cdna spotted array. 10

Control Factors Influencing Variability Sample preparation: reagent quality, temperature variations Slide manufacture: slide surface quality, dust deposition Hybridization: sample concentration, wash conditions Cross hybridization: similar but different genes bind to same probe Image formation: scanner saturation, lens aberations, gain settings Imaging and Extraction: spot misalignment, discretization, clutter account for data variability! Scaling factors: universal intensity amplification factor for a chip Raw Q: noise and other random variations of a chip Background: avg of lowest 2% cell intensity values %P: percentage of transcripts present 11

Microarray Signal Extraction 100 200 300 400 500 600 700 800 100 200 300 400 500 600 700 800 900 1000 Figure 8: Blowup of cdna spotted array. 12

Weak spot (7,13) 0.3 10 20 30 0.4 40 50 60 0.5 10 20 30 40 50 60 60 40 20 20 40 60 Figure 9: Weak Spot. 13

Normal spot (4,4) 0.4 0.3 10 20 30 40 50 60 0.2 0.1 0 0.1 0.2 0.3 0.4 0.5 10 20 30 40 50 60 60 40 20 20 40 60 Figure 10: Normal spot. 14

Saturated spot (8,6) 0.5 10 20 30 0 40 50 60 0.5 10 20 30 40 50 60 60 40 20 20 40 60 Figure 11: Saturated spot. 15

Morphological Spot Segmentation (Siddiqui&Hero:ICIP02) Figure 12: (L) Original cdna microarray image. (R) after alternating sequential filtering. 16

Figure 13: (L) Final segmentation. (R) Spot watershed domains for noise averaging. 17

Model-based Signal Extraction (Hero:Springer02) λ(x,y) dn(x,y) + h(x,y) g( ) + i(x,y) λ (x,y) o w(x,y) Figure 14: Filtered Poisson model for microarray image. 18

Gabor Superposition - Width MSE RDB1 RDB2 MSE 10 1 10 0 lambda Figure 15: Distortion-rate MSE lower bounds on Gabor widths of Φ j (x;y). 19

Gene Clustering and Filtering (Fleury&etal:ICASSP02) Figure 16: Clustering on the Data Cube. Objective: Classify time trajectory of gene i into one of K classes 20

Gene Trajectory Classification level level level gene j gene i gene i gene j Pn2 M16 M21 Figure 17: Gene i is old dominant while gene j is young dominant Objective: classify gene trajectories from sequence of microarray experiments over time (t) and population (m) θ i (m;t); m = 1;:::;M; t = 1;:::;T 21

Principal approaches: Clustering and filtering Methods Hierarchical clustering (kdb trees, CART, gene shaving) K-means clustering Self organizing (Kohonen) maps Vector support machines Validation approaches: Significance analysis of microarrays (SAM) Bootstrapping cluster analysis Leave-one-out cross-validation Replication (additional gene chip experiments, quantitative PCR) 22

Gene Filtering via Multiobjective Optimization Gene selection criteria for i-th gene ξ 1 (θ i ;:::; ξ ); P (θ i ) Possible ξ p (θ i ) s for finding uncommon genes Squared mean change from t = 1tot = T : ξ 1 (θ i = jθ ) i θ (Λ;1) i )j (Λ;T 2 Standard deviation at t = 1: ξ 2 (θ i ) = θi θ (Λ;1) i (Λ;1) Standard deviation at = t T: ξ 3 (θ i ) = θi (Λ;T ) θ i (Λ;T ) 2 2 23

Some possible scalar functions: t-test statistic (Goss etal 2000): T i = ξ 1 (θ i ) 1 2 ξ 2(θ i )+ 1 2 ξ 3(θ i ) T i 1+T i R 2 statistic (Hastie etal 2000): R 2 i = p ξ 1 (θ i ) ξ2 (θ i )ξ 3 (θ i ) H statistic (Sinha etal 1998): H i = Objective: find genes which maximize or minimize the selection criteria 24

Aggregated Criteria Let fw p g P p=1 be experimenter s cost preference pattern P W p = 1; W i 0 p=1 Find optimal gene via: max i P W p ξ p (θ i ); or max i p=1 P p=1 (ξ p (θ i )) W p Q. What are the set of optimal genes for all preference patterns? A. These are non-dominated genes (Pareto optimal) Defn: Genei is dominated if there is a j 6= i s.t. ξ p (θ i )» ξ p (θ j ); p = 1;:::;P 25

Pareto Optimal Fronts ξ 2 B ξ 2 A C D ξ ξ 1 1 Figure 18: a). Non-dominated property, and b). Pareto optimal fronts, in dual criteria plane. 26

Pareto Gene Filtering vs. Paired T-test 10 6 α = 0.1 α = 0.5 DISTANCE BETWEEN CLASSES 10 4 10 2 10 0 10 2 10 4 10 1 10 2 10 3 10 4 10 5 10 6 10 7 DISTANCE INSIDE CLASSES Figure 19: ξ 1 = mean change vs ξ 2 = pooled standard deviation for 8826 mouse retina genes. Superimposed are T-test boundaries 27

10 6 DISTANCE BETWEEN CLASSES 10 4 10 2 10 0 10 2 10 4 10 1 10 2 10 3 DISTANCE 10 4 INSIDE 10 5 CLASSES 10 6 10 7 Figure 20: First (circle) second (square) and third (hexagon) Pareto optimal fronts. 28

Application: Development and Aging in Mouse Retina Mouse Retina Experiment: Retinas of 24 mice sampled and hybridized 6 time points: Pn2, Pn10, M2, M6, M16, M21 4 mice per time sample Affymetrix GeneChip layout with 12422 poly-nucleotides Affymetrix attribute analyzed: AvgDiff Used Affymetrix filter to eliminate all genes labeled A Objective: Find interesting gene trajectories within the set of remaining 8826 genes 29

Figure 21: 4 candidate gene profiles from Mus musculus 5 0 86632) end cdna (Unigene 30

Multi-objective Non-parametric Pareto Filtering Define trend vector: ψ i = [b 1 ;:::;b 6 ], b i 2 f0;1g Old dominant filtering criteria: high mean slope from t = Pn1tot = M21 ξ 1 (ψ i ) = b i (Λ;Λ) high consistency over 6 4 = 4096 possible combinations of trajectories ξ 2 (ψ # trajectories having ψ i = [1;:::;1] i ) = 4096 31

Occurence Histogram 3000 2500 number of valid trajectories 2000 1500 1000 500 0 1 2 3 4 5 6 7 8 9 10 gene name x 10 4 Figure 22: Monotonicity occurrence histogram with threshold. 32

Old Dominant Pareto Fronts Old Dominant Fronts Valid Trajectories Mean Slope Figure 23: Pareto fronts for old dominant genes. 33

Resistant Old Dominant Genes in first Three Fronts Leave-one-out cross validation Let ψ i m denote one possible set of T 1) =6 (M 3 samples Cross-validation Algorithm: Do m = 1;:::;4 6 : Compute ξ 1 (ψ m ); i ξ 2 (ψ i m ) Find Genes in First 3 Pareto fronts: G m End 34

Three-objective Pareto Filtering Objective Extract aging genes Strictly increasing filtering criteria: persistent positive trend ξ 1 (ψ i = min ) b i (Λ;t) = max t high consistency over 6 4 = 4096 possible combinations of trajectories no plateau ξ 2 (ψ # trajectories having ψ i = [1;:::;1] i ) == 4096 = max ξ 3 (θ i ) = [θ i (Λ;t + 1) 2θ i (Λ;t)+θ i (Λ;t 1)] 2 = min 35

Pareto Fronts Figure 24: First global Pareto front (o) for the three criteria (ξ 1, ξ 2 and ξ 3 ). 36

Conclusions 1. Signal processing has a role to play in many aspects of genomics 2. Careful physical modeling of image formation process can yield performance gains 3. New methods of data mining are needed to perform robust and flexible gene filtering 4. Cross-validation is needed to account for statistical sampling uncertainty 5. Joint intensity extraction and gene filtering? 6. Optimization algorithms for large data sets? 7. Genetic priors: phylogenetic trees, BLAST database, etc? 37