CS-E5870 High-Throughput Bioinformatics Microarray data analysis


Harri Lähdesmäki, Department of Computer Science, Aalto University
September 20, 2016
Acknowledgement: J Salojärvi and E Czeizler for the previous version of the slides

Contents
- Microarray technology
- Microarray gene expression analysis
- Microarray genotyping

Gene expression analysis
- To get started, let's assume we are interested in identifying genes that are turned on or off in different conditions
- For that purpose we want to measure mRNA levels and quantify up/down regulation

Microarray technology
- Based on the principle of hybridization
- A set of gene probes is immobilized on a solid surface
- Probe = single-stranded DNA: a part of the coding region of a specific gene
- Figure from (Butte, 2002)

Microarray technology
- An mRNA sample is taken from a cell population of interest
- Produce cDNAs/cRNAs (complementary copies of mRNAs) and label them with fluorescent dye
- Labeled cDNAs hybridize with complementary DNA molecules on the surface
- Fluorescence signals are detected
- Figure from (Butte, 2002)

Different types of microarrays
- Spotted cDNA microarrays and oligonucleotide microarrays

Classification of microarray technologies
- Classification according to sample preparation:
  - Two-color microarrays: two samples per array, comparative
  - One-color microarrays: one sample per array, quantitative
- Classification according to fabrication technology:
  - Spotted microarrays
    - The probes are oligonucleotides, cDNA or small fragments of PCR products that correspond to mRNAs
    - The probes are first synthesized and then spotted on the array surface
  - In-situ synthesized oligonucleotide microarrays
    - The probes are short oligonucleotide sequences representing a single gene or a family of gene splice variants
    - The probes are synthesized directly onto the array surface
- In this course: Affymetrix microarrays (in-situ, one-color)

Affymetrix arrays
- Up to a few million probe cells, each containing several hundred thousand copies of a specific probe
- Each transcript is measured by a set of different 25-mer probe pairs (a probe set)

Affymetrix arrays
- Each probe pair consists of a perfect match (PM) probe and a mismatch (MM) probe (the MM probe is identical to the PM probe except for the middle base, which is substituted with its complement)
- The PM probe is used to measure specific hybridization
- The MM probe measures non-specific hybridization

Affymetrix arrays
- Probes are selected from the 600 bases most proximal to the 3' end of each transcript
- PM probe = 25 bp probe perfectly complementary to a specific region of a gene
- MM probe = 25 bp probe agreeing with the PM probe apart from the middle base
- Mismatches were designed to capture non-specific hybridization
- Note: mutations can cause specific binding to mismatch probes
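The PM/MM relationship described above can be sketched in a few lines of Python. This is only an illustration: the function name make_mismatch_probe and the 25-mer sequence are made up, not taken from any Affymetrix array design.

```python
# Watson-Crick complement of each DNA base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def make_mismatch_probe(pm_probe: str) -> str:
    """Return the MM probe: the PM probe with its middle base complemented."""
    if len(pm_probe) % 2 == 0:
        raise ValueError("probe length must be odd to have a single middle base")
    mid = len(pm_probe) // 2  # index 12 (the 13th base) for a 25-mer
    return pm_probe[:mid] + COMPLEMENT[pm_probe[mid]] + pm_probe[mid + 1:]


# Hypothetical 25-mer PM probe, made up for illustration.
pm = "ATGCGTACGTTAGCCTAGGCTAACG"
mm = make_mismatch_probe(pm)
```

The two probes differ in exactly one position, which is why a mutation at that position in the sample can make the MM probe bind specifically.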

Contents
- Microarray technology
- Microarray gene expression analysis
- Microarray genotyping

Steps of a microarray experiment
1. Experimental design
2. Biological experiment, collecting samples
3. Extraction of RNA, amplification, labeling, hybridization and washing
4. Scanning and image analysis
5. Quality control
6. Preprocessing and normalization
7. Statistical analysis

Experimental design
- Classical experimental design considerations:
  - Enough replicates (at least 3)
  - Sensible controls (depending on the question of interest)
  - Blocking/randomization when performing treatments

Experimental design: Replication
- Replication = repeating an experiment multiple times
- The purpose of replication is to estimate the variability of outcomes within classes, which can come from biological variation between individuals or from measurement error/noise
- This is needed for testing whether there are statistically significant differences between classes
- Tradeoff: more replicates give you more statistical power, but replication costs time and money
- It is important to do the right kind of replication

Experimental design: Replication
- Biological replication: the same treatment for different individuals
  - Captures population-level variation in the data (differences between individuals)
- Technical replication: repeat the measurement procedure for the same biological sample
  - Estimates measurement-based variation in the data
- Biological variation is interesting, technical variation is (unwanted) noise
- For class comparisons, it is generally recommended to focus on biological replicates, since their outcomes reflect both the variability from population heterogeneity and the variability from measurement errors
- Choosing the best strategy strongly depends on the exact research question and the availability of biological samples

Experimental design: Confounding effects
- Classical example of a confounded experimental design:
  - Question: compare the amount of corn produced by two varieties V1 and V2
  - Plan: put seed V1 on land area A1 and V2 on land area A2
  - This is bad: it is not possible to separate the variety effect from the area effect (e.g., soil fertility, weather, etc.). Which one is the cause of a higher yield?
  - Solution: for A1, put V1 on one half and V2 on the other half; same for A2
- Tools to prevent confounding effects: blocking and randomization
- Blocking and randomization are general experimental design concepts and apply well beyond microarray experiments

Experimental design: Blocking
- Assume we wish to perform an experiment to compare two treatments
- The samples or their treatments may not be homogeneous: there are blocks
  - Subjects: male/female
  - Arrays produced in two lots (February, March)
- If there are systematic differences between the considered classes, then the effects of interest (e.g., treatment) may be confounded
  - It is then unclear whether observed differences are attributable to the treatment effect or to confounding factors
- Local control, or blocking, is the way to minimize the effect of existing (unavoidable?) blocks
- For blocking, use independent input variables whose values are known or can be monitored/controlled: e.g., gender, age, machine used for the measurement, dye used for labeling, etc.

Experimental design: Blocking
- Example: hemoglobin (Hb) concentration of patients with/without iron treatment

  Hb concentration
  Treatment:     145, 150, 153, 130, 127, [...]
  No treatment:  [...], 120, 130, 100, 105, 112

- Gender is known to have an effect on hemoglobin concentration, but the data above do not record the gender for the measurements
- So we cannot be sure which differences in Hb concentrations are due to gender and which are due to treatment
- With blocking:

  Blocking   Treatment          No treatment
  Men        145, 150, 153      [...], 120, 130
  Women      130, 127, [...]    100, 105, 112

Experimental design: Randomization
- Block what you can, randomize what you cannot
- Random allocation of the experimental units across the treatment groups: e.g., randomize which men belong to the treatment group and which men belong to the control group
- Breaks the (unknown) dependencies in the data
- For example, eating habits could also influence Hb concentration, but they may be harder to monitor
- So we hope that randomization produces groups that have similar distributions with respect to the unobservable factors

Sample collection, Hybridization, Scanning
2. Sample collection
  - Cell types in the sample cannot be separated afterwards: use samples that are as pure as possible
  - The fast degradation of mRNA requires proper guidance for the lab personnel: procedures for homogenization, centrifugation, etc.
  - Cell lines, model organisms: standardized growth conditions
3. Hybridization: follow the manufacturer's protocols
4. Scanning and image quantification: the manufacturer has automated and standardized protocols/methods
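The rule "block what you can, randomize what you cannot" from the randomization slide can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the subject identifiers, the block labels and the function name randomize_within_blocks are all hypothetical, and a fixed seed is used only to make the allocation reproducible.

```python
import random


def randomize_within_blocks(subjects, block_of, seed=0):
    """Randomly split the subjects of each block into treatment and control."""
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced
    # Group the subjects by their block (e.g. gender).
    blocks = {}
    for s in subjects:
        blocks.setdefault(block_of[s], []).append(s)
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        # First half of the shuffled block gets the treatment, the rest is control.
        for s in shuffled[:half]:
            assignment[s] = "treatment"
        for s in shuffled[half:]:
            assignment[s] = "control"
    return assignment


# Hypothetical subjects: four men and four women.
block_of = {"m1": "men", "m2": "men", "m3": "men", "m4": "men",
            "w1": "women", "w2": "women", "w3": "women", "w4": "women"}
assignment = randomize_within_blocks(list(block_of), block_of)
```

Every block contributes equally to both groups, so gender cannot be confounded with treatment, while the random shuffle spreads unobserved factors such as eating habits across the groups.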

Quality control of Affymetrix arrays
- Many quality control and normalization methods assume, and thus force, similarity of arrays
- Density plots, boxplots
- Affymetrix quality measures: RNA digestion plots

Density plots, boxplots
- Compare intensity distributions across microarrays: they should look similar
- Each box (density line) shows the distribution of expression values of one array

Pre-processing & Normalization: Robust Multi-array Average (RMA) (Irizarry et al, 2003)
1. Background correction
  - PM data is a combination of background and signal: PM = Signal + Background
  - Background correction is performed on each array separately, assuming $\text{Signal} \sim \text{Exp}(\lambda)$ and $\text{Background} \sim N(\mu, \sigma^2)$
  - Parameters $\lambda$, $\mu$ and $\sigma^2$ are assumed to be shared across PM probes and thus can be estimated from data
  - Consequently, a closed-form solution for the expected value of the background-corrected signal is obtained

Pre-processing & Normalization: Robust Multi-array Average (RMA) (Irizarry et al, 2003)
2. Normalization (across arrays): make probe intensity distributions the same for all arrays
  - Quantile normalization is used to correct for array biases (compares the expression levels between arrays for various quantiles)
3. Summarize probes only after that, using all arrays
- RMA is implemented as the function rma in the Bioconductor package affy
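To make the closed-form background correction concrete: if $\lambda$ is read as the rate of the exponential and the background is not truncated, then under the model above the signal given an observed PM value is normal with mean $a = PM - \mu - \sigma^2 \lambda$ and standard deviation $\sigma$, truncated to positive values, so that $E[\text{Signal} \mid PM] = a + \sigma \, \phi(a/\sigma) / \Phi(a/\sigma)$. The sketch below evaluates this expression. It is an illustration under those assumptions, not the code of the affy package; the parameter values are made up, whereas RMA estimates them from each array.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def background_correct(pm, mu, sigma, lam):
    """Expected signal given the observed PM intensity.

    Assumes Signal ~ Exp(rate=lam) and Background ~ N(mu, sigma^2).
    Note: for PM values far below mu the denominator can underflow to zero.
    """
    a = pm - mu - sigma ** 2 * lam
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)


# Hypothetical parameter values, chosen only for illustration.
mu, sigma, lam = 100.0, 10.0, 0.01
bright = background_correct(1000.0, mu, sigma, lam)  # well above background
dim = background_correct(50.0, mu, sigma, lam)       # below the mean background
```

For bright probes the correction is essentially a subtraction of $\mu + \sigma^2 \lambda$, while for dim probes the corrected value stays positive, which matters because the next step takes logarithms.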

RMA: Quantile normalization
- After background subtraction, take $\log_2(PM)$ of the data
- Then carry out quantile normalization on the probe level:
  - Given n arrays, each quantifying p intensities, form a matrix $X$ of dimension p-by-n, where each array is a column
  - Sort each column of $X$ to get $X_{sort}$
  - Take the mean across the rows of $X_{sort}$
  - Assign this mean to each element in the row to get the quantile-equalized $X_{sort}$
  - Rearrange each column of $X_{sort}$ to have the same ordering as the original matrix $X$ to obtain $X_{normalized}$
- There are on the order of millions of intensities on one array

RMA: Quantile normalization
- Arrays 1 to 3, genes A to D:

  A   5   4   3
  B   2   1   4
  C   3   4   6
  D   4   2   8

- For each column, determine a rank from lowest to highest and assign the numbers i-iv:

  A   iv    iii   i
  B   i     i     ii
  C   ii    iii   iii
  D   iii   ii    iv

- These rank values are set aside to use later
- Go back to the first set of data and rearrange the values so that each column is in order from lowest to highest value (the first column consists of 5, 2, 3, 4 and is rearranged to 2, 3, 4, 5; the second column 4, 1, 4, 2 is rearranged to 1, 2, 4, 4; the third column 3, 4, 6, 8 stays the same because it is already in order)
- The result is:

  A   5   4   3   becomes   A   2   1   3
  B   2   1   4   becomes   B   3   2   4
  C   3   4   6   becomes   C   4   4   6
  D   4   2   8   becomes   D   5   4   8

- Toy example from Wikipedia
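The steps listed above (sort each column, average across rows, put the averages back in the original order) can be sketched in plain Python and applied to the toy matrix whose columns are 5, 2, 3, 4; 4, 1, 4, 2; and 3, 4, 6, 8. The function name quantile_normalize is illustrative. One assumption to note: tied values within a column are ranked in their original order here, which is only one of several possible conventions for handling ties.

```python
def quantile_normalize(x):
    """Quantile-normalize a matrix given as a list of rows (probes x arrays)."""
    n_rows, n_cols = len(x), len(x[0])
    columns = [[x[r][c] for r in range(n_rows)] for c in range(n_cols)]
    # Sort each column, then average across arrays at every rank.
    sorted_cols = [sorted(col) for col in columns]
    row_means = [sum(col[r] for col in sorted_cols) / n_cols for r in range(n_rows)]
    # Put the rank means back in the original order of each column.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c, col in enumerate(columns):
        # Stable sort: tied values keep their original relative order (an assumption).
        order = sorted(range(n_rows), key=lambda r: col[r])
        for rank, r in enumerate(order):
            result[r][c] = row_means[rank]
    return result


# Toy data: rows are genes A to D, columns are arrays 1 to 3.
x = [[5, 4, 3],
     [2, 1, 4],
     [3, 4, 6],
     [4, 2, 8]]
normalized = quantile_normalize(x)
```

After normalization every array contains exactly the same set of values (the rank means), only in a different order, which is what "making the distributions the same" means in practice.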

RMA: Probe set summarization
- Once the probe-level PM values have been background-corrected and normalized, they need to be summarized into expression measures
  - One expression value for each probe set on each array
- The summarization assumes that the observed log-transformed PM values $y_{ijn}$ follow a linear additive model containing a probe affinity effect $\alpha_{jn}$, a gene-specific effect (the expression level) $\mu_{in}$ and an error term $\epsilon_{ijn}$: $y_{ijn} = \mu_{in} + \alpha_{jn} + \epsilon_{ijn}$
  - Indices: i arrays, j probes in probe set, n probe sets
  - For identifiability of the parameters: $\sum_j \alpha_{jn} = 0$
- The estimate of $\mu_{in}$ gives the expression measure for probe set n on array i
- Instead of the standard ML/least squares estimates, the so-called median polish robust estimation procedure is used in practice

Statistical analysis
- Preprocessed, log-transformed gene expression from a microarray experiment is typically considered to be approximately normally distributed
- This naturally motivates the use of various t-tests and other linear models that make the normality assumption
- In principle the t-test is well suited to quantify differential expression, and we demonstrated that in our previous lecture
- But more advanced methods have been proposed: we will briefly describe two such methods
  - Significance analysis of microarrays (SAM) (Tusher et al, 2001)
  - Linear models for microarray data (limma) (Smyth et al)
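The median polish procedure mentioned in the probe set summarization slide can be sketched for a single probe set as follows. This is a plain-Python illustration of Tukey's median polish with a fixed number of sweeps, not the implementation used in the affy package; the function names and the toy data are made up. Note that median polish centers the probe effects so that their median, rather than their sum, is zero.

```python
from statistics import median


def median_polish(y, n_iter=10):
    """Fit y[j][i] = overall + probe_eff[j] + array_eff[i] + residual by median polish.

    y is a list of rows: one row per probe j, one column per array i.
    """
    n_rows, n_cols = len(y), len(y[0])
    resid = [list(row) for row in y]
    overall = 0.0
    probe_eff = [0.0] * n_rows
    array_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep out the row (probe) medians.
        for j in range(n_rows):
            m = median(resid[j])
            resid[j] = [v - m for v in resid[j]]
            probe_eff[j] += m
        m = median(array_eff)
        array_eff = [c - m for c in array_eff]
        overall += m
        # Sweep out the column (array) medians.
        for i in range(n_cols):
            m = median(resid[j][i] for j in range(n_rows))
            for j in range(n_rows):
                resid[j][i] -= m
            array_eff[i] += m
        m = median(probe_eff)
        probe_eff = [p - m for p in probe_eff]
        overall += m
    return overall, probe_eff, array_eff, resid


def summarize_probe_set(y):
    """Expression measure per array: overall level plus the array effect."""
    overall, _, array_eff, _ = median_polish(y)
    return [overall + a for a in array_eff]


# Toy probe set: 3 probes (rows) on 3 arrays (columns), built from
# expression levels 5, 7, 6 and probe affinities -1, 0, 1 without noise.
y = [[4.0, 6.0, 5.0],
     [5.0, 7.0, 6.0],
     [6.0, 8.0, 7.0]]
expression = summarize_probe_set(y)
```

Because medians are used instead of means, a single outlying probe intensity has little influence on the resulting expression measures.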

Significance analysis of microarrays (SAM)
- Instead of the standard t-statistic, SAM uses the following score for the ith gene: $d_i = \frac{\bar{x}_i - \bar{y}_i}{s_i + s_0}$, where $\bar{x}_i$ and $\bar{y}_i$ are the mean expression levels of gene i in the two groups, $s_i$ is the gene-specific standard error of the difference, and $s_0$ is a so-called interchangeability factor that helps regularize the variance estimate when $s_i$ is small and helps share information across all genes
- The implementation of SAM comes with an extensive permutation strategy to estimate the false discovery rate (we skip the details)

Linear models for microarray data (limma)
- A key concept of limma is that, for each gene, it fits a linear model to the data
- limma can be applied to quantify differential expression between two groups, but it also supports many complex experiments
- limma uses empirical Bayes and other shrinkage methods to borrow information across genes and thus makes the analyses stable even for experiments with a very small number of arrays/experiments
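As a rough illustration of the SAM score, the sketch below computes $d_i$ for one gene, taking $s_i$ to be the pooled standard error of the ordinary two-sample t-statistic. The function name and the data are made up, and $s_0$ is simply passed in as a number here, whereas SAM chooses it from the data of all genes.

```python
import math


def sam_score(x, y, s0):
    """SAM-style score d = (mean(x) - mean(y)) / (s + s0) for one gene.

    s is the pooled standard error of the difference in means;
    s0 is supplied by the caller (an assumption of this sketch).
    """
    n1, n2 = len(x), len(y)
    mean_x, mean_y = sum(x) / n1, sum(y) / n2
    ss = sum((v - mean_x) ** 2 for v in x) + sum((v - mean_y) ** 2 for v in y)
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (mean_x - mean_y) / (s + s0)


# Hypothetical log-expression values of one gene in two groups.
group1 = [1.0, 2.0, 3.0]
group2 = [4.0, 5.0, 6.0]
d_plain = sam_score(group1, group2, s0=0.0)  # equals the ordinary t-statistic
d_sam = sam_score(group1, group2, s0=0.5)    # shrunk towards zero
```

With $s_0 = 0$ the score reduces to the standard t-statistic; a positive $s_0$ prevents genes with a tiny estimated variance from getting extreme scores.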

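Linear models for microarray data (limma)
- limma assumes a linear model $y_j = X \alpha_j + \epsilon_j$, where $y_j$ contains the expression values for the jth gene across experiments, $X$ is the design matrix, $\alpha_j$ contains the linear model parameters, and $\epsilon_j$ is a normally distributed noise term
- An example of a design matrix for the two-class comparison we considered before looks like this (for two replicates per class, with the noise term omitted):
  $\begin{pmatrix} y_{j1} \\ y_{j2} \\ y_{j3} \\ y_{j4} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \alpha_{j1} \\ \alpha_{j2} \end{pmatrix}$

Linear models for microarray data (limma)
- The contrasts of interest are given by $\beta_j = C^T \alpha_j$, where $C$ is the contrasts matrix
- For the two-class comparison, the contrasts matrix for differential expression between the two classes is $C^T = \begin{pmatrix} -1 & 1 \end{pmatrix}$, i.e. $\beta_j = \alpha_{j2} - \alpha_{j1}$ (the choice of sign only determines the direction in which the change is reported)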
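The linear model and contrast above can be illustrated for a single gene with a small least-squares fit. This sketch assumes a group-means design with two replicates per class and a contrast that takes the difference between the two class means; the function name, the design, the contrast sign and the expression values are assumptions made for the example, and limma itself is an R package with its own interface.

```python
def fit_two_parameter_model(X, y):
    """Least-squares estimate of alpha in y = X alpha + noise for a two-column X.

    Solves the normal equations (X^T X) alpha = X^T y with Cramer's rule.
    """
    a11 = sum(r[0] * r[0] for r in X)
    a12 = sum(r[0] * r[1] for r in X)
    a22 = sum(r[1] * r[1] for r in X)
    b1 = sum(r[0] * v for r, v in zip(X, y))
    b2 = sum(r[1] * v for r, v in zip(X, y))
    det = a11 * a22 - a12 * a12
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


# Assumed design: arrays 1-2 belong to class 1, arrays 3-4 to class 2.
X = [[1, 0],
     [1, 0],
     [0, 1],
     [0, 1]]
C = [-1, 1]                    # assumed contrast: class 2 minus class 1
y = [7.9, 8.1, 10.0, 10.2]     # hypothetical log2 expression values of one gene

alpha = fit_two_parameter_model(X, y)           # the two class means
beta = sum(c * a for c, a in zip(C, alpha))     # estimated log2 fold change
```

With this design the fitted parameters are simply the class means, and the contrast is their difference, i.e. the log2 fold change between the classes.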

Linear models for microarray data (limma)
- The empirical Bayes estimate of the variance is $\tilde{s}_j^2 = \frac{f_0 s_0^2 + f_j s_j^2}{f_0 + f_j}$, where $f_0$ and $f_j$ are the prior and residual degrees of freedom for the jth gene, $s_0^2$ is the prior variance and $s_j^2$ is the residual sample variance of the jth gene
- The moderated t-statistic is then $\tilde{t}_j = \frac{\beta_j}{u_j \sqrt{\tilde{s}_j^2}}$, where $u_j$ is the unscaled standard deviation of the estimated contrast
- The additional degrees of freedom $f_0$ represent the extra information which is borrowed from the ensemble of genes for inference about each individual gene

Exploring and visualizing differential expression results
- A volcano plot is a scatter plot which shows the significance of a test (y-axis) as a function of the average difference between groups (x-axis)
- For example:
  - y-axis: $-\log_{10}(p)$ (p stands for the p-value of a test)
  - x-axis: $\log_2(\text{fold change})$
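The moderated t-statistic defined in the limma slide above can be sketched for a two-group comparison of one gene. This is a hand-rolled illustration, not the limma implementation: limma estimates the prior degrees of freedom $f_0$ and the prior variance $s_0^2$ from all genes, whereas here they are passed in as made-up numbers, and the function name and data are hypothetical.

```python
import math


def moderated_t(x, y, f0, s0_sq):
    """Moderated t-statistic for one gene in a two-group comparison.

    f0 and s0_sq (prior degrees of freedom and prior variance) are inputs
    here; estimating them from all genes is outside this sketch.
    """
    n1, n2 = len(x), len(y)
    mean_x, mean_y = sum(x) / n1, sum(y) / n2
    beta = mean_y - mean_x                      # contrast: group 2 minus group 1
    f_j = n1 + n2 - 2                           # residual degrees of freedom
    s_sq = (sum((v - mean_x) ** 2 for v in x)
            + sum((v - mean_y) ** 2 for v in y)) / f_j
    u = math.sqrt(1.0 / n1 + 1.0 / n2)          # unscaled standard deviation
    s_tilde_sq = (f0 * s0_sq + f_j * s_sq) / (f0 + f_j)
    return beta / (u * math.sqrt(s_tilde_sq))


# Hypothetical log-expression values of one gene in two groups.
group1 = [1.0, 2.0, 3.0]
group2 = [4.0, 5.0, 6.0]
t_ordinary = moderated_t(group1, group2, f0=0.0, s0_sq=1.0)   # no prior information
t_moderated = moderated_t(group1, group2, f0=4.0, s0_sq=4.0)  # variance pulled towards 4
```

With $f_0 = 0$ the statistic equals the ordinary t-statistic; the larger $f_0$ is, the more the gene-specific variance is pulled towards the prior value $s_0^2$.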

Exploring and visualizing differential expression results
- A comparison of the standard t-test and limma results on a spike-in data set using a volcano plot

Contents
- Microarray technology
- Microarray gene expression analysis
- Microarray genotyping

Single-nucleotide polymorphism
- A single-nucleotide polymorphism (SNP) is a variation in a single nucleotide at a specific position in the genome
- The variation is expected to be present to some degree within a population (e.g., > 1%)
- E.g.: at a specific base position in the human genome, the base C may appear in most individuals, but in a minority of individuals the position is occupied by base A
- Such a variation defines a SNP at this specific position
- SNPs underlie differences in our genetic susceptibility to diseases

Single-nucleotide polymorphism
- An illustration of a SNP and types of SNPs (figure)

Microarray genotyping
- Affymetrix GeneChips can also be employed to genotype SNPs
- The figure (taken from (Schwender, 2007)) illustrates that each SNP is typically represented by a set of ten probe quartets, each consisting of one PM and one MM probe for each sequence alternative

Microarray genotyping
- RLMM (Robust Linear Model with Mahalanobis distance) quantifies the genotype using the RMA-processed signals from microarrays
- The signals corresponding to the two variants A and B, $y_A$ and $y_B$, are preprocessed as in the case of gene expression quantification
- Two allele-specific intensity signals are computed for each SNP: $\beta = (\beta_A, \beta_B)^T = \left( \frac{y_A}{y_A + y_B}, \frac{y_B}{y_A + y_B} \right)^T$
- The genotype $c \in \{AA, AB, BB\}$ is inferred by minimizing the squared Mahalanobis distance $d_c^2 = (\beta - \mu_c)^T S_c^{-1} (\beta - \mu_c)$, where the mean $\mu_c$ and covariance $S_c$ for all c are estimated from training data with known genotypes
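The genotype call by minimum Mahalanobis distance can be sketched as follows. The class means and covariances below are made-up numbers chosen only so that the example runs; they are not estimates from any data set, and the function names are hypothetical. The 2-by-2 covariance matrices are inverted by hand to keep the example self-contained.

```python
def mahalanobis_sq(beta, mu, S):
    """Squared Mahalanobis distance (beta - mu)^T S^{-1} (beta - mu) in two dimensions."""
    dx, dy = beta[0] - mu[0], beta[1] - mu[1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    # Inverse of a 2-by-2 matrix.
    inv = [[S[1][1] / det, -S[0][1] / det],
           [-S[1][0] / det, S[0][0] / det]]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))


def call_genotype(beta, params):
    """Return the genotype whose class is closest to beta in Mahalanobis distance."""
    return min(params, key=lambda c: mahalanobis_sq(beta, params[c][0], params[c][1]))


# Hypothetical class parameters (mean, covariance) for one SNP.
params = {
    "AA": ((0.9, 0.1), [[0.010, 0.0], [0.0, 0.010]]),
    "AB": ((0.5, 0.5), [[0.020, 0.0], [0.0, 0.020]]),
    "BB": ((0.1, 0.9), [[0.010, 0.0], [0.0, 0.010]]),
}
genotype = call_genotype((0.85, 0.15), params)
```

Using class-specific covariances means that a tight cluster penalizes deviations more strongly than a diffuse one, which plain Euclidean distance would not do.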

References
- Butte A. The use and analysis of microarray data. Nat Rev Drug Discov 2002 Dec;1(12):
- Irizarry RA et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2):249-64, 2003
- Schwender H. Statistical analysis of genotype and gene expression data. PhD thesis, University of Dortmund, 2007
- Smyth GK, Ritchie M, Thorne N, Wettenhall J and Shi W. limma: Linear Models for Microarray Data User's Guide
- Tusher VG, Tibshirani R and Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001 Apr 24; 98(9):