Introduction to Microarray Technique, Data Analysis, Databases Maryam Abedi PhD student of Medical Genetics

1 Introduction to Microarray Technique, Data Analysis, Databases Maryam Abedi PhD student of Medical Genetics

2 Outline: Technology; Basic concepts; Data analysis; Printed Microarrays; In Situ-Synthesized Oligonucleotide; High-Density Bead Arrays; Databases and software

3 What is a microarray? A microarray is a very large set of probes attached to a solid support, to which biological material (the target) is hybridized. Ref: GeneChip Microarray Curriculum 2005 Version Microarray / Technology

4 How does it work? Ref: GeneChip Microarray Curriculum 2005 Version

5 Two-color microarrays or two-channel microarrays Ref:

6 Single-channel microarrays or one-color microarrays Ref:

7 One-channel vs. two-channel detection
I. Advantage of the one-color system: an aberrant sample cannot affect the raw data derived from other samples.
II. Advantage of the one-color system: data are more easily compared with arrays from different experiments, so long as batch effects have been accounted for.
III. Drawback of the one-color system: compared with the two-color system, twice as many microarrays are needed to compare samples within an experiment.

8 Some applications: gene expression profiling; SNP detection; clinical microbiology; drug discovery

10 Workflow summary of printed microarrays Ref: Melissa B. Miller, 2009

11 Affymetrix GeneChip oligonucleotide microarray Ref: Melissa B. Miller, 2009

12 High-Density Bead Arrays Ref:

14 Basic Data Analysis Steps: Normalization: The process of removing (or minimizing) nonbiological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected.

15 There are four types of internal controls that can be used for normalization:
1. Housekeeping genes
2. Random cDNA sequences
3. cDNA sequences from an unrelated organism
4. Positive spike-in controls

16 Normalization Methods:
1. Total intensity normalization
2. Median centering
3. Quantile normalization
4. Lowess normalization

17 Basic Data Analysis Steps: Hypothesis-driven statistical analysis: Statistically significant changes in gene expression are commonly identified by applying the t-test or ANOVA to microarray data sets.

18 Basic Data Analysis Steps: Hypothesis-driven statistical analysis: Microarrays have a multiple-comparison problem. A threshold of p <= 0.05 means that a gene with no true change in expression still has a 5% chance of being called significant. If 10,000 genes are tested, 5% of them, or about 500 genes, might be called significant by chance alone!
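The arithmetic on this slide can be checked with a minimal Python sketch. It assumes that no gene is truly differentially expressed and, for the second function, that the tests are independent; the function names are illustrative and do not come from the slides.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of unchanged genes called significant at threshold alpha."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance of one or more false positives among n independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_genes = 10_000  # number of genes tested, as on the slide
alpha = 0.05      # per-gene significance threshold

print("Expected false positives:", expected_false_positives(n_genes, alpha))
print("P(at least one false positive in 20 tests):",
      round(prob_at_least_one_false_positive(20, alpha), 3))
```

Even 20 tests make a false positive more likely than not, which is why the per-gene p-values are adjusted before genes are called significant.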

19 Basic Data Analysis Steps: Hypothesis-driven statistical analysis: Adjusted p-value (adj. p-value): four types of multiple testing corrections, listed from the most stringent. Ref: Multiple Testing Corrections/Agilent Technologies, Inc. 2005

20 Basic Data Analysis Steps: Hypothesis-driven statistical analysis: Fold Change (FC): Fold change is often used in the analysis of microarray gene expression data to measure the change in the expression level of a gene. FC = Test / Control

21 Basic Data Analysis Steps: Hypothesis-driven statistical analysis: Fold Change (FC): Although ratios provide an intuitive measure of expression changes, they have the disadvantage of treating up- and down-regulated genes differently. Up-regulated gene: T = 4/2 = 2. Down-regulated gene: T = 2/4 = 0.5. Down-regulated ratios are compressed between 0 and 1, whereas up-regulated ratios range from 1 to +∞.

22 Basic Data Analysis Steps: Hypothesis-driven statistical analysis: Log2 FC: the log2 transformation produces a continuous spectrum of values and treats up- and down-regulated genes in a similar fashion. log2(ratio): log2(2) = 1, log2(1/2) = -1. Down-regulated genes range from -∞ to 0 and up-regulated genes from 0 to +∞.
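The fold-change slides can be illustrated with a short Python sketch that uses the same intensities as the example (4 vs. 2 and 2 vs. 4). The function names are illustrative; the ratio is taken as test over control, as in the example.

```python
import math


def fold_change(test: float, control: float) -> float:
    """Ratio of the test intensity to the control intensity."""
    return test / control


def log2_fold_change(test: float, control: float) -> float:
    """log2 of the test/control ratio; symmetric around 0."""
    return math.log2(test / control)


# (test, control) intensity pairs taken from the slide example
for test, control in [(4.0, 2.0), (2.0, 4.0)]:
    print(f"test={test} control={control} "
          f"FC={fold_change(test, control)} "
          f"log2FC={log2_fold_change(test, control)}")
```

A two-fold increase and a two-fold decrease give ratios of 2 and 0.5, but log2 values of 1 and -1, so both directions are the same distance from 0.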

23 Basic Data Analysis Steps: Analysis questions: Bioinformatics (pathway analysis): in which pathways do our selected genes play a role? Class discovery (clustering): within the tumor samples, are there subgroups that have a specific expression profile?

25 Gene Expression Omnibus (GEO): a database that stores high-throughput functional genomic data. GEO Profiles: this database stores individual gene expression profiles. GEO DataSets: this database stores curated gene expression DataSets. Microarray / Database

26 Some popular databases for pathway analysis: Database for Annotation, Visualization and Integrated Discovery (DAVID); Search Tool for the Retrieval of Interacting Genes (STRING)

27 Popular software for microarray analysis: R / Bioconductor. Commands: source("…") followed by biocLite(). SUMO: a Statistical Utility for Microarray and Omics data. Microarray / Software

28 Thanks for your attention

29 Sources of Non-Biological Variation: dye bias (differences in heat and light sensitivity and in the efficiency of dye incorporation); differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment; variation across replicate slides; variation across hybridization conditions; variation in scanning conditions.

30 STRING is a database of known and predicted protein interactions

31 Normalization Methods: Median centering. A simple example. Table columns: Gene, Slide1Cy3, Slide1Cy5, Slide2Cy3, Slide2Cy5.

32 Determine channel medians. Table columns: Gene, Slide1Cy3, Slide1Cy5, Slide2Cy3, Slide2Cy5, with a final row of channel medians.

33 Subtract channel medians. Table columns: Gene, Slide1Cy3, Slide1Cy5, Slide2Cy3, Slide2Cy5. This is the data after median centering.
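The two steps above (determine each channel's median, then subtract it) can be sketched in Python as follows. The intensity values are hypothetical placeholders, not the numbers from the slides, and `median_center` is an illustrative name.

```python
from statistics import median
from typing import Dict, List


def median_center(channels: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Subtract each channel's own median from every value in that channel."""
    centered = {}
    for name, values in channels.items():
        channel_median = median(values)
        centered[name] = [v - channel_median for v in values]
    return centered


# Hypothetical log-scale intensities for five genes on four channels
data = {
    "Slide1Cy3": [4.0, 1.0, 6.0, 3.0, 8.0],
    "Slide1Cy5": [7.0, 2.0, 9.0, 5.0, 11.0],
    "Slide2Cy3": [5.0, 3.0, 10.0, 4.0, 6.0],
    "Slide2Cy5": [9.0, 8.0, 15.0, 7.0, 12.0],
}

for name, values in median_center(data).items():
    print(name, values, "median after centering:", median(values))
```

After centering, every channel has a median of 0, but the spread of each channel is unchanged.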

34 Before median centering. Box plots for Slide 1 (Cy3, Cy5) and Slide 2 (Cy3, Cy5); each box plot marks the maximum, Q3 (75th percentile), median, Q1 (25th percentile), and minimum.

35 Note that the medians match, but the variation seems to differ greatly across channels. Axis: log mean signal, centered at 0.

36 Normalization Methods: Quantile normalization. Quantile normalization is most commonly used to normalize Affymetrix data. It can be used for two-color data as well. Quantile normalization forces each channel to have the same quantiles.

37 Normalization Methods: Quantile normalization. A simple example. Table columns: Gene, Slide1Cy3, Slide1Cy5, Slide2Cy3, Slide2Cy5.

38 Find the smallest value for each channel.

39 Average these values: (sum of the four smallest values)/4 = 3.25.

40 Replace each of these values by the average, 3.25.

41 Find the next smallest values.

42 Average these values: (sum of the four values)/4 = 5.5.

43 Replace each value by the average.

44 Find the average of the next smallest values: (sum of the four values)/4 = 7.5.

45 Replace each value by the average.

46 Find the average of the next smallest values: (sum of the four values)/4 = 10.25.

47 Replace each value by the average.

48 Find the average of the next smallest values: (sum of the four values)/4 = 12.00.

49 Replace each value by the average. This is the data matrix after quantile normalization.
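The walk-through above (average the smallest values across channels, write that average back, then repeat for each next rank) can be sketched in Python as follows. The intensity values are hypothetical placeholders rather than the numbers from the slides, the function name is illustrative, and the sketch assumes there are no tied values within a channel.

```python
from typing import Dict, List


def quantile_normalize(channels: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Replace the r-th smallest value of every channel by the mean of the
    r-th smallest values across all channels (ties are not handled)."""
    names = list(channels)
    n_genes = len(channels[names[0]])
    sorted_columns = [sorted(channels[name]) for name in names]
    # Mean of the smallest values, then of the next smallest values, and so on
    rank_means = [
        sum(column[rank] for column in sorted_columns) / len(names)
        for rank in range(n_genes)
    ]
    normalized = {}
    for name in names:
        values = channels[name]
        # Gene indices ordered from the smallest to the largest value
        order = sorted(range(n_genes), key=lambda i: values[i])
        column = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            column[gene_index] = rank_means[rank]
        normalized[name] = column
    return normalized


# Hypothetical intensities for five genes on four channels
data = {
    "Slide1Cy3": [4.0, 1.0, 6.0, 3.0, 8.0],
    "Slide1Cy5": [7.0, 2.0, 9.0, 5.0, 11.0],
    "Slide2Cy3": [5.0, 3.0, 10.0, 4.0, 6.0],
    "Slide2Cy5": [9.0, 8.0, 15.0, 7.0, 12.0],
}

for name, values in quantile_normalize(data).items():
    print(name, values)
```

Each channel ends up with exactly the same set of values, so all quantiles match; only the gene that holds each value differs from channel to channel.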

50 Bonferroni correction: Corrected p-value = p-value * n (number of genes in test); a gene is called significant if its corrected p-value < 0.05. Example: n = 1000 genes, p-value = 0.00004, corrected p-value = 0.00004 * 1000 = 0.04 < 0.05
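The Bonferroni rule on this slide can be written as a minimal Python sketch. The function name and the cap at 1.0 are illustrative choices, and the example reuses n = 1000 genes and a corrected p-value of 0.04 from the slide.

```python
def bonferroni_adjust(p_value: float, n_tests: int) -> float:
    """Multiply the raw p-value by the number of tests, capping the result at 1."""
    return min(1.0, p_value * n_tests)


n_genes = 1000   # number of genes in the test
raw_p = 0.00004  # raw p-value of one gene
corrected = bonferroni_adjust(raw_p, n_genes)

print("Corrected p-value:", round(corrected, 6))
print("Significant at 0.05:", corrected < 0.05)
```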

51 Bonferroni Step-down (Holm) correction: Ref: Multiple Testing Corrections/Agilent Technologies, Inc. 2005
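A minimal Python sketch of the Holm step-down adjustment, written from the standard definition of the procedure rather than from the slide itself: p-values are ranked from smallest to largest, the smallest is multiplied by n, the next by n - 1, and so on, and a running maximum keeps the adjusted values in order. The function name is illustrative.

```python
from typing import List


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    n = len(p_values)
    # Indices of the p-values, from the smallest to the largest
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, index in enumerate(order):
        # Smallest p-value is multiplied by n, the next by n - 1, and so on
        candidate = min(1.0, (n - rank) * p_values[index])
        # An adjusted value may never be smaller than the one ranked before it
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


print(holm_adjust([0.01, 0.04, 0.03]))
```

Only the smallest p-value receives the full Bonferroni multiplier, so the procedure is slightly less stringent than the plain Bonferroni correction.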

52 Benjamini and Hochberg False Discovery Rate: Ref: Multiple Testing Corrections/Agilent Technologies, Inc. 2005
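A minimal Python sketch of the Benjamini and Hochberg adjustment, written from the standard definition of the procedure rather than from the slide itself: the p-value with rank i (1 = smallest) is multiplied by n / i, and a running minimum taken from the largest p-value downwards keeps the adjusted values in order. The function name is illustrative.

```python
from typing import List


def benjamini_hochberg_adjust(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    # Indices of the p-values, from the smallest to the largest
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down to the smallest
    for rank in range(n - 1, -1, -1):
        index = order[rank]
        candidate = p_values[index] * n / (rank + 1)
        # An adjusted value may never exceed the one ranked after it
        running_min = min(running_min, candidate)
        adjusted[index] = running_min
    return adjusted


print(benjamini_hochberg_adjust([0.01, 0.04, 0.03]))
```

Genes whose adjusted value falls below the chosen false discovery rate (for example 0.05) are called significant.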