STATC 141 Spring 2005, April 5th
Lecture notes on Affymetrix arrays

Materials are from http://www.ohsu.edu/gmsr/amc/amc_technology.html

The GeneChip high-density oligonucleotide arrays are fabricated by in-situ, light-directed synthesis of short oligonucleotide sequences on a small glass chip. This technique allows for the precise construction of a highly ordered matrix of DNA oligomers on the chip. In the GeneChip system, a known gene or potentially expressed sequence is represented on the chip by 11-20 unique oligomeric probes, each 25 bases in length. The group of probes corresponding to a given gene, or to a small group of highly similar genes, is known as the probe set. A probe set generally spans a region of about 600 bases, known as the target sequence.

Many copies of each oligomer are synthesized in discrete features (or cells) on the GeneChip array. In addition, for each oligomer on the array there is a matched oligomer, synthesized in an adjacent cell, that is identical except for a mismatched base at the central position (i.e., base 13). These are designated Perfect Match (PM) and Mismatch (MM) probes, respectively. The MM probes serve as a control for non-specific hybridization.
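The PM/MM relationship described above can be illustrated with a short Python sketch. It assumes the mismatch is the complementary base (A<->T, G<->C), which is how the glossary in these notes describes it; the function name and the example sequence are made up for illustration.

```python
# Base swaps used for the mismatch, assuming the homomeric rule
# given in the glossary of these notes (A<->T, G<->C).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(pm: str) -> str:
    """Return the MM sequence for a PM probe by swapping its central base."""
    pm = pm.upper()
    if len(pm) % 2 == 0:
        raise ValueError("probe length must be odd to have a central base")
    if set(pm) - set(COMPLEMENT):
        raise ValueError("probe must contain only A, C, G, T")
    mid = len(pm) // 2  # index 12 for a 25-mer, i.e. base 13 counting from 1
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]


# Hypothetical 25-base PM sequence, not taken from any real array.
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
```

Here `mm` differs from `pm` only at base 13, and applying `mismatch_probe` to `mm` recovers `pm`.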
Appendix (optional)

Assay Overview

The GeneChip arrays are scanned and the images processed using the Affymetrix software Microarray Suite (MAS 5.0). For more information on the GeneChip expression assay, please see http://www.affymetrix.com/support/technical/manual/expression_manual.affx
Data Overview

Affymetrix GeneChip experiments are managed with the Affymetrix Microarray Suite (MAS 5.0) software. The MAS software interfaces with the equipment to run a probe array experiment and is also used to generate preliminary analysis data from an experiment. Below we cover the basic files generated by MAS 5.0 and explain some of the most widely used variables that MAS produces.

MAS File Types

MAS 5.0 generates five file types during a GeneChip array experiment:

Experiment File (*.EXP): Contains the parameters of the experiment, such as Probe Array Type, Experiment Name, Equipment parameters, Sample Description, and others. This file is not used for analysis, but it is required to open the other MAS files for the designated chip experiment.

Image Data File (*.DAT): The image file generated by the scanner from the Probe Array after processing on the Fluidics Station. This file can be viewed in MAS 5.0 or exported as a *.TIFF image. MAS 5.0 uses it to generate the *.CEL file (see below).

Cell Intensity File (*.CEL): Contains the processed cell intensities from the primary image in the *.DAT file. The cell file can be viewed in MAS 5.0, but cannot be exported. MAS 5.0 uses the cell file to generate the *.CHP file, which contains the numerical results derived from the *.DAT and *.CEL files.

Probe Array Results File (*.CHP): The output file from the MAS expression analysis of the Probe Array. The chip file contains the data that will be used for statistical analysis and data mining.

Report File (*.RPT): Generated from the chip file. This expression report summarizes the expression analysis settings and the probe set hybridization intensity data.
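As a quick reference, the five file types above can be written as a small lookup table. This is only an illustrative Python sketch: the table contents are transcribed from the list above, while the function name and the example file names are hypothetical.

```python
from pathlib import Path

# Extension -> (file type, role), transcribed from the list in these notes.
MAS_FILE_TYPES = {
    ".exp": ("Experiment File", "experiment parameters; not used for analysis"),
    ".dat": ("Image Data File", "scanner image; used to generate the CEL file"),
    ".cel": ("Cell Intensity File", "processed cell intensities; used to generate the CHP file"),
    ".chp": ("Probe Array Results File", "output of the expression analysis"),
    ".rpt": ("Report File", "summary report generated from the CHP file"),
}


def describe_mas_file(filename: str) -> str:
    """Describe a MAS 5.0 file based on its extension (case-insensitive)."""
    ext = Path(filename).suffix.lower()
    if ext not in MAS_FILE_TYPES:
        raise ValueError(f"not a MAS 5.0 file type: {filename!r}")
    name, role = MAS_FILE_TYPES[ext]
    return f"{name} (*{ext.upper()}): {role}"


# Hypothetical file name, for illustration only.
print(describe_mas_file("sample01.CEL"))
```

The dictionary order follows the order in which the files are produced: EXP, DAT, CEL, CHP, RPT.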
MAS Analysis Metrics

Signal: A measure of the abundance of a transcript.

Detection: The call that indicates whether the transcript is detected (P, present), undetected (A, absent), or at the limit of detection (M, marginal).

Detection p-value: The p-value that indicates the significance of the detection call.

Signal Log Ratio: The change in expression level of a transcript between a baseline array and an experiment array. This change is expressed as a log2 ratio. A log2 ratio of 1 is equal to a fold change of 2.

Change: The call that indicates the change in the transcript level between a baseline and an experiment: increase (I), marginal increase (MI), no change (NC), marginal decrease (MD), or decrease (D).

Change p-value: The p-value that indicates the significance of the change call.

Each probe set on a GeneChip array has a unique name known as the Probe set ID. Probe set IDs have different extensions that denote important information about how the probe set was designed. The nomenclature for these extensions is given under Probe Set Extension Nomenclature.
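The relationship between Signal Log Ratio and fold change stated above can be sketched in Python. The positive case follows directly from the definition (a log2 ratio of 1 is a fold change of 2). The signed treatment of negative ratios, which reports a halving as -2 rather than 0.5, is an assumed convention that these notes do not specify, and the function name is hypothetical.

```python
def fold_change(signal_log_ratio: float) -> float:
    """Convert a log2 ratio into a signed fold change.

    Non-negative ratios give 2**slr. Negative ratios give -(2**-slr),
    an assumed convention so that decreases read as -2, -4, ...
    """
    if signal_log_ratio >= 0:
        return 2.0 ** signal_log_ratio
    return -(2.0 ** -signal_log_ratio)


# A few illustrative values.
for slr in (2, 1, 0, -1):
    print(slr, fold_change(slr))
```

Under this convention no fold change falls strictly between -1 and 1, so increases and decreases of the same size have the same magnitude.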
Probe Set Extension Nomenclature

All probe sets have one of the following two extensions:

_at : anti-sense target (most probe sets on the array).
_st : sense target (only some control probes are in sense orientation on the array).

A few probe sets are designated as follows:

_i : reduced number of pairs in the probe set.

Some probe sets represent more than one gene or EST:

_s_at : designates probe sets that share common probes among multiple transcripts from different genes.
_a_at : designates probe sets that recognize multiple alternative transcripts from the same gene (on HG-U133 these probe sets have an "_s" suffix).
_x_at : designates probe sets where it was not possible to select either a unique probe set or a probe set with identical probes among multiple transcripts. Rules for cross-hybridization were dropped, so these probe sets may cross-hybridize in an unpredictable manner with other sequences.
_g_at : similar genes; unique probe sets for them also exist elsewhere on the array.
_f_at : similarity rules dropped; the probe set will recognize more than one gene.
_i_at : designates sequences for which there are fewer than the required number of unique probes specified in the design.
_b_at : all probe selection rules were ignored. Withdrawn from GenBank.
_l_at : sequence represented by more than 20 probe pairs.
_r_ : designates sequences for which it was not possible to pick a full set of unique probes using Affymetrix probe selection rules. Probes were picked after dropping some of the selection rules.

Most of the descriptions of the probe set ID extensions above were taken from the Affymetrix GeneChip Expression Analysis Data Analysis Fundamentals.

Glossary of Analysis Terms

Target: Fragmented, biotinylated anti-sense cRNA prepared from the mRNA to be analyzed. Target molecules are hybridized to the probe array, and the levels of hybridization are measured with the GeneArray scanner after the array is stained with streptavidin-phycoerythrin (SAPE).
Probe: Single-stranded DNA oligonucleotide synthesized directly on the surface of the GeneChip array using photolithography and combinatorial chemistry. The 25-base oligonucleotide is designed to be complementary to a specific gene transcript.

Probe Cell: Single square-shaped feature on an array containing probes with a unique sequence. The size varies with the array type, typically 20 µm or 18 µm. Each probe cell contains millions of probe molecules.

Perfect Match (PM): Probes that are designed to be complementary to a reference sequence.

Mismatch (MM): Probes that are designed to be complementary to a reference sequence except for a homomeric mismatch at the central position (e.g., at the 13th position of a 25-base probe: A->T or G->C). Mismatch probes serve as a control for cross-hybridization.

Probe Pair: Two probe cells, a PM and its corresponding MM. On the probe array, a probe pair is arranged with a PM cell directly above an MM cell.

Probe set: A set of probes designed to detect one transcript. A probe set usually consists of 11-20 probe pairs. For example, an 11 probe pair set is made up of 11 PM probes and 11 MM probes, for a total of 22 probe cells. Newer array designs from Affymetrix, e.g., HG-U133, contain probe sets with 11 probe pairs. Older designs average 16 or 20 probe pairs per probe set.

Target Sequence: The portion of a transcript reference sequence that is interrogated by a probe set on the array. The target sequence extends from the first base of the 5'-most probe to the last base of the 3'-most probe.

Absolute Analysis: An analysis of a single GeneChip array using Affymetrix Microarray Suite software. The software applies an algorithm developed by Affymetrix to determine the expression level for each gene represented on the array.

Analysis Metrics: Probe set performance descriptors calculated by the software from measured probe cell intensities. Analysis metrics are used to determine biologically meaningful results, such as the presence or absence of gene transcripts.

Analysis Parameters: Variables with user-defined values used in the expression analysis (default values in the software are empirically determined at Affymetrix).

*More extensive glossaries can be found in the Statistical Algorithms Reference Guide and Data Analysis Fundamentals, available on the Affymetrix website (www.affymetrix.com).
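Returning to the probe set extension nomenclature given earlier in these notes, the following Python sketch shows one way to split a Probe set ID into its base name, optional one-letter modifier, and orientation. It is not an Affymetrix tool: the function name, the regular expression, and the example IDs are assumptions for illustration, and it handles only the `_at`/`_st` forms with a single lowercase modifier letter.

```python
import re

ORIENTATION = {"at": "anti-sense target", "st": "sense target"}

# One-letter modifiers, summarized from the nomenclature in these notes.
MODIFIER = {
    "s": "shares common probes among transcripts from different genes",
    "a": "recognizes alternative transcripts from the same gene",
    "x": "cross-hybridization rules dropped",
    "g": "similar genes; unique probe sets elsewhere on the array",
    "f": "similarity rules dropped; recognizes more than one gene",
    "i": "fewer unique probes than the design requires",
    "b": "all probe selection rules ignored",
    "l": "more than 20 probe pairs",
    "r": "some probe selection rules dropped",
}

# Optional "_<letter>" followed by "_at" or "_st" at the end of the ID.
SUFFIX_RE = re.compile(r"(?:_([a-z]))?_(at|st)$")


def parse_probe_set_id(probe_set_id: str) -> dict:
    """Split a probe set ID into base name, modifier, and orientation."""
    match = SUFFIX_RE.search(probe_set_id)
    if match is None:
        raise ValueError(f"no _at/_st extension in {probe_set_id!r}")
    modifier, orientation = match.groups()
    return {
        "base": probe_set_id[:match.start()],
        "modifier": modifier,
        "modifier_meaning": MODIFIER.get(modifier, "unrecognized") if modifier else None,
        "orientation": ORIENTATION[orientation],
    }


# Example IDs written in the usual format, for illustration.
for example in ("1007_s_at", "1053_at"):
    print(example, parse_probe_set_id(example))
```

A single lowercase letter directly before `_at` is always read as a modifier here, so a base name that happens to end in such a letter would be misclassified; checking IDs against the array's annotation file avoids that ambiguity.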