Original summary generated in 2003 Updated in 2007

Size: px

Start display at page:

Download "Original summary generated in 2003 Updated in 2007"

Chester Sharp
6 years ago
Views:

1 AFFYMETRIX ARABIDOPSIS GENOME ARRAY DESIGN, ANNOTATION, AND ANALYSIS Original summary generated in 2003 Updated in 2007 Brandon Le Javier Wagmaister Anhthu Bui Arabidopsis Array Annotation 1

2 TABLE OF CONTENTS I. Array Information & Design 3 A. Common Definition & Terminology 3 B. Probe Set Nomenclature 4 II. Annotation of the ATGENOME1 Array (2001) 6 A. Array Information 6 B. Array Features 6 C. Array Annotation Strategy 7 D. Array Annotation Summary 8 III. Annotation of the ATH1 Array (2003) 11 A. Array Information 11 B. Array Features 11 C. Array Annotation Strategy 12 D. Array Annotation Summary 13 IV. Comparison of the ATGENOME1 and ATH1 Arrays 17 A. Comparison of Array Features 17 B. Mapping of Probe Sets Across the Two Arrays 17 V. Re-Annotation of the ATH1 Array (2007) 19 A. Motivation for Re-Annotation Efforts 19 B. Array Re-Annotation Strategy 19 C. Array Re-Annotation Summary 21 D. Comparison of Old and New Annotations 25 Arabidopsis Array Annotation 2

I. ARRAY INFORMATION & DESIGN This section contains information about the design of the GeneChip array and general interpretation of features on the array. A. Common Definition & Terminology Figure 1.

3 I. ARRAY INFORMATION & DESIGN This section contains information about the design of the GeneChip array and general interpretation of features on the array. A. Common Definition & Terminology Figure 1. Affymetrix GeneChip Probe Design. A. GeneChip terminology. B. Schematic of probe design. Probe: A single stranded DNA oligonucleotide designed to match a specific mrna sequence. GeneChip probe arrays use oligonucleotide probes that are up to 25 bases long (Figure 1). The probes are synthesized directly on the surface of the array using photolithography and combinatorial chemistry. Probe Cell: A single square-shaped feature on an array containing one type of probe. The size can vary depending on the array type, but the AtGenome1 (8K) and ATH1 (22K) array contains 24 and 18 µm probe cells, respectively. Each probe cell contains millions of probe molecules representing a unique gene-specific 25-mer oligo (Figure 1). Perfect Match (PM): Probes that are designed to be exactly the same as the reference sequence (Figure 1). Mismatch (MM): Probes that are designed to be exactly the same as the reference sequence except for a homomeric mismatch at the central position. Mismatch probes serve as a control for cross-hybridization (Figure 1). Arabidopsis Array Annotation 3

4 Probe Pair: Consists of two probe cells, a PM and the corresponding MM probe cells. On the array, a probe pair is arranged with a PM cell directly above the MM cell (Figure 1). Probe Set: A set of probes designed to detect one transcript. A probe set usually consists of probe pairs. The AtGenome1 (8K) and ATH1 (22K) array contains probe sets with 16 and 11 probe pairs, respectively (Figure 1). B. Probe Set Nomenclature Each probe set is assigned an Affymetrix serial number ID (five and six digits for the AtGenome1 and ATH1 array, respectively) (see below). IDs containing AFFX represents control features on the array. Each ID is followed by suffixes that denotes the type of sequence each probe set represents. Detailed description of each suffixes is presented in Table 1 and Figure 2. Table 1. Description of probe set suffixes AFFYMETRIX SERIAL NUMBER ID_SUFFIXES S U F F I X E S A R R A Y * D E S C R I P T I O N _at 8K & 22K A unique probe set designed to be complementary to the crna target. ALL probe sets on the array have an _at designation. _st 8K Probe sets (obsolete in the 22K) _s_at 8K (Similarity) Probe sets that corresponds to a small number of unique genes that share identical sequence. Probes were chosen from regions that are common to these genes. Group members can be singleton or group of sequences. The sequences that make up these probe sets are BAC sequences with corresponding mrna sequences or overlapping BAC sequences. The sequences are IDENTICAL. _g_at (obsolete in the 22K) _f_at (obsolete in the 22K) 8K 8K (Group) Probes chosen in region of overlap. Sequences are represented as singletons on the same probe array. These probe sets have sequence with overlapping concensus. (Family) Probe sets that corresponds to sequences for which it was not possible to pick a full set of 16 unique and/or shared-similarity-constrained probes. Some probes are similar but not necessarily identical to other gene sequences. Arabidopsis Array Annotation 4

S U F F I X E S _i_at (obsolete in the 22K) _r_at (obsolete in the 22K) A R R A Y * D E S C R I P T I O N 8K (Incomplete) Sequences for which there are fewer than 15 8K unique probes.

5 S U F F I X E S _i_at (obsolete in the 22K) _r_at (obsolete in the 22K) A R R A Y * D E S C R I P T I O N 8K (Incomplete) Sequences for which there are fewer than 15 8K unique probes. (Rules Dropped) Sequences for which it was not possible to pick a full set of unique probes using Affymetrix probe rules. Probes were picked by dropping some of these rules. _a_at 22K A probe set that recognizes alternative transcripts from the same gene (Figure 3). _s_at 22K A probe set with all probes common among multiple transcripts within a gene family (these probe sets can detect members of a gene family - Figure 3). _x_at 22K A probe sets with some probes that are identical, or highly similar, to unrelated sequences. These probes may crosshybridize in an unpredictable manner with sequences other than the main target. Data generated from these probe sets should be interpreted with caution, due to the likelihood that some of the signal is from transcripts other than the one being intentionally measured. * 8K refers to the first generation Arabidopsis array (AtGenome1) representing ~8,000 genes. 22K refers to the second generation Arabidopsis array (ATH ) representing ~25,000 genes (commonly referred as the whole genome array. Figure 2. Possible Relationships Between Sequences And Probe Sets on an Array. G1, G2, etc., represent families of related gene sequences; S1, S2, S3, etc., represent individual gene sequences; PS1, PS2, PS3, etc., represent probe sets; P1, P2, P3, etc., represent probes. Solid lines show how sequence sets or sequences are related to probe sets or individual probes. If a solid line is connected to a box, the relationship applies to all probes in that box. If it is connected to an individual circle, the relationship holds only for that probe. (Figure and figure legend taken from Redman et al., Plant J 38(3) (2004)). Arabidopsis Array Annotation 5

6 II. ANNOTATION OF THE ATGENOME1 ARRAY (2001) A. Array Information The Affymetrix Arabidopsis Genome GeneChip (AtGenome1) array is the first generation Arabidopsis array designed by Affymetrix in collaboration with Novartis Agriculture Discovery Institute, Inc (NADII). There are ~ 8200 genes represented on the Affymetrix Arabidopsis GeneChip. Eighty percent of the features are predicted coding sequences from genomic bacterial artificial chromosome (BAC) while 20% of the features are represent high quality cdna sequences. Furthermore, there are > 100 EST clusters sharing homology with the predicted coding sequences from BAC clones. Each gene (probe set) is represented by sixteen probe pairs, with each probe consisting of 25-mer oligos localized to a 24µm feature size. B. Array Features We fully characterized the array by first examining the total number of features and genes represented on the array. Then, we examined the types of probe sets represented on the array based on the suffixes used to distinguish different probe types. B1. How many features are represented on the array? We counted the total number of probe sets on the array. Control probe sets have the AFFX designation in the header. Table 2. Total number of features represented on the array Description # Probe Sets Arabidopsis Genes 8,247 Proprietary Features 578 Controls 50 Total 8,875* * Total number of probe sets based on the paper by Zhu and Wang, Plant Physiol. 124, (2000). The actual number of probe sets we have information for is 8,297. B2. What are the representation of different suffixes on the array? Probe IDs were extracted from the annotation file provided by Affymetrix. Using Microsoft Excel and the COUNTIF function, we determined the distribution of different sufffixes on the array. This analysis only considers the 8,247 non-proprietary features on the array excluding controls (Table 2). Arabidopsis Array Annotation 6

7 Table 3. Distribution of suffixes on the array Suffix # Probe Sets _at 6,242 _s_at 1,423 _i_at 208 _g_at 196 _f_at 101 _r_at 77 Total 8,247 C. Array Annotation Strategy During the initial release of this array, very limited information was provided by Affymetrix and TMRI. We had to contact many people to obtain enough information to characterize and annotate the array (Figure 3). 1. We contacted Stacey Harmer, a post-doc in Steve Kay s laboratory, after reading her paper using this array (Harmer et al., Science 290, (2000)). We were able to obtain information regarding the sequences of the genes represented on the array. 2.We were directed to the Torrey Mesa Research Institute (TMRI) website ( gene_exp/index.html), a division of Novartis that supplied Affymetrix with the sequences used to design the oligos on the GeneChip. 3. Sequences were downloaded from the site and we manually annotated the genes, assigning functional categories based on BLAST results, published literatures, internet searches ( molecular and biochemistry books, etc. (similar to our approach with EST sequence analysis). 4. Ceres provided a small file containing annotations for some of the genes on the GeneChip. 5. Using all the resources we had, we manually annotated all 8200 genes assigning each gene into a functional category based on the EU Arabidopsis Sequencing Project (Figure 3). 6. Probe sets representing predicted or hypothetical proteins that couldn t be assigned into a category were placed in the Unknown Function category. Arabidopsis Array Annotation 7

8 Figure 3. AtGenome1 Array Annotation Strategy D. Array Annotation Summary Using the strategy presented in Figure 3, we manually annotated sequences on the array and assigned functional categories to each sequence based on BLAST results, references, web searches, etc. For genes that encode transcription factors, we assigned the keyword transcription factor with a description of the transcription factor family i.e. myb, homeobox, AP2 domain containing protein, etc. The results are summarized in Tables 4 & 5 and Figures 4 & 5. Arabidopsis Array Annotation 8

Figure 4. Distribution of 8,247 Features on the AtGenome1 Array. Table 4. Functional category distribution of all probe sets Functional Category # Probe Sets % Total Cell Growth & Division 172 2.

9 Figure 4. Distribution of 8,247 Features on the AtGenome1 Array. Table 4. Functional category distribution of all probe sets Functional Category # Probe Sets % Total Cell Growth & Division % Cell Structure % Disease & Defense % Energy % Intracellular Traffic % Metabolism % Protein Destination & Storage % Protein Synthesis % Secondary Metabolism % Signal Transduction % Transcription & Post-Transcription % Transporter % Transposon % Unknown Function % Total 8,247 Arabidopsis Array Annotation 9

10 Figure 5. Distribution of Transcription Factor Families on the AtGenome1 Array. Table 5. Distribution of transcription factor families on the array Transcription Factor Families # Probe Sets % Total AP2 Domain Containing Protein % Basic Leucine Zipper (bzip) % Basic Helix-Loop-Helix (bhlh) % CCAAT-box % DNA-Binding Protein % General Transcription Factor % Heat Shock % Homeobox % MADS-box % Myb % No Apical Meristem (NAM) % Scarecrow % Squamosa % WRKY % Zinc Finger % TFs represented by ONE probe set % TFs represented by 2-10 probe sets % Total 704 Arabidopsis Array Annotation 10

11 III. ANNOTATION OF THE ATH1 ARRAY (2003) A. Array Information The Affymetrix Arabidopsis ATH1 Genome GeneChip (ATH ) array is the second generation Arabidopsis array designed by Affymetrix in collaboration with The Institute for Genomic Research (TIGR). There are 22,746 probe sets representing 23,734 genes in the Arabidopsis genome (Redman et al., Plant J. 38(3) (2004)). 26,200 genes were available in the TIGR-ATH1 database as of December 15, To represent as many gene sequences on the array as possible, non-unique probe sets (_s_at) were used to represent two or more highly similar genes. Preference was given to genes where there was evidence of expression, supported by database matches and robust gene models. Each gene (probe set) is represented by eleven probe pairs, with each probe consisting of 25-mer oligos localized to an 18µm feature size. For more information about the array design, please read Redman et al., Plant J. 38(3) (2004). B. Array Features Similar to the analysis carried out in Section II, we wanted to fully understand the nature of the ATH1 array by characterizing the features and representation of probe sets on the array. B1. How many features are represented on the array? We counted the total number of probe sets on the array. Control probe sets have the AFFX designation in the header. Table 6. Total number of features represented on the array Description # Probe Sets Arabidopsis Genes 22,746 Controls 64 Total 22,810 B2. What are the representation of different suffixes on the array? Probe IDs were extracted from the annotation file provided by Affymetrix. Using Microsoft Excel and the COUNTIF function, we determined the distribution of different sufffixes on the array. This analysis only considers the 22,746 probe sets representing Arabidopsis genes and does not include control features. Arabidopsis Array Annotation 11

12 Table 7. Distribution of suffixes on the array Suffix # Probe Sets _at 21,685 _s_at 935 _x_at 126 Total 22,746 C. Array Annotation Strategy Unlike the AtGenome1 array, sequences on the ATH1 array were freely available and accessible. We used a similar strategy as the AtGenome1 array to annotate the ATH1 array. The steps taken to annotate the ATH1 array are summarized in Figure 6 and are elaborated in more details below: 1. We first downloaded the target sequences of each probe set on the array from the Affymetrix site ( These target sequences represent the last 600 bp of the coding sequence. Note: 3 untranslated regions (UTRs) were not included in the probe design. 2. Sequences were blasted against NCBI non-redundant database using the blastx program. 3. Using results from the blast run and descriptions provided by Affymetrix for each probe set, manually annotate each features using classification based on the EU Arabidopsis Sequencing Project. To expedite the annotation process, we used the functional category assignment of 6,714 probe sets from the AtGenome1 array that have comparable match to probe sets on the ATH1 array. 4. We marked probe sets representing ~5,000 full-length cdnas provided by Ceres to TIGR for gene annotation analysis. 5. We also included information of probe sets representing genes with known mutant phenotypes extrapolated from David Meinke s 2003 Plant Physiology paper (Meinke et al., Plant Physiol. 131, 2003). 6. Genes encoding transcription factors and signal transduction proteins were further classified into transcription factor families and signal transduction categories (e.g. kinases, phosphatases, etc.) based on published information, molecular biology books, and the web. Arabidopsis Array Annotation 12

7. There are three categories for unclassified proteins: unclassified - hypothetical proteins with no cdna support, unclassified - hypothetical proteins with cdna support, and unclassified - proteins

13 7. There are three categories for unclassified proteins: unclassified - hypothetical proteins with no cdna support, unclassified - hypothetical proteins with cdna support, and unclassified - proteins with unknown function. a. Unclassified - hypothetical proteins with no cdna support category includes sequences with description pertaining to hypothetical proteins or predicted proteins without any EST evidence. b. Unclassified - hypothetical proteins with cdna support category includes sequences with description indicating expressed proteins or predicted proteins supported by ESTs or Ceres s full-length cdna. c. Unclassified - proteins with unknown function category includes sequences representing proteins with known domains (i.e. WD40, DUF) but cannot be assign a category. Figure 6. ATH1 Array Annotation Strategy D. Array Annotation Summary Summaries of functional category distribution on the array were tallied up using Microsoft Excel and are shown in Tables 8 & 9 and Figures 7 and 8. Arabidopsis Array Annotation 13

14 Figure 7. Distribution of 22,746 Features on the ATH1 Array. Table 8. Functional category distribution of all probe sets on the ATH1 array Functional Category # Probe Sets % Total Cell Growth & Division % Cell Structure % Disease & Defense % Energy % Intracellular Traffic % Metabolism % Post-Transcription % Protein Destination & Storage % Protein Synthesis % Secondary Metabolism % Signal Transduction % Transcription % Transporter % Transposon % Unclassified - Proteins With cdna Support % Unclassified - Proteins With NO cdna Support % Unclassified - Proteins With Unknown Function % Total 22,746 Arabidopsis Array Annotation 14

15 Figure 8. Distribution of Transcription Factor Families Represented on the ATH1 Array. Table 9. Distribution of transcription factor families on the ATH1 array Functional Category # Probe Sets % Total AP2/EREBP % ARF % AUX/IAA % B3 Domain % Basic Helix-Loop-Helix (bhlh) % Basic Leucine Zipper (bzip) % CCAAT % CCR4-Associated 8 0.5% GATA % General % GRAS % Heat Shock % Homeodomain % LIM Domain 4 0.3% MADS-Box % MYB % NAC Domain % Polycomb group % Squamosa Promoter-Like % Arabidopsis Array Annotation 15

16 Functional Category # Probe Sets % Total SWI2/SNF % Unclassified % WRKY % YABBY 3 0.2% Zinc Finger % Total 1484 Arabidopsis Array Annotation 16

17 IV. COMPARISON OF THE ATGENOME1 AND ATH1 ARRAYS A. Comparison of Array Features We carried out experiments using both the first (AtGenome1) and second (ATH1) generation arrays for RNA profiling analysis. In order to correlate the results generated on the AtGenome1 versus ATH1 arrays, we needed to find the homologous sequence or probe sets represented on both arrays. Fortunately, Affymetrix carried out an extensive same-species array comparison (see the comparison sheet manual for more details: manual/comparison_spreadsheets_manual.pdf). The similarities and differences between the AtGenome1 and ATH1 arrays are summarized in Table 10. Table 10. AtGenome1 and ATH1 arrays comparison Array Features AtGenome1 (8K) ATH1 (22K) Number of Probe Sets 8,247 22,746 Oligonucleotide Length 25-mer 25-mer Number of Probe Pairs Probe Pair Distribution Contiguous Scattered Note: Probe pairs for a given probe set were designed to be scattered across the ATH1 array to ensure that information for a probe set would not be lost if a GeneChip is damaged locally in one section of the array. However, in the AtGenome1 array, since all probe pairs are next to each other, if the probe set is in a damaged area, then the probe set can no longer be used. B. Mapping of Probe Sets Across the Two Arrays Due to the dynamic nature of the public databases, probe sets between the two Arabidopsis arrays will not be identical. In some cases the same sequences will be represented by completely different probe sets, creating a challenge when comparing data sets generated on different generations. Table 11 shows results from the comparison analysis by Affymetrix. Table 11. Affymetrix Array Comparison Match Description* # Probe Sets Best Probe Match 6,714 Good Probe Match 6,731 Complex Probe Match 8,312 No Probe Match 516 Arabidopsis Array Annotation 17

18 * A detailed description of each probe match criteria is available from Affymetrix ( Figure 9 illustrates the relationships between probe sets from two arrays and the designation for each probe set. L&2(? " #! #. * ",& ",2 L&2(? " #H#. * ",& ",2 GR#/,%5#/1' PR#/ SR#/ =''0!>#/?+ ;R#/ TR#/ MR#/ *+%!('55'C12$!%2/,1%8!C155!#..%#,!12 /+%!01((%,%2/!8.,%#08+%%/8. * / 0;&< M**4#G 1,= - H&2,#G 1,= - GR#/ P4#/ GR#/ SR#/ GR#/ SR#/ GR#/ &R#/ GR#/ &R#/ GR#/ TR#/ 8 *#G 1,= - ;R#/ Figure 9. Possible Probe Set Relationships Between Arrays. In the case of the Arabidopsis arrays, Design A and B refers to the AtGenome1 and ATH1 arrays, respectively. The type of match (Best, Good, Complex) depends on the level of stringency applied to a probe set. This analysis was done at the probe level (25- mer) and not at the consensus or target sequence level. Arabidopsis Array Annotation 18

19 V. RE-ANNOTATION OF THE ATH1 ARRAY (2007) A. Motivation for Re-Annotation Efforts The Arabidopsis ATH1 array was annotated in 2003 using all the publicly available resources at the time (see Section III). The annotations were used to quickly annotate the Soybean Genome array (see the Soybean Genome Array Annotation summary for more details). In order to keep up with the increasing amount of information generated within the past four years since the annotation of the ATH1 array, we decided to re-annotate the ATH1 array in parallel with the soybean genome array. B. Array Re-Annotation Strategy The strategy for the re-annotation of the ATH1 array is summarized in Figure 10 below. Our strategy is as follows: 1. We updated the descriptions for each probe set on the array using TAIR Affy array descriptions (affy_ath1_array_elements txt). The description file was downloaded from the TAIR web site: ftp://ftp.arabidopsis.org/home/tair/microarrays. Descriptions were based on the latest release of the Arabidopsis genome TAIR 7 (released ). Note from TAIR: The mapping to the TAIR7 Transcripts was performed using the BLASTN program with e-value cutoff < 9.9e-6. For the 25-mer oligo probes used on the Affy chips, the required match length to achieve this e-value is 23 or more identical nucleotides. To assign a probe set to a given locus, at least 9 of the probes included in the probe set were required to match a transcript at that locus. Otherwise, the probe set was not assigned a locus and was given the description no match. 2. In addition to updating the descriptions for each probe set, we also updated gene ontology (GO) information provided by Affymetrix. 3. We gathered information about putative transcription factors from many publicly available TF database for Arabidopsis including: a. AGRIS - Arabidopsis Gene Regulatory Information Server ( b. DATF - Database of Arabidopsis Transcription Factors ( c. RARTF - Riken Arabidopsis Transcription Factor Database ( d. ArabTFDB - Arabidopsis Transcription Factor Database ( Arabidopsis Array Annotation 19

20 Transcription factors and transcription factor families were associated with each probe set on the array. Information obtained from points 1-3 were compiled together into an annotation file containing the 2003 ATH1 annotations. Transcription factors were automatically updated based on the information obtained from the databases in point We focused on probe sets that were previously assigned into the unclassified category. The rationale is that many of the sequences in the unclassified category might have update information that can be used to re-assign into a different category. Sequences previously assigned categories of protein synthesis or metabolism most likely will not change. Therefore, we first focused on re-assigning the 11,145 probe sets classified as unclassified in After the unclassified category was re-examined, we decided to re-examine the entire 22,746 probe sets on the array for consistent assignment of functional categories. We sorted all the probe sets by their description and made sure that probe sets with similar descriptions are assigned the same functional category. 6. We further examined the unclassified category that is divided into three groups as mentioned in Section IIIC. We obtained several files from TAIR that will distinguish the different sequences within the unclassified category. We downloaded several files from the TAIR site including: a. TAIR7_protein_coding_no_transcript_support_09_30_07 b. TAIR7_protein_coding_with_transcript_support_09_30_07 c. TAIR7_unknown_proteins_no_transcript_support_09_30_07 d. TAIR7_proteins_of_undefined_function_03_07 e. TAIR7_unknown_proteins_03_07 f. TAIR7_locus_type These files were compiled into one main table listing all the transcripts detected and/or predicted in the Arabidopsis genome. This list helps distinguish if a sequence has cdna support, represents a pseudogene/transposon, or is unknown. These files help re-assign the probe sets into appropriate unclassified categories. Arabidopsis Array Annotation 20

21 Figure 10. ATH1 Array Re-annotation Strategy C. Array Re-Annotation Summary The final annotation file contains a list of all the probe sets with annotation information from 2003 and the latest information (2007). Arabidopsis ATH1 Array Annotation version 2.4 is the current version used for all summaries. Tables 12 and Figure 11 summarizes functional category assignment of all probe sets on the array. Table 13 and Figure 12 summarizes all identified transcription factor probe sets, on the array. Table 12. Functional category distribution of all probe sets on the ATH1 array Functional Category # Probe Sets % Total Cell Growth & Division % Cell Structure % Disease & Defense % Energy % Intracellular Traffic % Metabolism % Arabidopsis Array Annotation 21

Functional Category # Probe Sets % Total Post-Transcription 668 2.9% Protein Destination & Storage 1573 6.9% Protein Synthesis 616 2.7% Pseudogene 320 1.4% Secondary Metabolism 458 2.

22 Functional Category # Probe Sets % Total Post-Transcription % Protein Destination & Storage % Protein Synthesis % Pseudogene % Secondary Metabolism % Signal Transduction % Transcription % Transporter % Transposon % Unclassified - Proteins With cdna Support % Unclassified - Proteins With NO cdna Support % Unclassified - Proteins With Unknown Function % Total 22,746 Figure 11. Distribution of 22,746 Features on the Re-Annotated ATH1 Array (2007). Table 13. Distribution of transcription factor probe sets on the array. Transcription Factor Families Pie Groups # Probe Sets % Total ABI3-VP1 ABI3-VP % AP2-EREBP AP2-EREBP % ARF ARF % Arabidopsis Array Annotation 22

23 Transcription Factor Families Pie Groups # Probe Sets % Total ARR-B ARR-B % AS2 AS % AUX-IAA AUX-IAA % bhlh bhlh % bzip bzip % CCAAT-Dr1 CCAAT 2 0.1% CCAAT-HAP2 CCAAT % CCAAT-HAP3 CCAAT % CCAAT-HAP5 CCAAT % G2-like G2-like % General General % GRAS GRAS % HB HB % HSF HSF % MADS MADS % MYB-related MYB % MYB MYB % NAC NAC % LFY Others 1 0.1% NZZ Others 1 0.1% SAP Others 1 0.1% ULT Others 1 0.1% LUG Others 2 0.1% GIF Others 3 0.2% MBF1 Others 3 0.2% S1Fa-like Others 3 0.2% Whirly Others 3 0.2% CSD Others 4 0.2% BZR-BES1 Others 5 0.3% Pseudo ARR-B Others 5 0.3% BBR-BPC Others 6 0.3% CPP Others 6 0.3% EIL Others 6 0.3% Sigma70-like Others 6 0.3% CAMTA Others 7 0.4% E2F-DP Others 7 0.4% PLATZ Others 7 0.4% GRF Others 8 0.4% TAZ Others 8 0.4% ARID Others % Nin-like Others % TUB Others % Arabidopsis Array Annotation 23

24 Transcription Factor Families Pie Groups # Probe Sets % Total HMG Others % FHA Others % LIM Others % GeBP Others % SBP Others % TCP Others % JUMONJI Others % PcG PcG % Trihelix Trihelix % Unclassified Unclassified % WRKY WRKY % VOZ ZF 2 0.1% HRT ZF 3 0.2% C2C2-YABBY ZF 5 0.3% SRS ZF 6 0.3% Alfin ZF 7 0.4% ZIM ZF % ZF-HD ZF % C2C2-GATA ZF % C2C2-Dof ZF % C2C2-CO-like ZF % PHD ZF % C3H ZF % C2H2 ZF % Total 1954 Arabidopsis Array Annotation 24

Figure 12. Pie Distribution of Transcription Factors Represented on the array. Other represents TF families with fewer than 20 members represented on the array (see Table13). D. Comparison of Old and New Annotations A.

25 Figure 12. Pie Distribution of Transcription Factors Represented on the array. Other represents TF families with fewer than 20 members represented on the array (see Table13). D. Comparison of Old and New Annotations A. We first examined the overall difference in annotations carried out in 2003 and Notice that there is a 83.6% reduction of the Unclassified - Proteins with NO cdna Support category. Interestingly, we see more than a 100% increase in probe sets classified in the Energy and Intracellular Traffic categories. Overall, a moderate increase occurred for each functional categories. Table 14. Comparison of old and new annotation of the ATH1 array Functional Categories % Change Cell Growth & Division % Cell Structure % Disease & Defense % Energy % Intracellular Traffic % Metabolism % Post-Transcription % Protein Destination & Storage % Protein Synthesis % Arabidopsis Array Annotation 25

26 Functional Categories % Change Pseudogene INF Secondary Metabolism % Signal Transduction % Transcription % Transporter % Transposon % Unclassified - Proteins With cdna Support % Unclassified - Proteins With NO cdna Support % Unclassified - Proteins With Unknown Function % Total B. We next examined what changed within each of the three unclassified categories shown earlier. Tables shows the assignment of probe sets previously classified as unclassified. Table 15. Changes within the unclassified - proteins with NO cdna support category (2003). Functional Category # Probe Sets % Total Cell Growth & Division % Cell Structure % Disease & Defense % Energy % Intracellular Traffic % Metabolism % Post-Transcription % Protein Destination & Storage % Protein Synthesis % Pseudogene % Secondary Metabolism % Signal Transduction % Transcription % Transporter % Transposon % Unclassified - Proteins With cdna Support % Unclassified - Proteins With NO cdna Support % Unclassified - Proteins With Unknown Function % Total 6381 Table 16. Changes within the unclassified - proteins with cdna support category (2003). Arabidopsis Array Annotation 26

27 Functional Category # Probe Sets % Total Cell Growth & Division % Cell Structure % Disease & Defense % Energy % Intracellular Traffic % Metabolism % Post-Transcription % Protein Destination & Storage % Protein Synthesis % Pseudogene 4 0.1% Secondary Metabolism % Signal Transduction % Transcription % Transporter % Transposon 6 0.2% Unclassified - Proteins With cdna Support % Unclassified - Proteins With NO cdna Support 0 0.0% Unclassified - Proteins With Unknown Function % Total 3229 Table 17. Changes within the unclassified - protein of unknown function category (2003). Functional Category # Probe Sets % Total Cell Growth & Division % Cell Structure % Disease & Defense % Energy % Intracellular Traffic % Metabolism % Post-Transcription % Protein Destination & Storage % Protein Synthesis 6 0.4% Pseudogene 8 0.5% Secondary Metabolism % Signal Transduction % Transcription % Arabidopsis Array Annotation 27

28 Functional Category # Probe Sets % Total Transporter % Transposon 0 0.0% Unclassified - Proteins With cdna Support % Unclassified - Proteins With NO cdna Support % Unclassified - Proteins With Unknown Function % Total 1535 Arabidopsis Array Annotation 28

STATC 141 Spring 2005, April 5 th Lecture notes on Affymetrix arrays. Materials are from

STATC 141 Spring 2005, April 5 th Lecture notes on Affymetrix arrays Materials are from http://www.ohsu.edu/gmsr/amc/amc_technology.html The GeneChip high-density oligonucleotide arrays are fabricated