Targeted Proteomics for the Identification of Nucleic Acid Binding Proteins in E. coli: Finding more low abundance proteins using the QSTAR LC/MS/MS System, Multi-Dimensional Liquid Chromatography, Pro Automate Software, and the Celera Discovery System™

Christie Hunter, Lydia Nuwaysir
Applied Biosystems, CA, USA

Introduction

Two-dimensional gel electrophoresis coupled with mass spectrometry is regarded as a powerful tool for the separation and identification of proteins in complex samples. Despite its high resolving power, the technique has limitations for the separation and identification of membrane and low abundance proteins. Multidimensional liquid chromatography (MDLC) coupled to mass spectrometry provides an alternative to this approach, and allows access to these proteins. In the most popular form of this technique, protein mixtures are enzymatically digested and the resulting peptides are loaded onto a cation exchange chromatography column for fractionation, either off-line or on-line, via a salt gradient. These fractions are then further separated by reverse phase LC coupled to a mass spectrometer.

To ensure good coverage of the proteome, it is becoming increasingly apparent that a more targeted approach to proteomics is advantageous. In addition, the ability to acquire data in a more results-driven manner is essential. In this work, we use a protein function-based sample simplification step, followed by multiple rounds of MDLC using Specific Mass and Retention Time (SMART) exclusion lists generated from protein identification results, to increase protein coverage and dig deeper into the proteome.

Figure 1. Fatty acid responsive transcription factor (FADR) - DNA complex: transcriptional control of fatty acid metabolism in E. coli.

MDLC in combination with a QSTAR Pulsar quadrupole-time of flight mass spectrometer was used to investigate a subset of E. coli proteins, the nucleic acid binding proteins. These proteins were isolated from an E.
coli lysate using a DNA affinity column. Multiple MDLC runs were performed using sequential SMART exclusion lists created from protein identifications in the previous runs. To assess the relative merit of this approach for identification of low abundance proteins, codon adaptation index (1) values were calculated for the proteins identified after each analysis. Finally, the molecular functions and biological processes of the identified proteins were investigated using the PANTHER Protein Function-Family Browser (2) in the Celera Discovery System (3) to gain insight into the quality of the experimental protocol and provide valuable information about the identified proteins.

Experimental

Sample Preparation: The nucleic acid binding proteins were purified from an E. coli cell lysate using a DNA cellulose column (Sigma). The proteins were sequentially eluted using two salt concentrations: 0.4 M NaCl (low salt elution, weak binding fraction) and 1 M NaCl (high salt elution, strong binding fraction). Each fraction was then desalted, reduced and alkylated with iodoacetamide, and digested with trypsin.

Figure 2. NanoLC analysis of cation exchange fractions (panels shown for the 5, 25, 100 and 250 mM salt steps).
Chromatography: MDLC (multi-dimensional liquid chromatography) was performed using the LC Packings integrated system (Dionex) consisting of a FAMOS micro autosampler, Switchos micro column switching module, and UltiMate micro pump. Each fraction was first loaded onto a Bio-SCX cation exchange trap cartridge (0.5 x 15 mm) then eluted stepwise onto a PepMap™ C18 trap cartridge (0.3 x 5 mm). Salt steps (200 ml each) used were 0, 5, 10, 25, 50, 100, 250 and 1000 mM ammonium acetate in 0.1% formic acid. Finally, peptides were eluted off the reverse phase trap cartridge onto the PepMap™ C18 analytical column (0.075 x 150 mm) using a linear gradient of 5-35% acetonitrile in 0.1% formic acid.

Mass Spectrometry: All LC/MS/MS data were automatically acquired using Information Dependent Acquisition (IDA) on the QSTAR Pulsar LC/MS/MS System. Pro Automate Software was used to automate the acquisition of the MDLC data, the database searching of the MS/MS data, and the generation of the time-filtered exclusion lists for subsequent MDLC runs.

Data Processing: All MS/MS spectra were automatically submitted for database searching using Pro ID Software, which identifies proteins from MS/MS spectra using the Interrogator Database Search algorithm. Either an E. coli specific subset of the NCBI database or the E. coli CDS FASTA file was used for database searching. Codon Adaptation Index calculations were performed in-house using the method of Sharp and Li (1). To gain additional information on proteins of interest, the Celera Discovery System was accessed.

Results

Using Time Filtered Exclusion Lists To Dig Deeper Into Your Sample

SMART (Specific Mass And Retention Time) exclusion filters were used to allow data acquisition from as many unique peptides as possible. SMART exclusion lists were automatically generated from protein identification results using Pro Automate Software (Figure 3).
For these experiments, a complete set of MDLC MS/MS runs was performed and subjected to database searching. Exclusion lists were generated from the high confidence peptide matches (> 80%), so that these peptides were specifically excluded from being sent for MS/MS in subsequent MDLC runs. For the weak nucleic acid binding fraction (low salt elution), a third MDLC run with an exclusion list was acquired to ensure that the majority of the detectable peptides were identified.

Figure 3. Automatic generation of time-filtered exclusion lists in the Pro Automate software.

Using the Protein Score (ProtScore) from Pro Automate (calculated from the peptide evidence for each protein), the data were filtered at the protein level. At a very conservative protein confidence threshold of 95%, 295 unique proteins were identified in the two DNA cellulose column fractions. Of these proteins, 69 were found in both the low salt and high salt elution fractions, with the majority eluting in the low salt fraction (Figure 4).

Figure 4. Proteins identified in each nucleic acid binding protein fraction (Venn diagram: 90 in the high salt elution only, 69 in both, 136 in the low salt elution only).
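The exclusion-list step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Pro Automate implementation: the record layout, the per-salt-step scoping, the retention time half-width and the m/z tolerance are all hypothetical choices. Only the idea of excluding peptides already matched with better than 80% confidence comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideMatch:
    """One database-search hit (hypothetical record layout)."""
    sequence: str
    mz: float            # precursor m/z
    rt_min: float        # retention time in minutes
    salt_step_mM: int    # cation exchange salt step the peptide eluted in
    confidence: float    # peptide confidence in percent


@dataclass(frozen=True)
class ExclusionEntry:
    """A mass plus retention time window to skip in the next run."""
    salt_step_mM: int
    mz: float
    rt_start: float
    rt_end: float


def build_exclusion_list(matches, min_confidence=80.0, rt_half_width=1.0):
    """Keep only high confidence matches and turn each into a time window.

    rt_half_width (minutes) is an assumed value, not one from the source.
    """
    return [
        ExclusionEntry(m.salt_step_mM, m.mz,
                       m.rt_min - rt_half_width, m.rt_min + rt_half_width)
        for m in matches
        if m.confidence > min_confidence
    ]


def is_excluded(entries, salt_step_mM, mz, rt_min, mz_tol=0.1):
    """True if a candidate precursor falls inside any exclusion window.

    mz_tol is an assumed tolerance in m/z units.
    """
    return any(
        e.salt_step_mM == salt_step_mM
        and abs(e.mz - mz) <= mz_tol
        and e.rt_start <= rt_min <= e.rt_end
        for e in entries
    )


# Illustrative values only; these are not measured data.
run1 = [
    PeptideMatch("PEPTIDEA", 620.80, 32.5, 25, 99.0),
    PeptideMatch("PEPTIDEB", 500.00, 10.0, 25, 60.0),
]
exclusions = build_exclusion_list(run1)
```

For a cumulative list, as used for the third run, the entries built from each earlier run would simply be concatenated before the next acquisition.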
In the first MDLC run from the weak binding fraction (low salt elution), 113 proteins corresponding to 537 unique peptides were identified with high confidence (> 95%). Using SMART exclusion lists, the second MDLC run was performed on the same sample and an additional 66 proteins were identified (1457 unique peptides in total). A third MDLC run was then performed using exclusion lists built from the first two MDLC results, yielding an additional 24 unique proteins and 1879 unique peptides in total (Figure 5). In total, 203 proteins were found in the weak binding fraction.

For the strong binding fraction, a similar trend was observed. The first MDLC run (high salt elution) yielded 129 proteins corresponding to 856 unique peptides with high confidence (> 95%). Using specific exclusion lists, the second MDLC run was performed on the same sample and an additional 30 proteins were identified (1493 unique peptides in total).

In this work, a ~20% increase in protein coverage was obtained for each subsequent MDLC run where peptide based exclusion lists were applied. Thus, using multiple rounds of MDLC and applying time-based and peptide-based exclusion lists enabled the identification of more proteins overall and improved the sequence coverage for many of these proteins.

Figure 5. Number of unique peptide sequences (blue) and number of total proteins (red) found in each of the sequential MDLC MS/MS runs from the weak nucleic acid binding protein fraction (x-axis: Run 1; Runs 1,2; Runs 1,2,3).

Identifying More of the Low Abundant Proteins

Figure 6. Effects of multiple rounds of time-based exclusion on the proportion of low abundance proteins found in each run of the weak nucleic acid binding fraction.

Because of the degeneracy of the genetic code, most amino acid residues can be encoded by more than one codon.
In genomes, certain codons are favored by genes despite the availability of other codons that encode the same residue. This tendency of a gene to use specific codons to encode amino acids is called codon bias. The Codon Adaptation Index (CAI) is a measure of codon bias. It uses a reference set of highly expressed genes against which the codons from all other genes are compared (1). In E. coli, lower CAI values are thought to correlate well with proteins that are less abundant. In the third MDLC run from the weak binding fraction, a higher proportion of proteins with lower CAI values is observed (Figure 6, yellow bars), indicating that multiple rounds of acquisition using cumulative exclusion lists are an effective strategy for obtaining MS/MS from peptides originating from lower abundance proteins.
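For readers unfamiliar with the index, the calculation can be sketched as follows. This is a minimal Python sketch following the general definition of Sharp and Li (1), not the in-house code used for this study: each codon gets a relative adaptiveness w (its count in the reference set divided by the count of the most used synonymous codon), and the CAI of a gene is the geometric mean of w over its codons. Function names and the toy reference set are hypothetical; the 0.5 pseudo-count for unobserved codons and the omission of Met, Trp and stop codons are how the method is usually described.

```python
from collections import Counter
from itertools import product
from math import exp, log

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a coding sequence into complete codons (trailing bases dropped)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """w for each codon: its count over the count of the top synonymous codon."""
    counts = Counter(c for g in reference_genes for c in codons(g)
                     if c in CODON_TABLE)
    weights = {}
    for aa in set(CODON_TABLE.values()) - {"*"}:
        synonymous = [c for c, a in CODON_TABLE.items() if a == aa]
        if len(synonymous) == 1:
            continue  # Met and Trp have a single codon and carry no bias
        # Unobserved codons get a pseudo-count of 0.5 so w is never zero.
        x = {c: counts[c] or 0.5 for c in synonymous}
        top = max(x.values())
        for c in synonymous:
            weights[c] = x[c] / top
    return weights


def cai(gene, weights):
    """Geometric mean of w over the scorable codons of a gene."""
    logs = [log(weights[c]) for c in codons(gene) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return exp(sum(logs) / len(logs))


# Toy reference set (hypothetical): CTG used three times, CTA once.
w = relative_adaptiveness(["CTGCTGCTGCTA"])
```

In practice the reference set would be a curated list of highly expressed genes for the organism, and a gene that relies on rarely used codons scores closer to 0 than to 1.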
Linking Protein Identification With Biology Using The Celera Discovery System™

The proprietary PANTHER Protein Classification System (3) organizes proteins into families and subfamilies based upon global sequence similarity, common molecular functions, and participation in common biological processes. Processing of the MS/MS spectra against the E. coli CDS FASTA database using the Pro ID Software allows the gene ontology information to be visualized along with the protein identification results. Biological process and molecular function information can be used to quickly find proteins based upon similar biological attributes. For example, a large proportion of the proteins identified in this study were classified as nucleic acid binding proteins by their molecular functions. Additionally, other interesting protein classes can be rapidly identified, such as transcription factors (Figure 7).

Figure 7. Pro ID Software Protein Summary results sorted by molecular function. Inset: PANTHER gene ontology information from CDS showing the protein family and subfamily, the biological process and the molecular function (small box).

Figure 8. PANTHER molecular functions represented for the proteins identified from all DNA affinity column fractions.

Figure 8 displays a pie chart of the different molecular functions represented by the proteins identified in this sample. As shown, a significant fraction of the proteins (greater than 1/3) are classified as nucleic acid binding proteins, indicating the degree to which the sample preparation strategy was successful. A small percentage of the proteins are also classified as transcription factors, a further indication that the data acquisition strategy was successful for identifying what are typically considered to be low abundance proteins.
Figure 9. MS/MS spectrum of peptide with sequence YLTEQGFQVR from Outer Membrane Protein R (OmpR).

Understanding The Proteins Identified

OmpR, an osmoregulatory DNA-binding protein, is normally expressed in low abundance and has a CAI value of 0.268. In this study, OmpR was identified with 9 unique peptides in the weak binding fraction, resulting in 41% sequence coverage for this 27 kDa protein. Interestingly, 4 out of the 9 peptides were identified in the third MDLC run, further supporting the claim that cumulative exclusion lists enable detection of peptides from lower abundance proteins as well as increasing overall protein coverage.

The PANTHER classifications for OmpR indicate that this protein is a transcription factor (molecular function) and is involved in mRNA transcription (biological process). The Tree Viewer (Figure 10) displays the relationship between the different sequences within a family. The longer the horizontal branch length, the more distant the groups joined by those branches. The OmpR protein belongs to SF20 Transcriptional Regulatory Protein OmpR-Related at the bottom of the tree and is thus more distantly related to many other sequences in the tree relative to sequences in some other subfamilies.

OmpR is part of a two-component regulatory system, in conjunction with EnvZ, for control of the porin proteins OmpC and OmpF (Figure 11). In response to the osmolarity of the medium, EnvZ phosphorylates or stimulates the dephosphorylation of OmpR, which then acts to selectively stimulate or repress expression of OmpC and OmpF, thereby affecting the pore size in the outer membrane.

Figure 10. PANTHER distance trees allow exploration of the relationships between sequences in a particular family, as well as visualization of some of the key information that was used to annotate the families and subfamilies.

Figure 11. The E. coli pathway for the two-component system involving the OmpR transcription factor.
This diagram was obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG): http://www.genome.ad.jp/kegg/.
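The sequence coverage quoted for OmpR above is the percentage of residues in the protein that fall inside at least one identified peptide. A minimal Python sketch of that calculation is given below; the function name and the short protein string are hypothetical (the string is not the OmpR sequence), and only the peptide YLTEQGFQVR is taken from this study.

```python
def sequence_coverage(protein, peptides):
    """Percent of residues covered by at least one peptide (exact matching)."""
    if not protein:
        raise ValueError("empty protein sequence")
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue
        start = protein.find(pep)
        while start != -1:
            # Mark every residue spanned by this occurrence of the peptide.
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Hypothetical 17-residue sequence containing the Figure 9 peptide.
toy_protein = "MKAAYLTEQGFQVRGGK"
coverage = sequence_coverage(toy_protein, ["YLTEQGFQVR", "AAYLTEQ"])
```

Overlapping peptides count each residue only once, which is why adding peptides from later runs raises coverage only where they reach new parts of the sequence.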
Conclusions

To improve the effectiveness of proteome-wide protein identification, previous studies have emphasized reducing sample complexity and increasing chromatographic separation as a means of ensuring good MS/MS coverage of complex mixtures (4,5). Here, we demonstrate a variation of this approach using MDLC and specific time-based exclusion lists generated from protein identification results, in conjunction with a sample simplification step based upon protein function. Depending upon sample complexity, multiple MDLC/MS/MS runs can be performed with cumulative levels of exclusion applied to obtain good MS/MS coverage on peptides in the sample. For the present study, this data acquisition strategy enabled the identification of a greater proportion of lower abundance proteins (as indicated by their codon bias) as well as greater coverage (more peptides identified) per protein. A total of 295 proteins were identified with very high confidence in this experiment from the E. coli sample enriched for nucleic acid binding proteins.

Using the Celera Discovery System, important information can be readily accessed about identified proteins. Performing the database search against the annotated CDS databases allows the gene ontology information (biological process, molecular function) from the PANTHER protein classification system to be automatically imported into the Pro ID Software results. In this study, this information allowed quick assessment of the quality of the sample preparation strategy. Additionally, this information provided a means for rapidly identifying interesting subsets of proteins, such as transcription factors. Many additional tools and information exist within the Celera Discovery System to allow further exploration of interesting proteins.

Acknowledgements

Thanks to Doug Barofsky and Martha Stapels at Oregon State University for the E. coli protein digests.

References

1. Sharp, P.M., and Li, W-H.
The codon adaptation index - a measure of the directional synonymous codon usage bias, and its potential applications (1987) Nucleic Acids Res. 15, 1281-1295.
2. Thomas, P.D., Kejariwal, A., Campbell, M.J., Mi, H., Diemer, K., Guo, N., Ladunga, I., Ulitsky-Lazareva, B., Muruganujan, A., Rabkin, S., Vandergriff, J.A., Doremieux, O. PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification (2003) Nucleic Acids Res. 31, 334-341.
3. Kerlavage, A., Bonazzi, V., di Tommaso, M., Lawrence, C., Li, P., Mayberry, F., Mural, R., Nodell, M., Yandell, M., Zhang, J., and Thomas, P. The Celera Discovery System™ (2002) Nucleic Acids Res. 30, 129-136.
4. Gygi, S.P., Rist, B., Griffin, T.J., Eng, J., Aebersold, R. Proteome Analysis of Low-Abundance Proteins Using Multidimensional Chromatography and Isotope-Coded Affinity Tags (2002) J. Proteome Res. 1, 47-54.
5. Corbin, R.W. et al. Toward a Protein Profile of Escherichia coli: Comparison of its Transcription Profile (2003) PNAS 100, 9232-9237.

AB (Design), Applera, Celera Discovery System, Interrogator and PepMap are trademarks and Applied Biosystems and QSTAR are registered trademarks of Applera Corporation or its subsidiaries in the US and certain other countries.