NRPS Code Project Summary


Nick ill

A. Data formatting/trimming

The data used in this project was obtained from a paper detailing a machine-learning approach to the prediction of the amino acid encoded by a given A-domain in non-ribosomal peptide synthesis. The training data used for this approach was taken from that paper, in .xls format. All unnatural amino acid residues were then stripped, leaving 19 amino-acid-encoding sequence types (each type contained between 1 and roughly 30 sequences). No methionine-encoding A-domains were present in the training data, which is why there are only 19 amino acids rather than 20. The resulting trimmed-down .xls file was converted to .csv format and read into the Python script as a dictionary of lists (see the .py file). All further data manipulation and analysis was performed in Python (see the .py file).

B. High Level Overview

The first aspect of the project was essentially visualizing the information content (IC) of the A-domains, to more easily form a hypothesis about whether a method simpler than the machine-learning approach (reported by Rottig et al.) could be used to accomplish prediction based on a signature sequence of an A-domain. This visualization was done with IC bar charts, through the following steps (a code sketch of these steps follows the list):

1. Read in the CSV to get the sequence data into a usable dictionary format (readseq).
2. Construct a matrix of the frequency of occurrence of each amino acid at each position in each set of A-domain sequences corresponding to one amino-acid recognition (frequency).
3. Use the frequencies of each amino acid at each position in each set of A-domain sequences to construct a consensus sequence for each A-domain type (consensus).

4. Calculate the entropy at each position of each consensus (Entropy).
5. Calculate the information content at each position of each consensus (IC).
6. Plot bar charts of sequence position vs. information content (plotbarchart).
7. Plot bar charts for all A-domain sequences, regardless of amino-acid specificity, to see the overall conservation.
8. Make a weighted Hamming distance function, in which the cost of a mismatch scales with the information content at a given position (Weightedhamming).
9. Take an input sequence and predict its amino-acid specificity by minimizing its weighted Hamming distance against all possible (18 other) amino-acid A-domain consensus sequences, and return the amino acid that minimizes the distance. Furthermore, check whether the sequence set of the returned amino acid actually contains a sequence identical to the input, and output match or mismatch to indicate the success of the predictor.
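
As a rough, minimal sketch of steps 1 through 5 (not the actual .py file), the computations could look like the following in Python. The CSV layout assumed here (one "specificity,sequence" row per signature) and the lowercase function names are illustrative assumptions.

    # Minimal sketch of steps 1-5; names loosely follow the list above.
    import csv
    import math
    from collections import defaultdict

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def readseq(path):
        """Step 1: read the CSV into {specificity: [signature sequences]}."""
        seqs = defaultdict(list)
        with open(path, newline="") as handle:
            for specificity, sequence in csv.reader(handle):
                seqs[specificity].append(sequence)
        return seqs

    def frequency(sequences):
        """Step 2: per-position amino-acid frequencies for one set of
        equal-length signatures, as a list of {residue: frequency} dicts."""
        n = len(sequences)
        return [{aa: column.count(aa) / n for aa in AMINO_ACIDS}
                for column in zip(*sequences)]

    def consensus(freqs):
        """Step 3: the most frequent residue at each position."""
        return "".join(max(pos, key=pos.get) for pos in freqs)

    def entropy(pos):
        """Step 4: Shannon entropy (in bits) of one position's frequencies."""
        return -sum(p * math.log2(p) for p in pos.values() if p > 0)

    def information_content(freqs):
        """Step 5: information content per position, log2(20) minus entropy."""
        return [math.log2(20) - entropy(pos) for pos in freqs]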

C. Discussion

With both the 34-mer inputs and the 10-mer inputs, the predictor is correct a bit over half of the time (Tables 1 and 2). Given the low information content of the sequence sets that are predicted poorly, this makes some sense (Figures 1 and 2). Typically, A-domain consensuses with high information content are predicted well (Arginine, Alanine, Glutamine), although there are certainly outliers to this trend. While Serine and Threonine are highly conserved amongst many A-domains, each is predicted correctly in only one of the two prediction models (one in the 10-mer model and one in the 34-mer model). A very odd successful prediction is that of valine. While the only well-conserved residues in the valine set (of which we have 28 examples; Figure 2) are residues that are well conserved across all A-domains, valine is still predicted correctly by both the 10-mer and 34-mer predictors. More complex weighting in the Hamming distance function might improve the prediction power enough to include Serine and Threonine, and to no longer predict valine. While it is nice that valine is predicted well, this is likely only an anomaly and would not hold if the predictor were applied to larger data sets.

The most obvious problem with the current predictor is the need for extensive preparation of the data beforehand. The 34 residues taken as the signature sequence in half of the predictions are selected from an 8 Å radius ball centered in the amino-acid binding site. Thus, the ordering of the amino acids does not reflect the ordering present in the primary structure. Furthermore, the assumption that these residues will still lie within the 8 Å radius ball in all A-domains relies on the assumption that the conformation of every A-domain is very similar to that of the only one for which we have a crystal structure (Marahiel et al.). This is an even more obvious problem when we take the 10-mer sequences as the signatures, since any one of these 10 residues lying outside the recognition site will unduly affect the prediction.

The weighted Hamming function could be improved with a more detailed look at the logos (Figures 3 and 4) of each A-domain sequence. With an understanding of which residues in the A-domain are crucial to the recognition of a given amino acid, the Hamming function could be weighted accordingly.
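
To make the weighting concrete, a minimal sketch of steps 8 and 9 is shown below, using the per-position IC values from the earlier sketch as the mismatch costs. The per-residue refinement suggested above would amount to scaling individual entries of the weight list; the function and argument names are illustrative, not those of the actual .py file.

    def weighted_hamming(query, consensus_seq, weights):
        """Step 8: Hamming distance in which a mismatch at position i
        costs weights[i] (here, the information content at that position)."""
        return sum(w for q, c, w in zip(query, consensus_seq, weights) if q != c)

    def predict(query, consensuses, weights_by_aa):
        """Step 9: return the specificity whose consensus minimizes the
        weighted Hamming distance to the query signature.
        consensuses:   {specificity: consensus sequence}
        weights_by_aa: {specificity: per-position weight (IC) list}"""
        return min(consensuses,
                   key=lambda aa: weighted_hamming(query, consensuses[aa],
                                                   weights_by_aa[aa]))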

Figure 1. Information content by position for all 19 characterized amino-acid A-domains, and a reference information content for all sequences (with length 10), regardless of A-domain specificity.
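
Bar charts like the one in Figure 1 correspond to step 6 (plotbarchart). A minimal version, assuming matplotlib and the per-position IC list from the earlier sketch, might look like this:

    import matplotlib.pyplot as plt

    def plotbarchart(ic_values, title):
        """Bar chart of signature position vs. information content (bits)."""
        positions = range(1, len(ic_values) + 1)
        plt.bar(positions, ic_values)
        plt.xlabel("Position in signature sequence")
        plt.ylabel("Information content (bits)")
        plt.title(title)
        plt.show()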

Figure 2. Information content by position for all 19 characterized amino-acid A-domains, and a reference information content for all sequences (length 32), regardless of A-domain specificity.

Table 1. Prediction of amino-acid specificity for a 34-mer sequence randomly selected from each known sequence set. Match indicates prediction success.

34-mer sequence         Known   Predicted   Full Match?   34-mer Match?
NPFLSMPSLFCENPE         ala     ala         Y             Y
FSESCIENPEEI            arg     arg         Y             Y
SFLKIENEPECC            asn     ala         N             N
FSFLSPKLLEINHPEII       asp     asp         Y             Y
LSLSFHFEQSCEINPESIHK    cys     ala         N             N
LLFSQQLENPECSS          gln     gln         Y             Y
LLFSKQMINPECCS          glu     val         N             N
FMFISLELQLENLPEISF      gly     gly         Y             Y
NSFSFFILFEIHPES         ile     ala         N             N
LFSIEPFLLNNPENS         leu     leu         Y             Y
HFQPLEFNCPEE            lys     val         N             N
QFESLINLECM             phe     phe         Y             Y
LFEFCQESSIEHNHPSEHS     pro     pro         Y             Y
RMFSEHFMCSEHNLPE        ser     ser         Y             Y
LHQHFFSENQIFEINMIEH     thr     val         N             N
LRFSMPMSHNEES           trp     val         N             N
RFFCSLIFENEPENSI        tyr     ala         N             N
LNFSFELIINPENFSC        val     val         Y             Y

Table 2. Prediction of amino-acid specificity for a 10-mer sequence randomly selected from each known sequence set. Match indicates prediction success.

10-mer sequence   Known   Predicted   Full Match?   10-mer Match?
LMLC-             ala     ala         Y             Y
IE-               arg     arg         Y             Y
LKE-              asn     ala         N             N
LKHI-             asp     ala         N             N
HESI-             cys     ala         N             N
QL-               gln     gln         Y             Y
KI-               glu     ala         N             N
ILQLLI-           gly     gly         Y             Y
FFL-              ile     ala         N             N
FLN-              leu     leu         Y             Y
QCE-              lys     lys         Y             Y
IC-               phe     phe         Y             Y
QIH-              pro     pro         Y             Y
HMSL-             ser     ala         N             N
FNIMH-            thr     thr         Y             Y
E-                trp     ala         N             N
LE-               tyr     ala         N             N
FIF-              val     val         Y             Y
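
Rows like those in Tables 1 and 2 could be produced by a loop along the following lines, which draws one signature at random from each specificity's set and checks the prediction against the known answer. This builds on the sketches above and is only an illustration; the two match columns in the tables are computed by the actual .py file, and the single flag below does not reproduce that distinction.

    import random

    def evaluate(seqs):
        """seqs: {specificity: [signature sequences]}, as from readseq."""
        freqs = {aa: frequency(s) for aa, s in seqs.items()}
        consensuses = {aa: consensus(f) for aa, f in freqs.items()}
        ics = {aa: information_content(f) for aa, f in freqs.items()}
        for known in sorted(seqs):
            query = random.choice(seqs[known])
            predicted = predict(query, consensuses, ics)
            match = "Y" if predicted == known else "N"
            print(f"{query}\t{known}\t{predicted}\t{match}")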

[Figure 3: "All 10-mers Logo", a sequence logo (y-axis in bits) over the 10 signature positions, generated with WebLogo 3.4.]

[Figure 4: "All 32-mers Logo", a sequence logo (y-axis in bits) over the 32 signature positions, generated with WebLogo 3.4.]
8 M L H I R S P H Q S K Q M LS F F L E M S C N I HFI M F M E LS H Q C F C M HIH CN I S F bits N S FL H S 5 L I F S L F 10 ll 32-mers Logo N PFL F F L Q C S I EM CS I L 15 SE20 H I L HNN 25 S SN C SE IP 30 S I S eblogo 3.4