Determining presence/absence threshold for your dataset

In PanCGHweb there are two ways to determine the presence/absence calling threshold. One is based on receiver operating characteristic (ROC) curves generated for microarray data of reference strains; the other is based on histograms of the presence scores of ortholog groups (OGs) of reference strains.

Note: this guide uses publicly available data that was also described in the manuscript describing the PanCGH algorithm (PMID: ). However, these steps can be applied to any dataset. For the signal distribution to be comparable across arrays, make sure that the arrays are within-array and between-array normalized (see Fig. 6).

Finding the optimal presence/absence calling threshold using ROC curves

ROC curves enable a user to define an optimal presence/absence threshold that takes into account the tradeoff between the false-positive rate and the true-positive rate.

Generating ROC curves and determining an optimal PanCGH presence/absence threshold

ROC curves can be generated for your dataset by following the steps below.

1. Select only the reference strains used for the array probe design from the NCBI GenBank drop-down list (see Fig. 1.A). In demo run mode you can select the indicated 4 Lactococcus lactis strains.
2. Upload the microarray probe sequences as a FASTA file (see Fig. 1.B). If you are running the program in demo mode, skip this step.
3. Click the Upload File(s) button to proceed to uploading the array files.
4. Upload the array files one by one. These files should contain probe signals for the strains selected in step 1 (see Fig. 2). Click Proceed to go to the parameter settings page.
5. On the parameter settings page, the option Presence/absence calling threshold determination has a default value of Predefined. Change it to Optimal (see Fig. 3.A). The genotype calling process starts once you click Proceed.
6. In the run phase, the genotype calling progress is shown (see Fig. 4). ROC curves are generated along with other plots after the genotype calling method has finished. Click the ROC curves link (Fig. 5.B) to open a page that shows figures with ROC curves (Fig. 7).
7. Each plot shows the ROC curves of all reference strains based on the data of one uploaded array. For example, Fig. 8 shows the ROC curves of the 4 selected reference strains based on an array hybridized with IL1403. The figure legend (lower right corner in Fig. 8) shows the GenBank accession ID of each reference strain. NC_ is the GenBank accession ID of L. lactis IL1403. In this example, a threshold around 5.5 gives a good balance between the false-positive and true-positive rates. Figures 9 to 11 show the ROC curves of the reference strains for the three other arrays.
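To make the tradeoff behind a ROC curve concrete, the following minimal Python sketch computes the false-positive and true-positive rates for a few candidate thresholds and picks the one with the largest difference between the two (Youden's J statistic). It is an illustration only: the scores and labels are made-up toy values, the function names are hypothetical, and the selection criterion is an assumption, since this guide does not state how PanCGHweb itself chooses the optimal threshold.

```python
def roc_points(scores, is_present, thresholds):
    """Return one (threshold, false-positive rate, true-positive rate) tuple per threshold.

    scores      -- presence scores, one per ortholog group (toy values here)
    is_present  -- True if the genome annotation says the group is present
    """
    positives = sum(1 for p in is_present if p)
    negatives = len(is_present) - positives
    points = []
    for t in thresholds:
        # A group is called "present" when its score reaches the threshold.
        tp = sum(1 for s, p in zip(scores, is_present) if p and s >= t)
        fp = sum(1 for s, p in zip(scores, is_present) if not p and s >= t)
        points.append((t, fp / negatives, tp / positives))
    return points


def best_threshold(points):
    """Pick the threshold maximizing TPR - FPR (Youden's J); an assumed criterion."""
    return max(points, key=lambda pt: pt[2] - pt[1])[0]


# Toy data, not taken from the PanCGH dataset.
scores = [3.1, 4.0, 4.8, 5.1, 5.9, 6.4, 7.0, 7.7]
is_present = [False, False, False, False, True, True, True, True]
candidates = [4.5, 5.0, 5.5, 6.0, 6.5]

points = roc_points(scores, is_present, candidates)
for threshold, fpr, tpr in points:
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
print("selected threshold:", best_threshold(points))
```

With these toy values the two groups separate cleanly, so one threshold reaches a true-positive rate of 1 with a false-positive rate of 0. Real array data will overlap, and the chosen threshold then reflects a compromise.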

Finding the optimal presence/absence calling threshold using histograms

The presence/absence calling threshold can also be determined using histograms, which are created by following the steps described below.

1. Select only reference strains from the GenBank list shown in Fig. 1.A. Do not upload any sequence data other than the probe sequences (see Fig. 1.B).
2. Only upload array files for arrays to which the selected strains (Fig. 1.A) were hybridized, and set the number of array files accordingly (see Fig. 1.C; in this example 4). Click the Upload File(s) button to start uploading sequence files.
3. Upload the array files to which the selected strains were hybridized (see Fig. 2). Click Proceed to go to the parameter settings page.
4. In the first run, use the default settings (see Fig. 3). After inspecting the histograms (see below), you can determine an optimal presence/absence calling threshold and use this value in the next run.
5. Click the Histograms link (see Fig. 5.C) to open a histogram for each reference strain. For each strain, select the plot that is based on the array to which that strain was hybridized (see Fig. 12). For instance, NC_ is the GenBank accession ID of the strain Lactococcus lactis IL1403, and the corresponding array name used for this strain was IL1403. Opening that figure shows a histogram as in Fig. 13, which gives the distribution of the presence scores of OGs. The genome annotation tells us which genes are present or absent in L. lactis IL1403. Therefore, OGs are divided into 2 groups: OGs containing at least one gene from IL1403 (black) and OGs with no gene from IL1403 (grey). From this plot it can be concluded that an optimal presence/absence threshold should lie between 5.2 and 5.8. However, a pangenome array does not target the genes of a single strain only, so the optimal presence/absence threshold should be determined by also considering the values for the other reference strains (see Figures 14 to 16).
6. Choose the presence/absence threshold that is optimal for all reference strains. Based on these histograms, a threshold of 5.5 is optimal for the 4 reference strains.
7. Restart the program (see Fig. 1.D).
8. Repeat steps 1 to 3, but in step 4 use the threshold value you determined in step 6 (see Fig. 3.B).
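The reasoning in steps 5 and 6 can be sketched in Python: bin the presence scores of the two OG groups for each reference strain, then look for the single threshold that misclassifies the fewest OGs summed over all strains. This is a rough illustration under stated assumptions. The strain names, scores, bin edges, and function names are hypothetical, and counting misclassified OGs is only one possible way to formalize "optimal for all reference strains"; the guide itself relies on visual inspection of the histograms.

```python
def histogram(values, lo, hi, width):
    """Count values per bin of the given width over the interval [lo, hi)."""
    n_bins = int(round((hi - lo) / width))
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            # min() guards against floating-point rounding at the upper edge.
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts


def misclassified(present_scores, absent_scores, threshold):
    """OGs with a gene scored below the threshold plus OGs without a gene scored at or above it."""
    missed = sum(1 for s in present_scores if s < threshold)
    false_calls = sum(1 for s in absent_scores if s >= threshold)
    return missed + false_calls


def shared_threshold(strains, candidates):
    """Return the candidate threshold with the fewest misclassified OGs over all strains."""
    return min(
        candidates,
        key=lambda t: sum(misclassified(p, a, t) for p, a in strains.values()),
    )


# Toy data: (scores of OGs with a gene from the strain, scores of OGs without one).
strains = {
    "strain_1": ([5.9, 6.4, 7.0, 7.7], [3.1, 4.0, 4.8, 5.1]),
    "strain_2": ([5.6, 6.1, 6.8, 7.2], [3.5, 4.2, 4.9, 5.3]),
}

for name, (present, absent) in strains.items():
    print(name, "with gene:", histogram(present, 3.0, 8.0, 1.0))
    print(name, "no gene:  ", histogram(absent, 3.0, 8.0, 1.0))

print("shared threshold:", shared_threshold(strains, [5.0, 5.5, 6.0]))
```

Summing the errors over all strains mirrors the point made in step 5: a threshold that looks ideal for one strain is only useful if it also separates the two groups for the other reference strains.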

Fig. 1. Start page of PanCGHweb (panels A to E).

Fig. 2. Upload microarray data.

Fig. 3. Parameter settings page (panels A and B).

Fig. 4. Run phase of PanCGHweb.

Fig. 5. Results page (panels A to C).

Fig. 6. Box and whisker plot of all probe signals.
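A box and whisker plot like Fig. 6 summarizes each array by its minimum, quartiles, and maximum. As a rough numerical counterpart, the Python sketch below computes that five-number summary per array and checks whether the array medians lie close together, which is one simple sign that the signal distributions are comparable. The array names, signal values, tolerance, and function names are all hypothetical, and the check is not a substitute for proper within-array and between-array normalization.

```python
import statistics


def five_number_summary(signals):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, q2, q3 = statistics.quantiles(signals, n=4, method="inclusive")
    return (min(signals), q1, q2, q3, max(signals))


def medians_comparable(arrays, tolerance):
    """True if the medians of all arrays differ by at most the (assumed) tolerance."""
    medians = [statistics.median(signals) for signals in arrays.values()]
    return max(medians) - min(medians) <= tolerance


# Toy probe signals, not taken from the PanCGH dataset.
arrays = {
    "array_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "array_b": [1.1, 2.1, 3.1, 4.1, 5.1],
}

for name, signals in arrays.items():
    print(name, five_number_summary(signals))
print("medians comparable:", medians_comparable(arrays, tolerance=0.2))
```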

Fig. 7. Page showing figures with ROC curves of reference strains.

Fig. 8. ROC curves of reference strains for the array to which strain IL1403 was hybridized.

Fig. 9. ROC curves of reference strains for the array to which strain KF147 was hybridized.

Fig. 10. ROC curves of reference strains for the array to which strain MG1363 was hybridized.

Fig. 11. ROC curves of reference strains for the array to which strain SK11 was hybridized.

Fig. 12. Page showing histograms of reference strains.

Fig. 13. Distribution of OGs containing at least one gene from IL1403 (black) and OGs containing no gene from IL1403 (grey).

Fig. 14. Distribution of OGs containing at least one gene from KF147 (black) and OGs containing no gene from KF147 (grey).

Fig. 15. Distribution of OGs containing at least one gene from MG1363 (black) and OGs containing no gene from MG1363 (grey).

Fig. 16. Distribution of OGs containing at least one gene from SK11 (black) and OGs containing no gene from SK11 (grey).