Noah McCarthy
5 years ago

1 Open SeqMonk
Launch SeqMonk. The first thing you will see is the opening page. SeqMonk scans your copy and makes sure everything is in order, as indicated by the green check marks.
SeqMonk Analysis Page 1

2 Create New Project
To use SeqMonk, you need to create a new project and choose a genome related to your experiment.
1. Under the top menu, go to File and select New project
2. When prompted to select a genome, choose GRCh37 under the Homo sapiens folder
3. Click OK to proceed

3 SeqMonk First Look
The SeqMonk layout is divided into 4 panels: Quick Access Panel, List Panel, Chromosome Panel, and Track Panel.
Quick Access Panel: A series of buttons that allow quick access to various layout, navigation and search functions
List Panel: A listing of all the imported and created files
Chromosome Panel: A quick bird's-eye view of the data signal on the chromosomes
Track Panel: A detailed view of the annotation and data tracks

4 Import BAM
Importing BAM files into SeqMonk
1. To import data into SeqMonk, go to File, choose Import Data, then BAM/SAM...
2. Navigate to the location of the BAM files, highlight and select all BAM files (.bam), and click Open
On the new Import Options window, follow these instructions:
3. Min mapping quality:
4. Data Type: Single End
5. Extend reads by (bp): 50
6. Click Import to start importing the files

5 Import In Progress
BAM files are large; please allow some time for the import to finish.

6 Mitochondrial Genome Was Not Imported
At the end of the importing process, SeqMonk will report that it did not import the mitochondrial chromosome data. That is OK; click Close to continue.
Note: Different software packages name the mitochondrial chromosome differently. In this case, SeqMonk expects the mitochondrial chromosome to be named "M", but our BAM files name it "chrm". SeqMonk is therefore unable to import the mitochondrial reads.
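
The naming mismatch described in the note can be illustrated with a small lookup. This is only a sketch of the idea, not something SeqMonk provides: the function name normalize_chrom and the alias table are hypothetical, and only the "chrm" to "M" pair comes from the note above.

```python
# Hypothetical alias table: names found in the BAM files -> names the genome expects.
# Only the mitochondrial pair is taken from the tutorial; anything else is unchanged.
CHROM_ALIASES = {"chrm": "M"}


def normalize_chrom(name: str) -> str:
    """Return the expected chromosome name for a name found in a BAM/SAM file."""
    # Lower-casing makes the lookup tolerant of "chrM" vs "chrm".
    return CHROM_ALIASES.get(name.lower(), name)


for found in ["chrM", "chrm", "chr4"]:
    print(found, "->", normalize_chrom(found))
```

Names that are not in the alias table pass through untouched, so only the mitochondrial reads would be affected.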

7 What is "Define Probe"?
Read quantitation is a 2-step process: Define Probe and Quantitation.
Define Probe
1. A Probe is a predefined region on the genome. We can use many different features to define Probes: gene, mRNA, or CDS.
Quantitation
Quantitation is the process of counting the reads within the Probe region.
2. Define Probe by gene/mRNA. Here a Probe is defined using the gene/mRNA region, and the read quantitation is reported for this region.
Note: When mRNA is used to define a Probe, the algorithm includes only reads in exons, not introns. If gene is used, reads in both exons and introns are included.
3. Define Probe by CDS. Here a Probe is defined using the CDS region, and the read quantitation is reported for this region.
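
To make the gene vs. mRNA distinction concrete, here is a minimal sketch that counts reads over a gene-wide Probe and over an exon-only Probe. The coordinates, the reads, and the function names are invented for illustration, and counting any overlapping read is an assumption rather than a description of SeqMonk's internal algorithm.

```python
def overlaps(read_start, read_end, region_start, region_end):
    """True if the read and the region share at least one base."""
    return read_start <= region_end and read_end >= region_start


def count_reads(reads, regions):
    """Count reads that overlap at least one of the given regions."""
    return sum(
        1
        for read_start, read_end in reads
        if any(overlaps(read_start, read_end, s, e) for s, e in regions)
    )


# Made-up example: one gene with two exons and an intron in between.
gene = [(100, 900)]                # gene Probe: exons and introns
exons = [(100, 300), (700, 900)]   # mRNA Probe: exons only
reads = [(120, 155), (400, 435), (720, 755), (950, 985)]

gene_count = count_reads(reads, gene)
mrna_count = count_reads(reads, exons)
print(gene_count, mrna_count)  # 3 2
```

The read at 400-435 falls in the intron, so it is counted for the gene Probe but not for the mRNA Probe.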

8 Define Probe RNA-seq Pipeline
To quantify the reads for an RNA-seq experiment, we will use a Quantitation Pipeline approach.
1. To start the quantitation pipeline, go to Data, then select Quantitation Pipeline
A new Define Quantitation window appears with more options. Please choose:
2. Select the RNA-Seq quantitation pipeline option
3. Transcript features: mRNA
4. Library type: Non-strand specific
5. Merge transcript isoforms: check
6. Log transform: check
7. Apply transcript length correction: check
8. Click Run Pipeline to continue

9 Result of Probe Definition
After read quantitation, 31,017 Probes were defined. They are shown on the List Panel, under Probe Lists.

10 QC Inspection of Reads
We will do a visual inspection of the imported samples.
1. At the Chromosome Panel, use your mouse to highlight the leftmost region of Chromosome 4
2. Careful examination reveals that sample ABC_Ly3.bam is particularly noisy, with reads scattered all over the region
Based on this assessment, we have decided to remove sample ABC_Ly3.bam.

11 Remove Bad Sample
1. To remove a sample, go to Data, and select Edit Data Sets...
2. On the new Edit DataSets... window, select the bad sample ABC_Ly3.bam
3. Click Delete Dataset to remove the sample from the project

12 Create Replicate Dataset: Step 1
Next, we will group samples into 2 replicate sets: ABC and GCB.
1. To group samples into replicate sets, go to Data, and choose Edit Replicate Sets...
2. On the Edit Replicate Set... window, click Add New Replicate Set to add the first replicate set
3. We will name the first replicate set ABC
See the next step for how to assign samples to each replicate set.

13 Create Replicate Dataset: Step 2
Assigning samples to each replicate set
1. Highlight to select the ABC replicate set
2. Highlight to select all the ABC samples (use the Shift key to make multiple selections)
3. Click Add to assign these samples to the ABC replicate set
4. Do the same for the GCB replicate set

14 Add Rep Track to Track Panel
We will add the newly created Replicate Sets to the Track Panel.
1. To add a data track, go to View, and select Set Data Tracks...
2. In the new Select Data Track window, highlight both the ABC and GCB Replicate Sets
3. Click Add to add these data to the Track Panel
4. The window shows that the new data has been added
Note: Examine the Track Panel, where the replicate sets ABC and GCB have been added to the bottom of the tracks.

15 Quick Access Panel: Positive and Negative Scale
1. Positive and Negative Scale: Show both the positive and negative scale of the signal intensity
2. Positive Scale: Show only the positive scale of the signal intensity
FYI: Why are there negative values? The quantitation normalizes the sum of reads by dividing it by the length of the Probe (or mRNA), so it can produce a value that is less than 1. Taking the logarithm (base 2) of a value less than 1 gives a negative value.
Example: Given a Probe 2000 base pairs in length with 20 reads mapped to it, the intensity value would be:
intensity = log2(20/2000) = log2(0.01) ≈ -6.64
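
The arithmetic in the example can be checked with a few lines of Python. The function name probe_intensity is made up for this sketch, and it reproduces only the simplified formula from this page, not any further corrections the pipeline may apply.

```python
import math


def probe_intensity(read_count, probe_length_bp):
    """Log2 of reads per base pair, as in the worked example on this page."""
    return math.log2(read_count / probe_length_bp)


value = probe_intensity(20, 2000)
print(round(value, 2))  # -6.64
```

Any Probe with fewer reads than base pairs gives a ratio below 1 and therefore a negative intensity.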

16 Quick Access Panel: Dynamic vs. Static Data Colors
1. Dynamic Data Colors: When Dynamic Data Colors is used, the Probe Quantitation bars change color according to the amount of reads found within the Probes
2. Static Data Colors: When Static Data Colors is used, the color of the Probe Quantitation bars remains constant

17 Quick Access Panel: Show Probe and Reads
1. Show Reads Only: Show only the read distribution
2. Show Probe Quantitation Only: Show only the Probe Quantitation bars
3. Show Both Reads and Probe Quantitation: Show the read distribution and the Probe Quantitation bars together

18 Quick Access Panel: Read Density Range
1. Low Read Density: Display the read distribution in the LOW density setting
2. Medium Read Density: Display the read distribution in the MEDIUM density setting
3. High Read Density: Display the read distribution in the HIGH density setting

19 Quick Access Panel: Combine and Split Packed Reads
1. Combine Packed Reads: Display the read distribution with forward and reverse strand reads mixed together
2. Split Packed Reads: Display the read distribution for forward and reverse strand reads separately (forward on top [red], reverse on the bottom [blue])

20 Quick Access Panel: Change Annotation and Data Tracks
1. Change Annotation Tracks: Activate to add, remove or organize the Annotation Tracks
2. Change Data Tracks: Activate to add, remove or organize the Data Tracks

21 Plot Probe Value Histogram
Plot the histogram of the overall Probe quantitation values.
1. Go to Plots, then select Probe Value Histogram
2. Adjust the Division level for a more granular view of the signal
Note: The Probe Value Histogram gives us a sense of the distribution of positive vs. negative probe (mRNA in this case) quantitation values. Here, we see that the count of negative probe values is slightly higher than that of positive ones.

22 Plot Read Length Histogram
Plot the histogram of the overall read length.
1. Go to Plots, then select Read Length Histogram
2. The plot shows that all reads have the same length: 86 nucleotides
Note: The original read length is 36. Recall that during the Import BAM step, we extended the reads by 50 bp (see Page 4).

23 Plot Probe Length Histogram
Plot the histogram of the Probe lengths.
1. Go to Plots, then select Probe Length Histogram
2. Most Probes (mRNA in this case) are relatively short
3. The probe length distribution is more apparent when the plot is set to Log scale

24 Plot Correlation Matrix
Plot the Correlation Matrix for all data tracks.
1. Go to Plots, then select Correlation Matrix...
2. The Correlation Matrix shows that samples in the same group (ABC or GCB) have a higher correlation coefficient (>0.9), although the correlation between samples from different groups is not much lower (~0.8). As in a microarray experiment, we do not expect between-group differences for most Probes.
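
For readers who want to see what a correlation matrix contains, the sketch below computes pairwise Pearson correlation coefficients between samples. Using Pearson is an assumption (this page does not say which coefficient SeqMonk reports), and the sample names and values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_matrix(samples):
    """Return {sample: {sample: r}} for every pair of samples."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}


# Invented probe values for three hypothetical samples.
samples = {
    "ABC_1": [1.0, 2.0, 3.0, 4.0],
    "ABC_2": [1.1, 2.1, 2.9, 4.2],
    "GCB_1": [2.0, 1.5, 3.5, 3.0],
}
matrix = correlation_matrix(samples)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

The diagonal is always 1, and the matrix is symmetric, so only one triangle needs to be inspected.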

25 Plot BoxWhisker Plot
Plot a BoxWhisker Plot to assess the overall distribution of each individual sample.
1. Go to Plots, then select Box Whisker Plot, followed by Visible Data Stores
2. The BoxWhisker Plot shows a very even distribution among the samples, which indicates that the normalization process was appropriate

26 Plot Scatter Plot
Plot a Scatter Plot to assess the relationship between the two replicate sets.
1. Go to Plots, then select Scatter Plot
2. On the new window, plot ABC vs. GCB
3. Mouse over each point to see its gene symbol

27 Plot MA Plot
Plot an MA Plot to show how well the normalization works.
1. Go to Plots, then MA Plot...
2. The MA Plot shows the data centered horizontally on the zero level, which indicates a successful normalization
Note: The MA Plot shows the difference vs. the average between ABC and GCB. The difference is plotted on the Y-axis, and the average on the X-axis. What we want to see is that the same difference is exhibited throughout the data range.
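
The two axes of the MA Plot can be computed directly from the quantitated values. The sketch below assumes the values are already log transformed (as selected in the quantitation pipeline), so the difference is a plain subtraction; the function name and the numbers are illustrative only.

```python
def ma_values(abc, gcb):
    """Return (difference, average) pairs for two lists of log-scale values."""
    return [(a - b, (a + b) / 2) for a, b in zip(abc, gcb)]


# Invented log2 quantitation values for four probes.
abc = [2.0, 4.0, -1.0, 0.5]
gcb = [1.0, 4.0, -1.5, 1.5]

for difference, average in ma_values(abc, gcb):
    print(f"M (Y-axis) = {difference:+.2f}, A (X-axis) = {average:.2f}")
```

In a well-normalized data set, the differences scatter around zero across the whole range of averages.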

28 Statistical Test & FDR
Perform a statistical test to identify genes that are significantly different between the two replicate sets: ABC vs. GCB.
1. Go to Filtering, then select Filter by Statistical Test, followed by Intensity Difference...
On the new window, do the following:
2. On From Data Store / Group, select ABC
3. On To Data Store / Group, select GCB
4. On P-value must be below =
5. On Apply Multiple Testing Correction: check
6. Click Run Filter
7. On the new window Found XXX probes, give the gene list a meaningful name
8. The gene list will show up on the List Panel, under Probe Lists. In this case, we found 748 statistically significant genes.
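
Multiple testing correction adjusts each p-value for the number of tests performed. The sketch below implements the Benjamini-Hochberg procedure, a widely used FDR method; whether SeqMonk uses exactly this procedure is an assumption here, and the p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # invented raw p-values
corrected = benjamini_hochberg(raw)
print([round(p, 3) for p in corrected])
```

A probe passes the filter only if its corrected p-value is below the chosen threshold, which is why fewer probes survive once the correction is checked.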

29 Annotate Significant Gene List
For the newly created gene list, we will next perform annotation to give biological meaning to the list.
1. Go to Reports, then select Annotated Probe Report...
On the new Annotated Probe Report Options window:
2. On Annotate with, select overlapping and gene
3. Set Exclude on unannotated probes
4. Click OK to proceed
Note: We did not use mRNA as the annotation choice here, because it would return information for every gene isoform. Instead, we have chosen gene, which collapses all the isoforms into a single, easy-to-handle entry.

30 Examine the Significant Table
The annotated table contains all the biological information about the gene list. The table can be sorted using the Diff p-value column to further refine the list. Note that this column holds the FDR (False Discovery Rate) corrected p-value. The table can be exported as a text file and manipulated further in Microsoft Excel.

31 ChIP-seq Analysis Strategy
A ChIP-seq experiment is designed to identify protein binding sites on the genome. In this case, the authors use ChIP-seq to locate the binding sites of the STAT3 protein, a Transcription Factor (TF). A TF binds to the region upstream of a Transcription Start Site (TSS) and activates the expression of that gene. Sometimes, a TF binds to other regions of the gene, such as inside or downstream of the gene boundary.
To identify the potential STAT3-regulated genes, we have devised a strategy to Define Probes around the TF binding sites and quantitate the RNA-seq reads in the defined Probes. The strategy is an attempt to capture the gene expression signal surrounding the TF binding site. We have arbitrarily picked 2000 base pairs upstream and downstream of the TF binding site to define our Probes for read quantitation purposes.

32 ChIP-seq Define Probe Caveat
There are some caveats to using our strategy to identify STAT3-regulated genes. First, the defined Probe region might include more than one gene, which can introduce complications (top figure). Second, the defined Probe region might not be large enough to capture the full extent of the gene (bottom figure).

33 Import ChIPSeq Coordinates
Before we can use the TF binding sites derived from ChIP-seq to define our Probes, we need to import the coordinates of these binding sites.
1. Go to File, then select Import Annotation, followed by Text (Generic)
2. Locate and select the TF binding site coordinates file: STAT3_ChIPSeq_Genes.txt
3. Click Open to import the file

34 Set ChIPSeq Coordinates
To import a generic table, SeqMonk requires us to explicitly identify the columns.
1. In Start at Row, select 1, since the data starts from row number one
2. In Chr Col (Chromosome Column), select 2 for the chromosome column
3. In Start Col (start of genomic region), select 3 for the beginning of the genomic region (or TF binding site)
4. In End Col (end of genomic region), select 4 for the end of the genomic region
That is all we need to provide for SeqMonk to import the table.
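
The column settings above map directly onto how a generic table is read. The sketch below parses such a table with the same 1-based settings; the tab delimiter, the function name read_regions, and the two example rows are assumptions for illustration, since the real layout of the coordinates file is not shown here.

```python
import csv
import io


def read_regions(handle, start_row=1, chr_col=2, start_col=3, end_col=4, delimiter="\t"):
    """Read (chromosome, start, end) tuples using 1-based row and column numbers."""
    regions = []
    for row_number, row in enumerate(csv.reader(handle, delimiter=delimiter), start=1):
        if row_number < start_row or not row:
            continue  # skip rows before the data starts, and blank lines
        regions.append((row[chr_col - 1], int(row[start_col - 1]), int(row[end_col - 1])))
    return regions


# Invented two-row table: name, chromosome, start, end.
example = io.StringIO("GENE_A\t4\t1000\t1200\nGENE_B\t17\t5000\t5300\n")
regions = read_regions(example)
print(regions)  # [('4', 1000, 1200), ('17', 5000, 5300)]
```

If the file had a header line, setting start_row to 2 would skip it, which is the role of the Start at Row option.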

35 Define Probe Using STAT3 Peaks
Now that we have imported the TF binding site coordinates (see the List Panel, under Annotation Sets, for STAT3_ChIPSeq_Genes.txt), we can use them to define our Probes.
1. Go to Data, then select Define Probes...
In the new Define Probes... window:
2. Select the Feature Probe Generator
3. In Feature to design around, choose STAT3_ChIPSeq_Genes.txt
4. In Remove exact duplicate: check
5. In Ignore feature strand information: check
6. Select Over feature, and set the From and to values (the strategy on Page 31 uses 2000 bp upstream and downstream)
7. Click Create Probes
Warning: One of the major limitations of SeqMonk is that it can only store one set of Probes. Therefore, when a new set of Probes is defined here, the old set will be removed.
8. Click Yes to acknowledge the removal of the old set of Probes
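
The probe definition in these steps amounts to widening each binding site and dropping exact duplicates. The sketch below shows that idea with the 2000 bp flanks from the analysis strategy; the function name make_probes, the clamping at position 1, and the example sites are assumptions, not SeqMonk's implementation.

```python
def make_probes(sites, upstream=2000, downstream=2000):
    """Extend each (chromosome, start, end) site and remove exact duplicates."""
    probes = []
    seen = set()
    for chrom, start, end in sites:
        # Strand is ignored, so "upstream" simply means lower coordinates.
        probe = (chrom, max(1, start - upstream), end + downstream)
        if probe not in seen:
            seen.add(probe)
            probes.append(probe)
    return probes


# Invented binding sites, including one exact duplicate.
sites = [("4", 5000, 5200), ("4", 5000, 5200), ("4", 500, 600)]
probes = make_probes(sites)
print(probes)  # [('4', 3000, 7200), ('4', 1, 2600)]
```

The duplicate site yields a single probe, and the site near the chromosome start is clamped so the probe does not begin before position 1.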

36 Probe Quantitation
Once we have set up the Probe definition, we are ready to quantify the reads within those Probes.
1. Select Read Count Quantitation
In the new Define Quantitation window:
2. In Count reads in strand, select All Reads
3. In Correct for total read count: check
4. In Correct to what?, choose Largest DataStore
5. In Count total only within probes: check
6. In Correct for probe length: check
7. In Log Transform Count: check
8. Click Quantitate to proceed
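
The options above combine three corrections: scaling to the largest data store, correcting for probe length, and log transforming. The sketch below shows one plausible way these could be combined. The exact formula SeqMonk applies is not given on this page, so the per-kilobase scaling, the handling of zero counts, and all names and numbers are assumptions.

```python
import math


def quantitate(counts, totals, probe_length_bp):
    """Return a log2, length-corrected, depth-corrected value per sample.

    counts: reads in the probe per sample; totals: total reads per sample.
    """
    largest = max(totals.values())  # "Largest DataStore" as the reference depth
    values = {}
    for sample, count in counts.items():
        scaled = count * largest / totals[sample]      # correct for total read count
        per_kb = scaled / (probe_length_bp / 1000)     # correct for probe length (assumed per kb)
        # Assumption: probes with no reads are reported as negative infinity.
        values[sample] = math.log2(per_kb) if per_kb > 0 else float("-inf")
    return values


# Invented numbers: sample "b" was sequenced half as deeply as sample "a".
counts = {"a": 40, "b": 20}
totals = {"a": 2_000_000, "b": 1_000_000}
values = quantitate(counts, totals, probe_length_bp=4000)
print({sample: round(v, 2) for sample, v in values.items()})
```

After the depth correction, the two samples end up with the same value even though their raw counts differ by a factor of two.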

37 Statistical Test & FDR
As in the RNA-seq analysis, once we have defined the Probes we can perform a statistical test to identify differentially expressed Probes.
1. Go to Filtering, then select Filter by Statistical Test, followed by Intensity Difference...
In the new Intensity Difference Filter window:
2. In From Data Store / Group, select the replicate set ABC
3. In To Data Store / Group, select the replicate set GCB
4. In P-value must be below =
5. In Apply Multiple Testing Correction: check
6. Click Run Filter to begin the test
7. In the Found XXX probes window, give the list a meaningful name
8. The newly created list of significant Probes will show up on the List Panel, under Probe Lists

38 Annotate STAT3 Regulated Genes
The newly created list of potential STAT3-regulated genes is annotated to give biological meaning to the list.
1. Go to Reports, then select Annotated Probe Report...
In the new Annotated Probe Report Options window:
2. In Annotate with, select closest and gene
3. In Annotation distance cutoff, type 10,000 bp
4. Select Exclude, unannotated probes
5. Select Include, data for currently visible stores
6. Click OK to proceed
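
Annotating with the closest gene under a distance cutoff can be pictured as a nearest-neighbor search along the chromosome. The sketch below is an illustration under stated assumptions: distance is measured between the nearest edges of probe and gene, overlapping features have distance 0, and the gene names and coordinates are invented.

```python
def distance(a_start, a_end, b_start, b_end):
    """Gap in bp between two intervals; 0 if they overlap."""
    if a_start <= b_end and b_start <= a_end:
        return 0
    return b_start - a_end if b_start > a_end else a_start - b_end


def closest_gene(probe, genes, cutoff=10_000):
    """Return (gene_name, distance) for the nearest gene within cutoff, else None."""
    chrom, start, end = probe
    best = None
    for name, gene_chrom, gene_start, gene_end in genes:
        if gene_chrom != chrom:
            continue
        gap = distance(start, end, gene_start, gene_end)
        if gap <= cutoff and (best is None or gap < best[1]):
            best = (name, gap)
    return best


# Invented genes: (name, chromosome, start, end).
genes = [("G1", "4", 20000, 25000), ("G2", "4", 8000, 9000), ("G3", "5", 100, 200)]
print(closest_gene(("4", 3000, 7200), genes))       # ('G2', 800)
print(closest_gene(("4", 100000, 101000), genes))   # None
```

Probes that return None correspond to the unannotated probes that the Exclude option leaves out of the report.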

39 Potential STAT3 Regulated Genes
The annotated table contains all the biological information about the gene list. The table can be sorted using the Diff p-value column to further refine the list. Note that this column holds the FDR (False Discovery Rate) corrected p-value. The table can be exported as a text file and manipulated further in Microsoft Excel.
Note: An asterisk (*) marks genes reported in the paper that we also found in our analysis.