Leveraging biological replicates to improve analysis in ChIP-seq experiments

CSBJ

Yajie Yang a,b, Justin Fear a,b, Jianhong Hu c, Irina Haecker d, Lei Zhou a, Rolf Renne a,b,e, David Bloom a, Lauren M McIntyre a,b,*

Abstract: ChIP-seq experiments identify genome-wide profiles of DNA-binding molecules including transcription factors, enzymes and epigenetic marks. Biological replicates are critical for reliable site discovery and are required for the deposition of data in the ENCODE and modENCODE projects. While early reports suggested two replicates were sufficient, the widespread application of the technique has led to an emerging consensus that the technique is noisy and that increasing replication may be worthwhile. Additional biological replicates also allow for quantitative assessment of differences between conditions. How to confirm peak identification and determine signal strength across biological replicates remains controversial, particularly when the number of replicates is greater than two. Using objective metrics, we evaluate the consistency of biological replicates in ChIP-seq experiments with more than two replicates. We compare several approaches for binding site determination, including two popular but disparate peak callers, CisGenome and MACS2. We propose read coverage as a quantitative measurement of signal strength for estimating sample concordance. Determining binding based on genomic features, such as promoters, is also examined. We find that increasing the number of biological replicates increases the reliability of peak identification. Critically, binding sites with strong biological evidence may be missed if researchers rely on only two biological replicates.
When more than two replicates are performed, a simple majority rule (>50% of samples identify a peak) identifies peaks more reliably in all biological replicates than the absolute concordance of peak identification between any two replicates, further demonstrating the utility of increasing replicate numbers in ChIP-seq experiments.

a Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, Florida, USA
b UF Genetics Institute, University of Florida, Gainesville, Florida, USA
c Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
d Department of Applied Entomology, University of Giessen, Giessen, Germany
e UF Shands Cancer Center, University of Florida, Gainesville, Florida, USA
* Corresponding author. E-mail address: mcintyre@ufl.edu (Lauren M McIntyre)

Introduction

The goal of chromatin immunoprecipitation (ChIP) experiments is to map the binding sites of a molecule (usually a protein) across the genome in a cell type or tissue [ ]. ChIP assays start by cross-linking DNA and its bound molecules with formaldehyde. The cross-linked chromatin is sheared into small fragments by sonication, and the DNA-protein complexes of interest are recovered using specific antibodies, resulting in an enrichment of DNA fragments that were bound by the protein of interest. The cross-linking is then reversed and the DNA fragments are released from the binding complex to be assayed. Usually there is a PCR amplification step to increase the amount of starting DNA. The first genome-wide ChIP studies used microarrays (ChIP-chip) to analyze the DNA fragments [2,3], which can now be sequenced directly (ChIP-seq) using massively parallel sequencing [4-6].

Different patterns of peaks will form at putative binding sites after the sequence reads are aligned to a reference genome.
Peaks produced by site-specific binding of transcription factors are very narrow, while peaks of specific histone modifications are more diffuse and can cover large domains of DNA across several nucleosomes [7-9]. These two distinct types of binding are termed point source and broad source, respectively. RNA polymerase II is an example of a mixed source factor, which can form both highly localized and spreading peaks at different genome positions [ 0, ]. In addition to sequences truly associated with the molecule of interest, random background noise is also present due to non-specific binding or biases in library construction and sequencing [ 2] [ 3-6]. Peak placement depends upon the background in each independent experiment. The use of control samples may mitigate these biases but cannot eliminate all sources of noise.

Replication is necessary to separate actual biological events from variability resulting from random chance [ 0, 8]. Technical replication measures a single biological sample repeatedly and allows estimation of the variability in the sequencing process. Biological replication measures multiple biological samples independently and enables inferences about the biological activity of the broader population from which the samples are drawn. Biological replicates and their advantage over technical replicates have been well described in the context of gene expression studies such as microarrays (e.g. [ 9-22]) and mass spectrometry [23], and more recently in RNA-seq experiments [24,25]. For ChIP-seq experiments, with the ease of multiplexing and the plummeting costs of sequencing, increased sample sizes (i.e. number of replicates) are not only more affordable but are also becoming standard practice. For example, the ENCODE consortium requires a minimum of two biological replicates in ChIP experiments [26].

Supporting information for this article is available at

There is not yet consensus on how to analyze multiple-replicate ChIP-seq samples (Table 1). Pooling biological replicates is common in current ChIP-seq protocols. In some cases multiple biological samples were pooled and then divided into aliquots before sequencing [ 2]. Other investigators sequenced the biological replicates separately but pooled the sequencing data before proceeding to data analysis [6, 8,27,28]. Pooling replicates is also integrated into the ENCODE framework [29], where the replicates are first analyzed separately to determine the Irreproducible Discovery Rate (IDR) [30], and then pooled for identification of the peaks passing the IDR threshold.

IDR operates on pairs of replicates and has several limitations. For the bivariate model of IDR, the preliminary peaks have to contain both high quality peaks and peaks that are most likely to be only noise, and the algorithm is currently implemented for only a few peak callers such as SPP [3 ] and MACS [32], with the caveat that the IDR developer has not optimized it for MACS and recommends against it. Investigators, however, may prefer peak callers optimized for the binding factor of interest. More stringent peak callers such as CisGenome [33] and QuEST [34] are not currently configured in the IDR package. Moreover, IDR relies on the ranking of the preliminary peaks and does not handle ties in the ranks, although such ties are common in ChIP-seq peaks. A true signal may be dropped by IDR when one replicate is noisier, because IDR favors signals with consistent ranking over signals that rank high in one replicate but low in the other. In this scenario, weak signals with consistent ranking between replicates are considered more credible than signals that are strong in one replicate but weak in the other (inconsistent ranking).

In genomic experiments, independent processing of biological replicates is standard. Combined data may be unduly influenced by an outlier sample.
Detection rates are also reduced, and binding sites with smaller signal-to-noise ratios are especially affected. Detection, however, is critical in ChIP-seq experiments for investigators who want to obtain maximal information. Another severe limitation of analyzing a single combined sample is that it precludes downstream quantitative comparisons across samples.

Recently attention has been drawn to analyzing individual samples separately in ChIP-seq experiments [9,35-4 ]. Some groups have proposed focusing the analysis on one replicate, using the additional samples for confirmation only [42]. Others have compared overlapping peaks from biological replicates for transcription factor occupancy [4,43], ChIP-seq quality control [44], and the study of cell cycle phases [45]. Still, there is no consensus about how to leverage the information provided by biological replicates.

In this study, we analyzed five ChIP-seq experiments with three or more replicates. Multiple methods for defining consensus peaks from biological replicates were considered in order to minimize variability and maximize consistency. We confirm results from genomic studies and conclude that more than two biological replicates are essential for ChIP-seq experiments. We propose using a simple majority rule for peak identification and show that this yields more reliable peaks than absolute concordance with fewer replicates.

Methods

We used five ChIP-seq data sets for this study. Two are previously unpublished and were created in our labs. The raw data (fastq files) of the other three were downloaded from Gene Expression Omnibus (GEO).

a) RNA Polymerase II ChIP-seq in Drosophila melanogaster with three replicates and one input DNA control (GEO accession: GSE36 07).
b) Transcription factor NFKB ChIP-seq [46] (GEO accession: GSE 9485) in human lymphoblastoid cell line GM The cells were stimulated with TNF-α to activate NFKB regulation.
This experiment consisted of five biological replicates and two IgG control samples.
c) FOXA ChIP-seq in mouse liver with five biological replicates and three input control samples [47,48] (GEO accession: GSE25836 and GSE33666).
d) H3K4me3 ChIP-seq in Drosophila melanogaster with three biological replicates and three input control samples (unpublished).
e) H3K27me3 ChIP-seq in mouse ganglia with three biological replicates and no input control (unpublished).

Biological replicates from each dataset were individually processed and underwent three levels of quality control (Figure 1). The fastq files were mapped to the genome (FlyBase 5.30 for Drosophila, mm9 for mouse, and hg19 for human) using Bowtie [49] with the -m, --best and --strata options. Aligned reads were visualized in the Integrative Genomics Viewer (Broad Institute) [50,5 ] to check the overall shape of the read distribution and the signal strength of the factor and the control at individual loci. Although this is not a quantitative metric, visible enrichment at known binding regions is expected in a successful ChIP-seq experiment. The PCR bottleneck coefficient (PBC) was calculated to measure approximate library complexity by taking the ratio of nonredundant uniquely mapped reads over all uniquely mapped reads. All the quality metrics based on the reads themselves and the initial alignments constitute QC1.

Peak identification from noisy ChIP-seq data is a challenging process, for which over 30 programs have been developed (for a review see [ 7]). In this study, we used two of the most popular peak callers, MACS2 [32] and CisGenome [33], which were found to perform better than other peak callers [ 2,30]. These two algorithms are also representative of the statistical models used for peak finding: MACS uses a dynamic Poisson distribution, while CisGenome uses a

negative binomial distribution to account for the local biases across the genome. Both programs were run with default settings with the input DNA samples as the control (except for the H3K27me3 dataset, for which the input control is unavailable). Notably, the default setting of MACS2 removes duplicate tags at the same location (--keep-dup=auto) and reports peaks with FDR <0.05 (-q 0.05), while CisGenome does not remove duplicates by default, and its cutoff for peak identification is a fold of enrichment >3 (-c=3.0) when an input control is used and > 0 (-c= 0) when the ChIP sample is analyzed alone.

Figure 1. Analysis pipeline for ChIP-seq experiments. Each biological replicate is individually aligned to the appropriate reference (Aln) and peaks are identified (e.g. with CisGenome or MACS). Quality control 1 (QC1) includes visual examination in a genome browser and quantification of total reads, uniquely mapped reads, and the PCR bottleneck coefficient (PBC). Quality control 2 (QC2) includes evaluation of the number of peaks, the fraction of reads in peaks (FRIP), phantom peaks, and common and unique peaks. Consensus peaks are summarized from overlapping peaks with four different criteria (described in Methods and Figure 2). Quality control 3 (QC3) examines correlation and agreement across replicates.

Additional settings were explored. For the H3K27me3 data, in addition to the results generated by the default setting, we also present results obtained by first removing duplicate tags and using c=6. Parameter choices are important and investigators should spend time adjusting the parameters in order to obtain a reasonable list of binding sites for their factor of interest. Our intention here is not to compare the peak callers themselves but to use disparate peak callers with disparate settings and diverse data to see whether universal conclusions can be drawn about processing biological replicates.
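The PBC used in QC1 follows directly from its definition in the Methods: nonredundant uniquely mapped reads divided by all uniquely mapped reads. The sketch below is illustrative only. It assumes that "nonredundant" means a distinct (chromosome, position, strand) combination and that the input has already been filtered to uniquely mapped reads; the function name and input layout are hypothetical and are not taken from the authors' pipeline.

```python
def pcr_bottleneck_coefficient(alignments):
    """Return distinct alignment positions / total uniquely mapped reads.

    `alignments` is an iterable of (chrom, position, strand) tuples, one per
    uniquely mapped read (assumed layout, for illustration only).
    """
    total = 0
    distinct = set()
    for aln in alignments:
        total += 1
        distinct.add(aln)  # duplicates collapse to a single entry
    if total == 0:
        raise ValueError("no uniquely mapped reads supplied")
    return len(distinct) / total


# Four reads, two of which are exact positional duplicates.
reads = [
    ("chr2L", 100, "+"),
    ("chr2L", 100, "+"),
    ("chr2L", 100, "-"),
    ("chr3R", 500, "+"),
]
pbc = pcr_bottleneck_coefficient(reads)
```

A value close to 1 suggests a complex library, while a low value points to many redundant reads, which matters here because the two peak callers treat duplicates differently by default.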
QC2 was performed after peak identification and included summarizing the number of peaks identified as well as metrics to evaluate peak quality. The fraction of reads in peaks (FRIP, [33]) was calculated to estimate the global enrichment of signals against the background. Normalized strand cross-correlation (NSC) and relative strand cross-correlation (RSC) measure enrichment independently of peak calling. NSC is the normalized ratio between the fragment-length cross-correlation peak and the background cross-correlation. RSC is the ratio between the fragment-length peak and the read-length peak.

For peaks independently identified from multiple replicates, it is unlikely that the exact peak position is the same across replicates. Peaks were considered overlapping among replicates if at least one nucleotide was shared. Unique and common peaks were identified across replicates. Peaks found only in a single replicate were considered unique. Peaks present in all replicates were considered common. The simple agreement coefficient was calculated as the number of overlapping peaks over all peaks identified in a pair of replicates. McNemar's test [56] evaluates the symmetry of identification for unique peaks, providing a measurement of agreement between replicates.

We explored several ways to define a consensus region from peaks that overlap among a set of replicates but differ in their exact positions (Figure 2). We compared: the maximum area encompassing the identified peak regions (MAX); the area between the summits of overlapping peaks (SMT); the area encompassing the known footprint size for a specific binding molecule, centered at the average summit (ASF); and the area determined by an empirical observation of average peak width, again centered at the average summit (ASW). If peaks were identified only in a subset of replicates, the consensus peaks were determined from the subset where individual peaks had been identified.
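The four consensus definitions can be made concrete with a small sketch. It is a minimal illustration, not the authors' implementation: each overlapping peak is assumed to be a (start, end, summit) tuple, the footprint size is supplied by the user, and the ASW width defaults to the mean width of the overlapping peaks themselves (the text does not say whether the empirical average is taken per group or over the whole dataset, so a dataset-wide value can be passed instead).

```python
from statistics import mean


def consensus_regions(peaks, footprint, avg_width=None):
    """Compute MAX, SMT, ASF and ASW regions for one group of overlapping peaks.

    peaks:     list of (start, end, summit) tuples, one per replicate
    footprint: assumed footprint size (bp) of the bound molecule, for ASF
    avg_width: empirical average peak width (bp) for ASW; if omitted, the mean
               width of the supplied peaks is used (an assumption)
    """
    starts = [s for s, _, _ in peaks]
    ends = [e for _, e, _ in peaks]
    summits = [m for _, _, m in peaks]
    centre = mean(summits)  # average summit across replicates
    if avg_width is None:
        avg_width = mean(e - s for s, e, _ in peaks)

    def centred(width):
        half = width / 2
        return (round(centre - half), round(centre + half))

    return {
        "MAX": (min(starts), max(ends)),      # union of all peak regions
        "SMT": (min(summits), max(summits)),  # span between outermost summits
        "ASF": centred(footprint),            # footprint around average summit
        "ASW": centred(avg_width),            # average width around average summit
    }


group = [(100, 300, 200), (120, 320, 220), (90, 290, 180)]
regions = consensus_regions(group, footprint=20)
```

Because the definitions differ mainly in width, the read coverage computed in each region will differ as well, which is why the choice of consensus is evaluated against replicate agreement.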
For each of these approaches, the coverage in consensus peaks was calculated as the Reads Per Kilobase per Million mapped reads (RPKM, [52]) for each sample.

QC3 was developed to quantitatively evaluate the agreement across replicates. Consistency between pairs of replicates was explored using weighted Kappa coefficients [53] of ranked coverage (groups=5) and Spearman's correlation. Bland-Altman plots were also used to visually examine the differences between two replicates plotted against their mean [54,55].

In many cases peaks were present in all replicates, but there were also cases where peaks were identified only in a subset of replicates. We proposed a simple majority rule and considered a peak to be a consensus peak if it was detected in a majority of replicates, based on the reasoning that (1) if peak detection were random, the likelihood of seeing a peak in the same location in multiple replicates would be small, and (2) given the noisy nature of ChIP-seq samples, a particular tool's chance of not identifying a peak in a region (false negative) is known to be large (Supplemental Figure 7). As the sample size of a ChIP-seq experiment increases, requiring an absolute consensus (100% agreement) will increase the false negative rate substantially. The majority rule allows a simple extension of the consensus between two replicates (the guideline proposed by [26]) to more complex situations.

A majority consensus peak is supported by the majority of samples, allowing possible dissent in the other replicates. Naturally, this raises the question of the reliability of peaks that have not been called unanimously. To determine whether a missing peak in some of the replicates was due to a lack of reads or merely a potential false negative from the peak discovery software, we tested for evidence that reads were enriched in the replicates where the software initially failed to identify a peak.
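The majority rule can be sketched in a few lines. The example below is a hedged illustration under stated assumptions: peaks are half-open intervals tagged with a replicate label, peaks sharing at least one nucleotide are grouped transitively into one cluster (one possible reading of "overlapping"), and the label "minority" for clusters seen in more than one replicate but not in a majority is introduced here for completeness. The names `Peak`, `cluster_peaks` and `classify` are hypothetical.

```python
from collections import namedtuple

# Half-open interval [start, end) called in one replicate (assumed layout).
Peak = namedtuple("Peak", "chrom start end replicate")


def cluster_peaks(peaks):
    """Group peaks that share at least one nucleotide, across replicates."""
    clusters = []
    for p in sorted(peaks, key=lambda p: (p.chrom, p.start, p.end)):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == p.chrom and p.start < last["end"]:
            last["end"] = max(last["end"], p.end)  # extend the cluster
            last["replicates"].add(p.replicate)
        else:
            clusters.append({"chrom": p.chrom, "start": p.start,
                             "end": p.end, "replicates": {p.replicate}})
    return clusters


def classify(cluster, n_replicates):
    """Label a cluster by how many replicates support it."""
    k = len(cluster["replicates"])
    if k == n_replicates:
        return "common"    # absolute consensus
    if k > n_replicates / 2:
        return "majority"  # >50% of replicates
    if k == 1:
        return "unique"
    return "minority"      # e.g. 2 of 5 replicates


peaks = [
    Peak("chr1", 100, 200, "r1"), Peak("chr1", 150, 250, "r2"),
    Peak("chr1", 180, 220, "r3"),
    Peak("chr1", 500, 600, "r1"), Peak("chr1", 550, 650, "r2"),
    Peak("chr1", 900, 950, "r3"),
]
labels = [classify(c, n_replicates=3) for c in cluster_peaks(peaks)]
```

Under the rule described above, both "common" and "majority" clusters would be retained as consensus peaks, whereas "unique" clusters would be treated mainly as an indicator of sample quality.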
For each sample, we used the peaks identified in that sample to estimate the distribution of RPKM values for peaks in that particular sample. RPKM values below the 25th percentile of these peaks were considered the background. For each candidate peak region we used a Z-test in which the null hypothesis is that the RPKM of the region is not greater than the background. The peak was considered to be detected above background (DABG) when the null hypothesis was rejected (i.e. the RPKM of the peak was greater than the 25th percentile of the RPKM of all peaks of that sample).
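One way to read the DABG test is sketched below. The text does not specify how the background mean and variance were estimated, so this example makes an explicit assumption: the background is the set of peak RPKM values below the sample's 25th percentile, and a candidate region is compared to the mean and standard deviation of that set with a one-sided Z-test. The function name `dabg` and the default threshold argument are illustrative.

```python
import math
import statistics


def dabg(candidate_rpkm, sample_peak_rpkms, alpha=0.05):
    """One-sided Z-test of a region's RPKM against the low-coverage background.

    Background = RPKM values below the 25th percentile of the peaks called in
    the same sample (assumed estimator; the paper gives no formula).
    Returns (z, p, detected).
    """
    q1 = statistics.quantiles(sample_peak_rpkms, n=4)[0]  # 25th percentile
    background = [v for v in sample_peak_rpkms if v < q1]
    mu = statistics.mean(background)
    sd = statistics.stdev(background)  # needs >= 2 distinct background values
    z = (candidate_rpkm - mu) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p, p < alpha


# Toy RPKM values for the peaks called in one replicate.
sample = [float(v) for v in range(1, 41)]
strong = dabg(30.0, sample)  # region with clear enrichment
weak = dabg(5.0, sample)     # region within the background range
```

Applied to a region where a peak was called in other replicates but not in this one, a rejected null hypothesis suggests that the missing call is a false negative of the peak caller rather than a true absence of signal.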

Figure 2. Defining the consensus regions for overlapping peaks across replicates. (A) Scheme showing different methods of combining individual peaks into a consensus. MAX: the maximum area encompassing all peak regions. SMT: the area between the summits of peaks. Summits of individual peaks are marked in red. The average summit of individual peaks is shown as the star. ASF: an area the size of the footprint of the bound protein, centered at the average summit. ASW: an area the size of the average peak width, centered at the average summit. (B) Snapshot of signals (grey bar charts on top), algorithmically identified peaks (black) and the consensus regions (blue) for point source factors that form narrow peaks at the transcription start site (TSS). The ChIP signals are distinct compared to the input control. The signal profiles are highly similar for all five replicates when the signal range is not fixed but is allowed to auto-adjust to the local background (not shown). Here the range is set to a constant to allow comparison of the relative signal strengths, which vary across samples. The peaks identified in individual samples are similar in their position and width. (C) Snapshot for broad source factors whose binding signals span an entire gene (cropped at the 3' end for readability). There are bigger differences in the identified peaks across replicates.

The General Feature Format (GFF) file containing the genomic annotation of D. melanogaster was downloaded from: ftp://ftp.flybase.net/genomes/drosophila_melanogaster/dmel_r5.3 0_FB20 0_07/. The promoters were defined as +/-2kb from the TSSs. The genic regions were taken as the region from 2kb upstream of the TSS to 2kb downstream of the transcription termination site (TTS). Agreement between the RPKM of pairs of replicates was inspected using Bland-Altman plots for both promoters and genic regions.
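The promoter and genic windows defined above are easy to get wrong on the minus strand, so a short sketch may help. It assumes 1-based inclusive transcript coordinates as in GFF, takes the TSS as the transcript start on the plus strand and the transcript end on the minus strand, and clips windows at position 1; clipping at the chromosome end is omitted. The function names and the `FLANK` constant are hypothetical.

```python
FLANK = 2000  # 2 kb, as stated in the Methods


def tss(start, end, strand):
    """Transcription start site in genomic coordinates."""
    return start if strand == "+" else end


def promoter(start, end, strand, flank=FLANK):
    """Window of +/- flank around the TSS, clipped at position 1."""
    site = tss(start, end, strand)
    return (max(1, site - flank), site + flank)


def genic_region(start, end, strand, flank=FLANK):
    """From flank upstream of the TSS to flank downstream of the TTS.

    In genomic coordinates this is the same interval on both strands, so
    `strand` is accepted only to keep the interface parallel to `promoter`.
    """
    return (max(1, start - flank), end + flank)


plus_promoter = promoter(10000, 15000, "+")
minus_promoter = promoter(10000, 15000, "-")
gene_window = genic_region(10000, 15000, "-")
```

Read counts falling inside these windows can then be converted to RPKM and compared between replicates in the same way as coverage in consensus peaks.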
Results

For all of the experiments we examined, the read-level QC showed that sequencing depth and quality varied among replicates (Supplemental Tables 1 and 2). Sufficient numbers of total reads and uniquely mapped reads are necessary for binding site discovery. The RNAPII data met the rule of thumb promoted for the minimal number of mapped reads per sample, which is 2 million for Drosophila and 10 million for mammalian genomes [26]. Under this rule the FOXA and NFKB experiments appeared to lack sequencing depth. The first replicate of the H3K4me3 data had far fewer reads than the other replicates. Consistent with their biological functions, the binding signals of RNAPII and H3K4me3 were associated with genic regions, with more prominent peaks near the transcription start sites (TSSs) (Supplemental Figure ). Clear and narrow peaks were found at the TSSs of known NFKB targets such as TP53 [57,58], NFKBIA [59,60], NFKB [6 ] (Supplemental Figure ) and SHH [62].

QC2 revealed that the numbers of independently identified peaks differed among replicates of the same experiment (Supplemental Table ), and the difference between peak calling programs was evident. The performance of the same parameter settings depended upon the particular experiment, and there was no immediately transparent mapping between the two underlying models of MACS2 and CisGenome. Using default settings, MACS2 [32] identified more peaks in the RNAPII data while CisGenome [33] identified more in the other datasets. CisGenome peaks were also wider, especially for the NFKB data. Multiple consecutive peaks identified by MACS2 in RNAPII were frequently identified as a single peak by CisGenome (Supplemental Figure ). The fraction of reads in peaks (FRIP) varied with the number of peaks identified (Supplemental Table ).

Parameter exploration demonstrated that the default settings of MACS2 and CisGenome differ beyond the underlying statistical models (Poisson vs. negative binomial).
For example, the plentiful redundant reads in low PBC samples have to be removed deliberately for CisGenome but are

automatically removed in MACS2. When this step was disabled in MACS2 with the --keep-dup option, the number of peaks became comparable to that identified by CisGenome for RNAPII and NFKB (data not shown). When redundant reads were removed, the number of peaks identified by CisGenome and the FRIP dropped noticeably and were closer to those of the default settings in MACS2 (Supplementary Table 3; Supplementary Table 4). Peak-independent measurements of enrichment such as normalized strand cross-correlation (NSC) and relative strand cross-correlation (RSC) suggested that three of the NFKB replicates were of medium quality, and the remaining samples were of high or very high quality (Supplemental Table 2).

Without prohibitively costly independent validation experiments, the rates of false positive and false negative peaks cannot be accurately estimated. However, consistency of replicates provides a proxy for such an estimate, as the general assumption is that peaks identified in multiple samples, in approximately the same region, represent the same protein/DNA binding phenomenon. As shown by the peak-level QC2, despite discrepancies in the number of peaks identified by CisGenome and MACS2 in individual replicates, the numbers of common peaks were more comparable between the two programs (Table 2; Supplemental Table 3). The proportion of overlapping peaks between a pair of replicates reflects sample agreement, which was fair for the RNAPII and NFKB data (Supplemental Table 3a). The agreement was reasonable for the H3K27me3 data when MACS or adjusted CisGenome was used, but decreased when the peaks were identified using the default settings of CisGenome (Supplemental Table 3a). For the H3K27me3 dataset, we therefore focused on the results from the adjusted rather than the default settings of CisGenome. Similarly, the default CisGenome settings did not perform well for the H3K4me3 data. This was probably because CisGenome, unlike MACS2, was not optimized for histone signals (broad peaks).
The FOXA data also had few reproducible peaks across replicates. Compared to the other datasets, the FOXA data appeared noisier in the genome browser, and we were not able to observe noticeable peaks near selected known FOXA target genes. The metric we proposed (proportion of overlapping peaks) and the existing metrics (sequencing and mapped reads) all suggest high background noise in these data. The researchers in the original report combined the five replicates into one sample prior to analysis.

Generally, the number of peaks increases with the number of sequence reads for both CisGenome and MACS2 (Supplemental Table ), consistent with previous studies [ 0]. McNemar's test [56] demonstrates that the unique peaks do not match for a given pair of replicates, with more peaks being identified in samples with greater sequencing depth (Supplemental Table 3b). However, this pattern was not strictly followed by the samples with high PCR bottleneck coefficient values (PBC>0.7).

Read coverage within specific peaks provides a quantitative measurement of enrichment above background. We calculated the Reads Per Kilobase per Million mapped reads (RPKM, [52]) in the consensus regions for common peaks (defined in Methods). Because differently defined consensus regions mostly varied in width (Figure 2), the choice of consensus region affected read coverage and in turn the estimate of sample agreement, though this effect was small (Figure 3; Supplemental Figure 2). ASF consensus peaks had relatively lower agreement across replicates, indicating that ASF is not a good choice of consensus despite its use of biological knowledge of a protein's footprint size. It has been reported that although factors bind short regions of DNA (typically 5-25 bp), the DNA fragments that are pulled down typically cover a wider region of bp around the binding site [ 3].
Therefore the width of identified peak regions does not always reflect the actual size of the biological binding site. We also examined the enrichment in the regions corresponding to peaks identified in the replicate with the most reads. This is comparable with other ChIP-seq studies that arbitrarily selected one replicate as the reference sample (e.g. [42]). Unsurprisingly, such consensus peaks were heavily biased towards the sample that was selected as the standard (Supplemental Figure 2).

For RNAPII and NFKB, CisGenome called fewer peaks, and these had higher agreement across replicates (Supplemental Figure 3: BA plots with a narrower Y-axis where points are symmetrical around 0, higher Kappa and Spearman's coefficients), indicating that these peaks were of higher quality. These peaks were also wider, including more reads that covered broader regions. In the H3K4me3 data, MACS2 identified fewer but higher quality peaks compared to CisGenome. The first replicate of the H3K4me3 data was less correlated with the other replicates (Supplemental Figure 4) and is possibly an outlier, as hinted by its lower read counts. The adjusted CisGenome and MACS2 yielded comparable Kappa and Spearman's coefficients for the H3K27me3 data. However, the distribution of the BA plots indicated that CisGenome peaks have better agreement (Supplemental Figure 6).

Despite the difference in the number of identified peaks, the RNAPII, NFKB and H3K27me3 replicates were highly correlated in terms of signal quantification (Figure 3; Supplemental Figure 5; Supplemental Figure 6). QC based on sequencing depth (QC1) and peak calling results (QC2) may identify the third replicate of the NFKB experiment as failed; however, when measured quantitatively (QC3), it actually had good agreement with the other samples (Supplemental Figure 5).

Figure 3. Consistency across replicates of the RNAPII ChIP-seq experiment. (A) Boxplot of weighted Kappa coefficients. The coverage in the consensus peak was binned into five ranked groups. The agreement of such ranked coverage between replicates is reflected by the weighted Kappa coefficients. A value over 0.75 indicates excellent agreement, which was met for all replicates regardless of the consensus being used. (B) Heat map of the Spearman correlation of the coverage in the consensus peak. Correlations were high. (C) Bland-Altman plots show the relationship between the difference (Y axis) and the mean (X axis) for a pair of replicates. Narrow and symmetrical plots reflect better agreement. Replicate 2 and replicate 3 are shown here, but other pairs (Replicate 1 vs Replicate 2, Replicate 1 vs Replicate 3) have similar patterns. Data shown are based on CisGenome peaks; more information is in Supplemental Figure 3.

Figure 4. Percentages of peaks detected above background (DABG) in replicates where no algorithmically identified peaks were present. The read coverage (RPKM) in each identified peak, unique or common, was compared to the lower quartile of coverage in all peaks for that sample. The peak was considered detectable if the difference was statistically significant by a Z test. Peaks that were identified in the majority of replicates were more often confirmed by DABG than those that were unique to one replicate (Supplemental Table 3). The Y axis is the percentage of the peaks DABG; the mean is indicated by the solid line and the whiskers are the 25th and 75th percentile values.

Figure 5. Spearman correlation coefficients were similar when the peaks were identified in all replicates or in the majority of the replicates. However, the correlation was much lower for uniquely identified peaks. The Y axis is the correlation coefficient; the mean is indicated by the solid line and the whiskers are the 25th and 75th percentile values.

Figure 6. Bland-Altman plots showing the sample agreement, using genomic features as the quantification unit. The difference (Y axis) between a pair of replicates at the genomic feature (transcript for RNAPII [A] and TSS for H3K4me3 [B]) was plotted against the average of the two samples. (A) Enrichment in the transcripts showed agreement for all replicates of the RNAPII data. (B) The first replicate of H3K4me3 appears to be an outlier sample, with little agreement with the other replicates, while the second and third replicates agreed with each other in their enrichment near the TSS.

Due to the noisy nature of ChIP experiments and the limitations of peak calling programs, peak identification varies across samples. Requiring support from all replicates for common peaks is likely to increase the false negative rate. We hypothesized that if a peak was identified in more than 50% of the replicates (i.e. two out of three, three out of five) there is sufficient support for its existence. More peaks were included as common under this majority rule (Table 2, "Common in the majority"). We tested whether the failure to identify a peak in some replicates is likely to be a false negative or whether there is no enrichment of binding in that area for that replicate. The probability of detection above background (DABG) was used to determine whether the observed signal in the putative peak region was greater than the first quartile of detected peaks in that sample (Z test p<0.05, see Methods).

Visual inspection using the genome browser found clear peaks at the TSS of known NFKB targets such as TP53 [57,58], NFKBIA [59,60], NFKB [6 ] and SHH [62], though these peaks were not identified in all replicates by CisGenome or MACS2 (Supplemental Figure ). In addition, there were also distinct increases of signal near the TSS of BRCA2 and PTEN, both of which are known targets of NFKB [63,64] but were not identified as peaks (Supplemental Figure 7).
The absence of identified peaks in these regions may be the result of insufficient coverage or excessive noise at these genome positions.

Compared to the absolute consensus, more peaks were included as common under the majority rule (Table 2, "Common in the majority"). For the RNAPII data, peaks that were identified in the majority of replicates had a high confirmation rate using the test for detection above background (DABG), particularly when compared to the DABG tests for unique peaks, regardless of the peak caller used or the consensus definition (Figure 4; Supplemental Table 4). Similarly, in the H3K27me3 data, the DABG was 55% - 58% in the other replicates for the peaks identified solely in the third replicate, but increased to 8 % - 85% when the peaks were also identified in an additional replicate. More than 92% of unique peaks in NFKB's first replicate were also supported by other replicates. This suggests that many genuine signals were missed by the peak callers. Consistent with QC1 and QC2, peaks identified only in the third and fourth replicates of the NFKB data were significantly above background only in % and 25% of the other replicates. When the majority rule was used, 100% of the peaks were also identified by DABG in the additional two replicates. DABG thus enables additional quality assessment and provides an objective measure of whether peaks identified by the majority rule have supporting evidence in all replicates.

Spearman's correlation between pairs of replicates was high, as expected, when using peaks that were identified by the peak callers in all replicates. The correlation was only slightly lower when the peaks that were identified in the majority were also included (Figure 5 showing RNAPII; Supplemental Table 6). However, when only one replicate was required for peak identification, the correlation in enrichment among replicates dropped dramatically (Figure 5 showing RNAPII; Supplemental Table 6), indicating that peaks identified in the majority of replicates were comparable to the common peaks, and that both were much more reliable than peaks identified in a single replicate. The performance of different methods for determining consensus peaks depended upon the mode of molecular binding, the data quality and the peak caller used. For the data we examined, MAX, SMT and ASW consensus peaks yielded a high estimate of consistency for point and mixed source factors. The results were less conclusive for the broad source factors.

Genomic features may serve as a reasonable alternative quantification unit for well-annotated genomes. For example, because H3K4me3 marks are associated with TSSs, sample consistency can be inferred by inspecting the read coverage at TSSs. Even for factors whose functions are less well defined, the regulation of many proteins is gene-centric, so the binding strength in nearby genic regions may provide a measure of biological activity. We calculated the coverage in the regions surrounding the TSS for the H3K4me3 data and the coverage in the transcripts for the RNAPII data. Enrichment in the regions surrounding the TSS was in good agreement for the second and third replicates of the H3K4me3 data (Figure 6). Consistent with other measures, the first replicate of H3K4me3 seems to be an outlier sample. The enrichment in the transcripts was in good agreement for all replicates of the RNAPII data (Figure 6).
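The two concordance measures used in this section, Spearman's correlation and the Bland-Altman difference-versus-average comparison, can be sketched with the Python standard library alone. The sketch assumes each replicate has already been reduced to one coverage value per consensus peak or genomic feature, listed in the same order; whether and how coverage was transformed before plotting is not stated in this section, so no transformation is applied. The 1.96 standard deviation limits of agreement are the conventional Bland-Altman choice rather than something taken from the text, and all function names are illustrative.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


def bland_altman(x, y):
    """Return per-feature averages and differences, the mean difference
    (bias), and the conventional 1.96 SD limits of agreement."""
    averages = [(a + b) / 2 for a, b in zip(x, y)]
    differences = [a - b for a, b in zip(x, y)]
    bias = mean(differences)
    spread = 1.96 * stdev(differences)
    return averages, differences, bias, (bias - spread, bias + spread)


# Toy coverage values for two replicates over the same three features.
rep_a = [10.0, 20.0, 30.0]
rep_b = [12.0, 18.0, 33.0]
rho = spearman(rep_a, rep_b)
averages, differences, bias, limits = bland_altman(rep_a, rep_b)
```

An outlier replicate of the kind described for H3K4me3 would show up here as a large bias or wide limits of agreement in every pairing that includes it.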
Discussion

Noise may be introduced during many steps of ChIP. Some may arise from technical issues in IP, library construction, or sequencing. Other noise may be due to biological differences among individual samples. As the tissue specificity of transcription factor binding and DNA modification has been demonstrated by the ENCODE project, we also expect that tissue samples are more heterogeneous than cell lines, which may in turn be more heterogeneous than prokaryotes. This noise makes peak identification from ChIP-seq data a challenging task and calls for guidelines that consider all the sources of variability. Towards this end, we analyzed three publicly available ChIP-seq datasets and two of our own datasets with three or more biological replicates. Consistent with expression profiling techniques, we find that more replicates produce results that can be evaluated quantitatively as well as qualitatively. We propose that ChIP experiments should include at least three replicates and use the consensus peaks found in a majority of samples. Peaks common to all samples and peaks unique to a single sample can be used as indicators of individual sample quality. Deeply sequenced experiments, such as the RNAPII data in this study, had better concordance among replicates than those with lower read counts. Encouragingly, reproducible peaks could still be determined from the studies with lower coverage. Quantification of the signals in the consensus regions was consistent among replicates even when a peak was not initially identified for a particular replicate. Despite their distinct models for peak identification, the two programs used in this study (CisGenome and MACS2) produced comparable quantitative measurements of consensus peaks and led to similar conclusions about the utility of replicates.
Although we focused on default settings for this exercise, adjusting the settings of peak callers can improve the concordance of peak identification among replicates. The real binding sites are unknown for most ChIP studies. A strategy that requires identification of a peak in all replicates (absolute consensus) will exclude genuine binding sites. The failure to detect a peak in a particular sample may be due to low coverage or high background at that peak position, in combination with the uncertainty in peak calling algorithms. A practical approach to maximize site discovery is to increase the number of replicates. We showed that peaks identified in the majority of replicates were likely to be enriched above background in the replicates where the initial peak calling process had failed. When more than two replicates were examined, many peaks that would be considered unique in a pair of replicates were confirmed in an additional replicate. Peaks identified in the majority (>50%) of replicates were frequently confirmed in the missing replicates when they were specifically tested for detection above background, while the confirmation rate for unique peaks was much lower, suggesting that these majority peaks were more likely to be true positives. Equally important, no single replicate was the source of most discrepancies, so the inclusion of more replicates improved the number and quality of peaks for all replicates. The majority rule may be applied to other IP-seq studies. Twice as many microRNA binding sites were identified from two out of three replicates as from all three replicates using high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation (HITS-CLIP) technology [65]. Real target sites may not recur uniformly across replicates above background as defined by a particular peak discovery algorithm. Annotation-based approaches provide quantification that is independent of peak calling.
They are complementary to peak identification for promoter/transcript-associated protein binding, or can be employed when peak calling is difficult. Notably, they cannot replace peak callers, because many binding sites would be missed: previous ChIP experiments have demonstrated that transcription factors, even transcription activators such as STAT [6] and E2F [66,67], can bind in previously unknown regions of the genome, though the function of this binding remains unclear. The decade-long debates on replication for microarray experiments [68] and, more recently, RNA-seq data [69] apply to the current discussion of ChIP-seq data. Not only is an increase in replication sensible from a statistical point of view, allowing a quantitative assessment of differences between groups, it also enables identification of a higher number of reliable signals in noisy ChIP-seq data. The more variability in the sample source, the more biological replicates will be necessary. More replicates provide a shield against undercalling, as a particular peak caller is unlikely to identify all peaks in all replicates. In cases where a certain peak is missing in one sample but present in other replicates, the signal in the missing sample can be estimated from the other replicates and tested for detection above background in that replicate.
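The detection above background (DABG) idea can be sketched as follows. The text states only that a Z test (p<0.05) compared the observed signal in a putative peak region with the first quartile of detected peaks in that sample; the exact statistic is defined in the Methods, which are not reproduced here. The version below is therefore an assumption: a one-sided Z test of the mean per-position coverage in the region against the first-quartile peak signal. The function name `dabg_test`, its parameters, and the toy numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, quantiles, stdev


def dabg_test(region_coverage, detected_peak_signals, alpha=0.05):
    """One-sided Z test: is mean coverage in the region above the first
    quartile of signal among peaks already detected in this sample?

    `region_coverage` needs at least two values. Returns (p_value, detected).
    """
    # First quartile of the signal observed in this sample's called peaks.
    threshold = quantiles(detected_peak_signals, n=4)[0]
    region_mean = mean(region_coverage)
    std_err = stdev(region_coverage) / sqrt(len(region_coverage))
    if std_err == 0:
        # Constant coverage: fall back to a direct comparison.
        p_value = 0.0 if region_mean > threshold else 1.0
    else:
        z = (region_mean - threshold) / std_err
        p_value = 1.0 - NormalDist().cdf(z)
    return p_value, p_value < alpha


# Toy example: signal of peaks called in one replicate, and the coverage in
# two regions where this replicate had no called peak.
called_peak_signals = [10, 20, 30, 40, 50, 60, 70]
strong_region = [30, 32, 28, 31, 29]
weak_region = [5, 6, 4, 5, 5]
print(dabg_test(strong_region, called_peak_signals))
print(dabg_test(weak_region, called_peak_signals))
```

Coverage at neighbouring positions is correlated, so a test of this form will tend to overstate significance; summarising the region in non-overlapping windows before testing is one possible adjustment.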


Keywords: ChIP-seq, peak identification, biological replicates

Competing Interests: The authors have declared that no competing interests exist.

Yang et al. Licensee: Computational and Structural Biotechnology Journal. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly cited.


More information

Supplementary Data for DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding.

Supplementary Data for DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding. Supplementary Data for DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding. Wenxiu Ma 1, Lin Yang 2, Remo Rohs 2, and William Stafford Noble 3 1 Department of Statistics,

More information

Technical note: Molecular Index counting adjustment methods

Technical note: Molecular Index counting adjustment methods Technical note: Molecular Index counting adjustment methods By Jue Fan, Jennifer Tsai, Eleen Shum Introduction. Overview of BD Precise assays BD Precise assays are fast, high-throughput, next-generation

More information

Quantitation of mrna Using Real-Time Reverse Transcription PCR (RT-PCR)

Quantitation of mrna Using Real-Time Reverse Transcription PCR (RT-PCR) Quantitation of mrna Using Real-Time Reverse Transcription PCR (RT-PCR) Quantitative Real-Time RT-PCR Versus RT-PCR In Real-Time RT- PCR, DNA amplification monitored at each cycle but RT-PCR measures the

More information

Ensembl Funcgen: A Database and API for Epigenomics and Gene Regulation Data.

Ensembl Funcgen: A Database and API for Epigenomics and Gene Regulation Data. Ensembl Funcgen: A Database and API for Epigenomics and Gene Regulation Data. Nathan Johnson Ensembl Regulation EBI is an Outstation of the European Molecular Biology Laboratory.! Workshop Overview http://www.ebi.ac.uk/~njohnson/courses/23.05.2013-

More information

Analysis of a Tiling Regulation Study in Partek Genomics Suite 6.6

Analysis of a Tiling Regulation Study in Partek Genomics Suite 6.6 Analysis of a Tiling Regulation Study in Partek Genomics Suite 6.6 The example data set used in this tutorial consists of 6 technical replicates from the same human cell line, 3 are SP1 treated, and 3

More information

Do engineered ips and ES cells have similar molecular signatures?

Do engineered ips and ES cells have similar molecular signatures? Do engineered ips and ES cells have similar molecular signatures? Comparing expression and epigenetics in stem cells George Bell, Ph.D. Bioinformatics and Research Computing 2012 Spring Lecture Series

More information

ENCODE RBP Antibody Characterization Guidelines

ENCODE RBP Antibody Characterization Guidelines ENCODE RBP Antibody Characterization Guidelines Approved on November 18, 2016 Background An integral part of the ENCODE Project is to characterize the antibodies used in the experiments. This document

More information

Mate-pair library data improves genome assembly

Mate-pair library data improves genome assembly De Novo Sequencing on the Ion Torrent PGM APPLICATION NOTE Mate-pair library data improves genome assembly Highly accurate PGM data allows for de Novo Sequencing and Assembly For a draft assembly, generate

More information

Next-Generation Sequencing. Technologies

Next-Generation Sequencing. Technologies Next-Generation Next-Generation Sequencing Technologies Sequencing Technologies Nicholas E. Navin, Ph.D. MD Anderson Cancer Center Dept. Genetics Dept. Bioinformatics Introduction to Bioinformatics GS011062

More information

Differential Gene Expression

Differential Gene Expression Biology 4361 Developmental Biology Differential Gene Expression September 28, 2006 Chromatin Structure ~140 bp ~60 bp Transcriptional Regulation: 1. Packing prevents access CH 3 2. Acetylation ( C O )

More information

Incorporating Molecular ID Technology. Accel-NGS 2S MID Indexing Kits

Incorporating Molecular ID Technology. Accel-NGS 2S MID Indexing Kits Incorporating Molecular ID Technology Accel-NGS 2S MID Indexing Kits Molecular Identifiers (MIDs) MIDs are indices used to label unique library molecules MIDs can assess duplicate molecules in sequencing

More information

Lecture Four. Molecular Approaches I: Nucleic Acids

Lecture Four. Molecular Approaches I: Nucleic Acids Lecture Four. Molecular Approaches I: Nucleic Acids I. Recombinant DNA and Gene Cloning Recombinant DNA is DNA that has been created artificially. DNA from two or more sources is incorporated into a single

More information

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks Bayesian Variable Selection and Data Integration for Biological Regulatory Networks Shane T. Jensen Department of Statistics The Wharton School, University of Pennsylvania stjensen@wharton.upenn.edu Gary

More information

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Ruth Howe Bio 434W 27 February 2010 Abstract The fourth or dot chromosome of Drosophila species is composed primarily of highly condensed,

More information

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1 Human SNP haplotypes Statistics 246, Spring 2002 Week 15, Lecture 1 Human single nucleotide polymorphisms The majority of human sequence variation is due to substitutions that have occurred once in the

More information

Chapter 20: Biotechnology

Chapter 20: Biotechnology Name Period The AP Biology exam has reached into this chapter for essay questions on a regular basis over the past 15 years. Student responses show that biotechnology is a difficult topic. This chapter

More information

COMPAS for the Analysis of SELEX Experiments

COMPAS for the Analysis of SELEX Experiments COMPAS for the Analysis of SELEX Experiments COMPAS (COMmon PAtternS) is a software tool that was especially developed to harness the technology of next generation sequencing (NGS) to bring light into

More information

Chapter 18: Regulation of Gene Expression. 1. Gene Regulation in Bacteria 2. Gene Regulation in Eukaryotes 3. Gene Regulation & Cancer

Chapter 18: Regulation of Gene Expression. 1. Gene Regulation in Bacteria 2. Gene Regulation in Eukaryotes 3. Gene Regulation & Cancer Chapter 18: Regulation of Gene Expression 1. Gene Regulation in Bacteria 2. Gene Regulation in Eukaryotes 3. Gene Regulation & Cancer Gene Regulation Gene regulation refers to all aspects of controlling

More information

Long and short/small RNA-seq data analysis

Long and short/small RNA-seq data analysis Long and short/small RNA-seq data analysis GEF5, 4.9.2015 Sami Heikkinen, PhD, Dos. Topics 1. RNA-seq in a nutshell 2. Long vs short/small RNA-seq 3. Bioinformatic analysis work flows GEF5 / Heikkinen

More information

Analysis of Biological Sequences SPH

Analysis of Biological Sequences SPH Analysis of Biological Sequences SPH 140.638 swheelan@jhmi.edu nuts and bolts meet Tuesdays & Thursdays, 3:30-4:50 no exam; grade derived from 3-4 homework assignments plus a final project (open book,

More information

COMPUTER RESOURCES II:

COMPUTER RESOURCES II: COMPUTER RESOURCES II: Using the computer to analyze data, using the internet, and accessing online databases Bio 210, Fall 2006 Linda S. Huang, Ph.D. University of Massachusetts Boston In the first computer

More information

Lecture 7: April 7, 2005

Lecture 7: April 7, 2005 Analysis of Gene Expression Data Spring Semester, 2005 Lecture 7: April 7, 2005 Lecturer: R.Shamir and C.Linhart Scribe: A.Mosseri, E.Hirsh and Z.Bronstein 1 7.1 Promoter Analysis 7.1.1 Introduction to

More information

Year III Pharm.D Dr. V. Chitra

Year III Pharm.D Dr. V. Chitra Year III Pharm.D Dr. V. Chitra 1 Genome entire genetic material of an individual Transcriptome set of transcribed sequences Proteome set of proteins encoded by the genome 2 Only one strand of DNA serves

More information

FACTORS CONTRIBUTING TO VARIABILITY IN DNA MICROARRAY RESULTS: THE ABRF MICROARRAY RESEARCH GROUP 2002 STUDY

FACTORS CONTRIBUTING TO VARIABILITY IN DNA MICROARRAY RESULTS: THE ABRF MICROARRAY RESEARCH GROUP 2002 STUDY FACTORS CONTRIBUTING TO VARIABILITY IN DNA MICROARRAY RESULTS: THE ABRF MICROARRAY RESEARCH GROUP 2002 STUDY K. L. Knudtson 1, C. Griffin 2, A. I. Brooks 3, D. A. Iacobas 4, K. Johnson 5, G. Khitrov 6,

More information

Atelier Chip-Seq. Stéphanie Le Gras, IGBMC Strasbourg Violaine Saint-André, Institut Curie Paris Morgane Thomas-Chollier, ENS Paris

Atelier Chip-Seq. Stéphanie Le Gras, IGBMC Strasbourg Violaine Saint-André, Institut Curie Paris Morgane Thomas-Chollier, ENS Paris Atelier Chip-Seq Stéphanie Le Gras, IGBMC Strasbourg Violaine Saint-André, Institut Curie Paris Morgane Thomas-Chollier, ENS Paris École de bioinformatique AVIESAN-IFB 2017 Get connected to the server

More information

Systematic comparison of CRISPR/Cas9 and RNAi screens for essential genes

Systematic comparison of CRISPR/Cas9 and RNAi screens for essential genes CORRECTION NOTICE Nat. Biotechnol. doi:10.1038/nbt. 3567 Systematic comparison of CRISPR/Cas9 and RNAi screens for essential genes David W Morgens, Richard M Deans, Amy Li & Michael C Bassik In the version

More information

Introduction to Next Generation Sequencing (NGS) Data Analysis and Pathway Analysis. Jenny Wu

Introduction to Next Generation Sequencing (NGS) Data Analysis and Pathway Analysis. Jenny Wu Introduction to Next Generation Sequencing (NGS) Data Analysis and Pathway Analysis Jenny Wu Outline Introduction to NGS data analysis in Cancer Genomics NGS applications in cancer research Typical NGS

More information

Myers Lab ChIP-seq Protocol v Modified January 10, 2014

Myers Lab ChIP-seq Protocol v Modified January 10, 2014 Myers Lab ChIP-seq Protocol V011014 1 Contact information: Dr. Florencia Pauli Behn HudsonAlpha Institute for Biotechnology 601 Genome Way Huntsville, AL 35806 Telephone: 256-327-5229 Email: fpauli@hudsonalpha.org

More information