Green Center Computational Core
ChIP-Seq Pipeline, Just a Click Away
Venkat Malladi, Computational Biologist, Computational Core
Cecil H. and Ida Green Center for Reproductive Biology Sciences
Introduction to the Green Center
Basic research in female reproductive biology, with a focus on signaling, gene regulation, and genome function.
pregnancy, parturition, stem cells, oncology, inflammation
Key areas:
  Chromatin structure and gene regulation
  Epigenetics
  Nuclear endpoints of cellular signaling pathways
  Genome organization and evolution
  DNA replication and repair
Who is in the Green Center?
Associated with the Department of Obstetrics and Gynecology
Consists of:
  9 main faculty/labs
  20 associated faculty/labs
  Computational Core
W. Lee Kraus, Ph.D., Director of the Green Center
Role of the Computational Core
Consists of 4 Computational Biologists
Analysis of Genomic Sequencing Data
Responsibilities:
  Data quality assurance
  Perform basic analyses
  Work with investigators to perform integrative analyses
Green Center Computation Team: Anusha Nagari, Tulip Nandu, Venkat Malladi, Aishwarya Gogate
Challenge: Variety of Assays Supported?
ATAC-seq, RNA-seq, GRO-seq
Modified from PLoS Biol 9: e1001046, 2011 (M. Pazin)
What is ATAC-seq?
Assay for Transposase-Accessible Chromatin using Sequencing (ATAC-Seq): a genomic method that captures open chromatin sites.
Buenrostro et al. (2013) Nature Methods
What is RNA-Seq?
RNA Sequencing (RNA-Seq): measures the abundance of mature RNA species in the cell. These experiments contribute to the understanding of how RNA-based mechanisms impact gene regulation.
Types:
  Total RNA
  polyA mRNA (long and short)
  shRNA
  small RNA
  microRNA
  polyA-depleted RNA
What is GRO-Seq?
Global Run-On Sequencing (GRO-Seq): a genomic method that maps the position and orientation of all actively transcribing RNA polymerases. Transcription from all three RNA polymerases is captured, providing transcriptional profiles that include:
  protein-coding mRNA
  long non-coding RNAs (lncRNAs)
  enhancer RNAs (eRNAs)
  divergent transcription
  antisense transcription
  intergenic transcription
in both annotated and unannotated regions of the genome.
Figure labels: ERα Enhancer, Annotated, Intergenic, Divergent, Antisense, Other Genic
Hah et al. (2011) Cell
What is ChIP-Seq?
Chromatin Immunoprecipitation followed by Sequencing (ChIP-Seq): identifies the binding sites of chromatin-associated proteins.
Categories:
  Transcription factor ChIP-Seq: targets proteins that associate with specific DNA sequences to influence the rate of transcription
  Histone ChIP-Seq: measures the histone content of chromatin, specifically the incorporation of particular post-translational histone modifications
Park (2009) Nature Reviews
Considerations in Making a Pipeline
1. Who are the users?
2. Define what the pipeline should deliver
3. Identify all input and output files
4. What QA/QC metrics should be available for users?
5. Identify all software used in the pipeline
6. Break down the pipeline into discrete steps (based on deliverable files and metrics)
Users and Goals
Users:
  Wet lab scientists (Grad Students/Post Docs)
  Computational Biologists in the Green Center
Goals:
  Allow wet lab scientists to quickly assess the quality of their data and explore it
  Allow for easily reproducible analysis within the Green Center
Schema: ChIP-seq Pipeline
[Flow diagram]
Steps (tool): Quality (fastqc), Map (bowtie2), Remove Duplicates (picard), Cross-correlation, Call Peaks (macs2)
Files: FASTQ (SE/PE), BAM, tagAlign, bigWig, narrowPeak
Reported along the way: QA Metrics, Fragment size
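The tools named on the schema slide can be chained from a small driver script. The sketch below only builds the command lines and prints them; it does not run anything. It is not the Green Center pipeline itself: the file names, the `build_commands` helper, the `samtools sort` conversion step, the `picard` wrapper name, and the `-g hs` (human) genome size are all assumptions made for illustration, and it covers single-end reads only.

```python
import shlex
from typing import List


def build_commands(fastq: str, index: str, control_bam: str, name: str) -> List[List[str]]:
    """Return one argument list per pipeline step (hypothetical file layout)."""
    sam = f"{name}.sam"
    sorted_bam = f"{name}.sorted.bam"
    dedup_bam = f"{name}.dedup.bam"
    return [
        # Read-level quality report.
        ["fastqc", fastq],
        # Single-end alignment; bowtie2 writes SAM.
        ["bowtie2", "-x", index, "-U", fastq, "-S", sam],
        # Assumption: samtools is used to produce a coordinate-sorted BAM.
        ["samtools", "sort", "-o", sorted_bam, sam],
        # Assumption: a `picard` wrapper script is on PATH (legacy KEY=VALUE syntax).
        ["picard", "MarkDuplicates", f"I={sorted_bam}", f"O={dedup_bam}",
         f"M={name}.dup_metrics.txt", "REMOVE_DUPLICATES=true"],
        # Peak calling against a control; writes <name>_peaks.narrowPeak.
        ["macs2", "callpeak", "-t", dedup_bam, "-c", control_bam,
         "-f", "BAM", "-g", "hs", "-n", name],
    ]


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    for cmd in build_commands("chip.fastq.gz", "genome_index", "input.dedup.bam", "chip"):
        print(shlex.join(cmd))
```

Keeping each step as a separate argument list mirrors the slide's idea of breaking the pipeline into discrete steps with their own deliverable files.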
FASTQ: Quality Metrics
FastQC Report
Summary: Basic Statistics, Per base sequence quality, Per sequence quality scores, Per base sequence content, Per base GC content, Per sequence GC content, Per base N content, Sequence Length Distribution, Sequence Duplication Levels, Overrepresented sequences, Kmer Content
Basic Statistics
  Measure             Value
  Filename            HF_K9_GATCAG_L005_R1_001.fastq.gz
  File type           Conventional base calls
  Encoding            Sanger / Illumina 1.9
  Total Sequences     22571166
  Filtered Sequences  0
  Sequence length     50
  %GC                 42
Per Base Sequence Quality [plot]: Good quality calls, Reasonable quality calls, Poor quality calls
Alignment: Quality Metrics
FASTQ File: DNA sequence
Aligned File: DNA sequence + genomic localization
Alignment % = (No. of aligned reads / Total no. of raw reads) * 100
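The alignment percentage is a simple ratio, sketched below. The function name and the read counts are hypothetical and were chosen only to illustrate the arithmetic.

```python
def alignment_percentage(aligned_reads: int, raw_reads: int) -> float:
    """Return aligned reads as a percentage of the total raw reads."""
    if raw_reads <= 0:
        raise ValueError("raw_reads must be positive")
    if not 0 <= aligned_reads <= raw_reads:
        raise ValueError("aligned_reads must be between 0 and raw_reads")
    return aligned_reads / raw_reads * 100


if __name__ == "__main__":
    # Hypothetical counts, not taken from any real sample.
    print(f"{alignment_percentage(9_630_000, 10_000_000):.2f}%")
```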
Uniquely Mapped Reads: Quality Metrics
Depth: number of uniquely mapping reads
Library Complexity:
  Non-Redundant Fraction (NRF): number of distinct uniquely mapping reads (i.e. after removing duplicates) / total number of reads
  PCR Bottlenecking Coefficient 1 (PBC1): PBC1 = M1 / M_DISTINCT, where
    M1: number of genomic locations where exactly one read maps uniquely
    M_DISTINCT: number of distinct genomic locations to which some read maps uniquely
  PCR Bottlenecking Coefficient 2 (PBC2): PBC2 = M1 / M2, where
    M1: number of genomic locations where exactly one read maps uniquely
    M2: number of genomic locations where two reads map uniquely
ENCODE Standards: https://www.encodeproject.org/data-standards/chip-seq/
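The three library-complexity ratios can be computed by counting how many reads land on each genomic location. This is a minimal sketch, assuming each uniquely mapped read has already been reduced to a (chromosome, start, strand) tuple and that the NRF denominator is the number of reads passed in. The `library_complexity` name and the toy positions are hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple

Position = Tuple[str, int, str]  # (chromosome, start, strand)


class Complexity(NamedTuple):
    nrf: float
    pbc1: float
    pbc2: float


def library_complexity(positions: Iterable[Position]) -> Complexity:
    """Compute NRF, PBC1 and PBC2 from uniquely mapped read positions."""
    counts = Counter(positions)          # reads per genomic location
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("no reads supplied")
    m_distinct = len(counts)                              # locations with >= 1 read
    m1 = sum(1 for c in counts.values() if c == 1)        # exactly one read
    m2 = sum(1 for c in counts.values() if c == 2)        # exactly two reads
    nrf = m_distinct / total_reads
    pbc1 = m1 / m_distinct
    pbc2 = m1 / m2 if m2 else float("inf")  # undefined when no location has two reads
    return Complexity(nrf, pbc1, pbc2)


if __name__ == "__main__":
    # Toy data: 8 reads over 5 locations (3 singletons, 1 pair, 1 triple).
    reads = (
        [("chr1", 100, "+")]
        + [("chr1", 200, "+")] * 2
        + [("chr1", 300, "-")]
        + [("chr2", 50, "+")] * 3
        + [("chr2", 75, "-")]
    )
    print(library_complexity(reads))
```

For the toy data, M_DISTINCT is 5, M1 is 3 and M2 is 1, so NRF = 5/8, PBC1 = 3/5 and PBC2 = 3/1.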
Uniquely Mapped Reads: Quality Metrics (cont.)
NRF Guidelines, PBC1 Guidelines, PBC2 Guidelines
ENCODE Standards: https://www.encodeproject.org/data-standards/chip-seq/
Alignment: Quality Metrics Report
  Sample Information     Raw reads    Alignment %
  Control Replicate 1    28,259,069   96.30%
  Control Replicate 2    28,892,302   96.00%
  Sample 2 Replicate 1   23,239,486   96.10%
  Sample 2 Replicate 2   25,637,094   96.90%
  Sample 3 Replicate 1   22,713,054   96.60%
  Sample 3 Replicate 2   20,419,272   95.90%
  Sample 4 Replicate 1   22,617,154   96.60%
  Sample 4 Replicate 2   20,068,460   96.00%
Cross-correlation: Quality Metrics Report
Sample 1: R=0.99
Sample 2: R=0.99
R: Pearson correlation coefficient
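The Pearson coefficient R reported on this slide can be computed from two equal-length signal vectors. The sketch below assumes the vectors are per-bin read counts from two replicates. That interpretation, the `pearson_r` name, and the toy counts are assumptions, not details given in the slides.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Return the Pearson correlation coefficient between x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered products and squares (the 1/n factors cancel in the ratio).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-bin read counts for two replicates.
    replicate_1 = [12, 30, 7, 55, 21]
    replicate_2 = [10, 33, 9, 50, 25]
    print(f"R = {pearson_r(replicate_1, replicate_2):.2f}")
```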
Call Peaks: Quality Metrics Report
1. Peak calls for individual replicates
2. Overlapping peaks between the pooled pseudoreplicates
3. Bigwig files (UCSC Genome Browser, IGV)
Call Peaks: Quality Metrics Report
Visualizing signal tracks (Bigwig files) in UCSC Genome Browser
Franco et al. (2015)
Working With BioHPC and Astrocyte
Creating a Project
Create New Project to run analysis
Adding Data Select Add Data to this Project...
ChIP-Seq Workflow
  ChIP-Input fastq files
  Sequence format
  ChIP TF or Histone fastq files
  Assembly
Run Time of ChIP-Seq Pipeline
Thank you! Questions?