Biologists on the cloud

Similar documents
Galaxy Platform For NGS Data Analyses

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Applications of short-read

earray 5.0 Create your own Custom Microarray Design

Sequencing applications. Today's outline. Hands-on exercises. Applications of short-read sequencing: RNA-Seq and ChIP-Seq

Galaxy for Next Generation Sequencing 初探次世代序列分析平台 蘇聖堯 2013/9/12

Deep Sequencing technologies

RNAseq Differential Gene Expression Analysis Report

TREE CODE PRODUCT BROCHURE

NGS Data Analysis and Galaxy

Course Presentation. Ignacio Medina Presentation

Green Center Computational Core ChIP- Seq Pipeline, Just a Click Away

Agilent Genomic Workbench 7.0

Parts of a standard FastQC report

Measuring and Understanding Gene Expression

Gene expression analysis. Biosciences 741: Genomics Fall, 2013 Week 5. Gene expression analysis

RNA-Seq Analysis. Simon Andrews, Laura v

Next Generation Sequencing

RNA-Seq Software, Tools, and Workflows

Lecture 7. Next-generation sequencing technologies

Introduction to DNA-Sequencing

ChIP-seq analysis 2/28/2018

TECH NOTE Ligation-Free ChIP-Seq Library Preparation

ChIP-seq and RNA-seq

Discovering gene regulatory control using ChIP-chip and ChIP-seq. Part 1. An introduction to gene regulatory control, concepts and methodologies

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

Introduction to human genomics and genome informatics

ChIP-seq and RNA-seq. Farhat Habib

NGS Approaches to Epigenomics

Introduction to genome biology

ChIP-seq data analysis with Chipster. Eija Korpelainen CSC IT Center for Science, Finland

Bioinformatics- Data Analysis

Introduction to RNAseq Analysis. Milena Kraus Apr 18, 2016

Introduction to genome biology

Globus Genomics Powered by Galaxy. Ravi K Madduri and Globus Genomics Team

Discovering gene regulatory control using ChIP-chip and ChIP-seq. An introduction to gene regulatory control, concepts and methodologies

Methods of Biomaterials Testing Lesson 3-5. Biochemical Methods - Molecular Biology -

G E N OM I C S S E RV I C ES

Quality Control of Next Generation Sequence Data

Bioinformatics Monthly Workshop Series. Speaker: Fan Gao, Ph.D Bioinformatics Resource Office The Picower Institute for Learning and Memory

Transcriptome analysis

The first and only fully-integrated microarray instrument for hands-free array processing

Genome Sequence Assembly

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to Bioinformatics and Gene Expression Technologies

Benchmarking of RNA-seq data processing pipelines using whole transcriptome qpcr expression data

Illumina s Suite of Targeted Resequencing Solutions

PRESENTING SEQUENCES 5 GAATGCGGCTTAGACTGGTACGATGGAAC 3 3 CTTACGCCGAATCTGACCATGCTACCTTG 5

Novel methods for RNA and DNA- Seq analysis using SMART Technology. Andrew Farmer, D. Phil. Vice President, R&D Clontech Laboratories, Inc.

Introduction into single-cell RNA-seq. Kersti Jääger 19/02/2014

Satellite Education Workshop (SW4): Epigenomics: Design, Implementation and Analysis for RNA-seq and Methyl-seq Experiments

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science

Sanger vs Next-Gen Sequencing

WELCOME. Norma J. Nowak, PhD Executive Director, NY State Center of Excellence in Bioinformatics and Life Sciences (CBLS)

LARGE DATA AND BIOMEDICAL COMPUTATIONAL PIPELINES FOR COMPLEX DISEASES

NEXT GENERATION SEQUENCING. Farhat Habib

TECH NOTE Stranded NGS libraries from FFPE samples

Exercises: Analysing ChIP-Seq data

Integrated NGS Sample Preparation Solutions for Limiting Amounts of RNA and DNA. March 2, Steven R. Kain, Ph.D. ABRF 2013

454 Sample Prep / Workflow at the BioMedical Genomics Center (BMGC) University of Minnesota. Sushmita Singh

Introduction to RNA-Seq in GeneSpring NGS Software

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ),

RNA-Sequencing analysis

Smart India Hackathon

Browser Exercises - I. Alignments and Comparative genomics

Next Genera*on Sequencing So2ware for Data Management, Analysis, and Visualiza*on. Session W14

Welcome to the NGS webinar series

Next Gen Sequencing. Expansion of sequencing technology. Contents

Globus Genomics at GSI Boston University. Dinanath Sulakhe, Alex Rodriguez

resequencing storage SNP ncrna metagenomics private trio de novo exome ncrna RNA DNA bioinformatics RNA-seq comparative genomics

Wet-lab Considerations for Illumina data analysis

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1

Computational Challenges in Life Sciences Research Infrastructures

Aim of lecture:to get an overview of the whole process of microarrays, from study design to publication

Introduction to BIOINFORMATICS

RNA-Seq data analysis course September 7-9, 2015

CS273B: Deep learning for Genomics and Biomedicine

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Supplemental Figure 1 A

SMRT Analysis Barcoding Overview (v6.0.0)

Introduction to NGS analyses

What is Bioinformatics?

Introductory Next Gen Workshop

ChIP-seq/Functional Genomics/Epigenomics. CBSU/3CPG/CVG Next-Gen Sequencing Workshop. Josh Waterfall. March 31, 2010

NGS part 2: applications. Tobias Österlund

Week 1 BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Supplementary Information for:

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

Introduction to transcriptome analysis using High Throughput Sequencing technologies. D. Puthier 2012

Transcriptome Assembly, Functional Annotation (and a few other related thoughts)

Exploring and Understanding ChIP-Seq data. Simon v

solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome

High Throughput Sequencing the Multi-Tool of Life Sciences. Lutz Froenicke DNA Technologies and Expression Analysis Cores UCD Genome Center

Charles Girardot, Furlong Lab. MACS, CisGenome, SISSRs and other peak calling algorithms: differences and practical use

Introduction to Bioinformatics and Gene Expression Technology

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Quality assurance in NGS (diagnostics)

RNA Seq: Methods and Applica6ons. Prat Thiru

Illumina Sequencing Error Profiles and Quality Control

Outline. Array platform considerations: Comparison between the technologies available in microarrays

Transcription:

Biologists on the cloud our experiences using galaxy for next-gen sequencing analyses Karen Reddy, Johns Hopkins University School of Medicine Mo Heydarian, Johns Hopkins University School of Medicine

Warning!!! WE ARE END USERS Blinking cursors on a black screen make us break out in a cold sweat (i.e. anything remotely resembling command line) PuOy, SOAP, keys etc may mean something different to us

Who we are: Interested in organization of the nucleus with respect to gene regulation Epigenetics, transcription, cell biology, structure

Proteins, DNA and RNA are compartmentalized Nuclear domains Spector, Journal Cell Science

Who we are: Interested in organization of the nucleus with respect to gene regulation Epigenetics, transcription, cell biology, structure Molecular approaches ChIP-seq Hi-C DamID/DamID-seq RNA-seq SmallRNA-seq Proteomics Cellular approaches GALAXY Immuno-DNA/RNA- FISH ValidaVon, cellular and funcvonal Assays Immunofluorescence Phenotype/ developmental assays

We need to be able to make sense of our billions of NGS reads Why Galaxy? We need to perform the analysis Lack of bioinformatician availability We need to understand what is being done to our data Create workflows and infrastructure for the lab (and the community) Additional and portable skill set for trainees

Analysis opvons using Galaxy On a cluster/local instance Resources Expertise ($$$) Space ($$$) Access Command line... Galaxy Cloudman we do not have to maintain!! Portable pre-packaged

AddiVonal Benefits Using Galaxy Cloudman User community (http://wiki.g2.bx.psu.edu/learn/screencasts) Graphical user interface Intuitive packages Alleviates dependency on external resources Scalability perfect for intermittent need and smaller laboratories Extensively vetted tools avaialble

Two examples

DamID- seq Adapted protocol No analysis pipeline existed We consulted with a bioinformatician on the broad strokes A workflow was created using common straightforward Galaxy tools The resultant data was confirmed by independent DamID array hybridization Traditional peak-calling does not work as these regions are very broad domains in the genome

Strategy: DamID Greil, Moorman and van Steensel 2006

DNA Adenine Methylase IdenVficaVon DamID

DamID- Seq : NOT DRAWN TO SCALE! Lamina Associated Domains (LADs) should be much longer than adapters! (LADs ~ 200 4000 bp; Adapter 15 bp) RandomisaVon: End Repair and ligate SonicaVon Library Prep: End Repair, A tailing, adapter ligavon, PCR - > cluster generavon and sequencing

Galaxy Workflow Quality trimming ends of reads (FASTQ); sliding window of size 3; Threshold: mean of scores 30. Replace DamID adapter or primer sequences by delimiters. Bin 1: Reads with DamID primer (AdrPCR) consistent with DamID protocol (ie GATC regenerated) Bin 2: Reads with DamID primer (AdrPCR) not consistent with DamID protocol (Primer dimers, genomic sequences with AdrPCR) Bin 3: Truncated DamID primers at the ends of reads AND Reads without adapter or primer sequences Replace delimiters by tab to retrieve flanking sequences For sequences without any indicavon of primer dimers (ie only one AdrPCR delimiter is observed without adjacent delimiters), replace the delimiter assigned for AdrPCR back with its respecvve sequence to regenerate genomic sequences that might contain the AdrPCR sequence Concatenate files end to start to combine reads Convert to FASTA and Filter by length > 25. 1 st BowVe; mm9 (default selngs)

Filter Sam to retrieve mapped and unmapped reads Mapped Reads; Convert to intervals Unmapped Reads (might contain truncated primers at either end); Trim 13 bases off 5 end of reads; filter by length > 25 2 nd BowVe; mm9 (Default Selngs) Filter Sam to retrieve mapped and unmapped reads Unmapped Reads (might contain truncated primers at 3 end); Mapped Reads; Convert to intervals Trim 12 bases off 5 end of reads; filter by length > 25 3rd BowVe; mm9 (Default Selngs) *Concatenate files start to end Mm9 genome is binned by DpnI sites and an interval file is generated. Count number of Vmes a genomic fragment that is flanked by DpnI sites has a mapped read that overlaps with it to obtain fragment scores Filter Sam to retrieve mapped reads Mapped Reads; Convert to intervals Normalise fragment score by frgment size and #lines from * to obtain normalised fragment scores. For each interval, normalise the fragment score of DamX to Dam and take the log2 ravo.

VerificaVon of workflow/analysis by comparing DamID- seq with DamID- Vled array

We want to do more of the fun stuff with bioinformavcians like developing algorithms to further analyze our data

Probes were made to LADed regions (red, Cy3) and unladed regions (cyan, Cy5) CollaboraVon with Agilent

RNA- seq We performed directional RNA-seq on mouse primary lymphoid cell lines Aimed to profile gene expression and discover ncrnas With assistance from the Galaxy community and the literature we were able to perform the analysis on our own using the intuitive Galaxy Cloudman https://main.g2.bx.psu.edu/u/jeremy/p/galaxy-rna-seq-analysis-exercise Trapnell, C., et. al.. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protocols. March 1 2012.

On- line tutorials, guides and workflows invaluable

VisualizaVon- UCSC

VisualizaVon- UCSC TSS EpigeneVc modificavons associated with acvve transcripvon

The ulvmate visualizavon (for us)

Issues encountered on the cloud Institutional issues IT and bioinformaticians not fluent in cloud based computing University does not have clear policies about data on the cloud (especially important for clinical data) Not many issues with using we have been pretty happy HOWEVER We have encountered two major issues:

Issue 1 (which led to issue 2) Judging capacity planning ( hop://wiki.g2.bx.psu.edu/cloudman/capacityplanning) We use Brutus configuravon now: head=high Memory XL Worker(s)=High Memory 2XL

Issue 2: Keys, keys and more keys ConnecVng EBS to EC2 drives

Keys, keys and more keys What key(s) do we need for what? A bit of difficulty in biologists communicavng with technical support (which key??)

Lessons learned and words of advice Better communication between tech support and the end user would make life easier Use the tutorial/wiki (http://wiki.g2.bx.psu.edu/cloudman) Capacity planning. Go big. We like using at least Hi-Mem Double XL workers Keep small head node up and running (EC2) and shunt all jobs to workers this will avoid mounting issues and data loss.

Acknowledgements: Mo Heydarian- RNA seq, lncrna, proteome Xianrong Jose Wong- nuclear structure in aging Teresa Romeo DamID mapping and nuclear structure in development Agilent (Vled OLIDS- chromosome paints) Alice Yamada and team GALAXY TEAM AND COMMUNITY!!!!!!!