Analysis of NGS data using MetaCentrum VO resources. Grid computing workshop 2015. Jan Oppelt, NCBR & CEITEC MU, 1st December 2015


1 Analysis of NGS data using MetaCentrum VO resources
Grid computing workshop 2015
Jan Oppelt, NCBR & CEITEC MU
1st December 2015

2 Basic introduction 12/1/2015 Jan Oppelt, NCBR & CEITEC MU 2

3 Introduction
- Jan Oppelt
- National Centre for Biomolecular Research, Laboratory of Computational Chemistry (NCBR LCC), Masaryk University, Brno
- CEITEC MU, Brno
- Sequencing bioinformatics
- Analysis of next-generation sequencing (NGS) data: gene expression (RNA-Seq, sRNA-Seq), transcriptome assembly, genome assembly, SNP analysis, ...

4 Main topic
- Next-generation sequencing (massively parallel sequencing) experiments generate large volumes of data
- This requires a computationally powerful system for data storage, management and processing
- Not everybody can afford to acquire sufficient computational resources to handle the analysis locally
- The availability of MetaCentrum VO and its resources makes NGS analysis possible for everybody
- This speeds up scientific progress in the field of bioinformatics

5 Problems to be handled
- Volume of input data: NGS creates a high volume of raw data/files (tens of GB)
- Number of inputs: you can get up to hundreds of samples to analyze at once
- Speed of processing: some applications need to be processed in a very short time
- Computational demands: some applications need tens or hundreds of GB of RAM
- Parallel analyses: it is often necessary to test several settings to find the optimal analysis
- Long-term data storage: some samples need to be stored for a long time so that they can be reanalyzed in the future

6 Case studies

7 Case study I: de novo genome assembly
- Nowadays, whole-genome sequencing is a common use of NGS
- We can sequence an unknown genome and determine its sequence from scratch
- Study of evolution, gene analysis, hereditary connections, characterization of an organism, ...
- For a basic de novo genome assembly we need to sequence the genome not once, but approx x! (and with a bit of luck)
- This creates a large volume of data which needs to be processed
- The assembly itself is also computationally very demanding
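How many times a genome has been sequenced is usually expressed as average coverage (depth): the total number of sequenced bases divided by the genome length. The sketch below only illustrates that arithmetic. The function name and all numbers are assumptions chosen for the example and are not taken from the slides.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average sequencing depth: total sequenced bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Illustrative values only (assumed, not from the slides):
# 1 million reads of 100 bp over a 5 Mbp genome.
depth = coverage(num_reads=1_000_000, read_length=100, genome_size=5_000_000)
print(f"average coverage: {depth:.1f}x")
```

The same formula can be turned around to estimate how many reads are needed for a target coverage before an experiment is planned.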

8 Case study I: de novo assembly principle
- A genome assembly is an attempt to accurately represent an entire genome sequence from a large set of very short DNA sequences
- Intuitively, it is like a jigsaw puzzle
- Find reads which fit together (overlap)
- There could be missing pieces (sequencing bias)
- Some pieces will be dirty (sequencing errors)
- Large set of very short DNA sequences = 2x180M reads with length of bp per sample, which need to be put together
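The jigsaw idea (find reads whose ends overlap and join them) can be sketched as a toy greedy overlap merge. This is only an illustration of the principle under simplifying assumptions: error-free reads, a single strand, no repeats, no reads contained inside other reads. It is not how SOAPdenovo2 or ABySS work internally, and the function names are hypothetical.

```python
from __future__ import annotations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two reads with the longest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing overlaps any more
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three made-up reads that tile a 13 bp sequence.
print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGGAT"]))
```

The all-against-all comparison is quadratic in the number of reads, which hints at why assembling hundreds of millions of reads needs specialised tools and a lot of memory.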

9 Case study I: camel genome(s) assembly
- University of Veterinary and Pharmaceutical Sciences Brno (prof. Petr Hořín) and Vetmeduni Vienna (Pamela A. Burger)
- Aim: de novo assembly of camel genomes (Vienna)
- Secondary aim: analysis of the major histocompatibility complex (MHC) region, which contains many immune-response-related genes (me)
- 3 camel breeds, all sequenced 3 times (paired-end (PE) reads, 100 bp)
- Input amount of data = 769 GB!

10 Case study I: data processing
1. Quality check
2. Pre-processing
3. Read correction (SOAPcorrection, QUAKE)
4. De novo assembly 1 (SOAPdenovo2)
5. De novo assembly 2 (ABySS)
6. Evaluation of the assembly
7. Identification of a specific region
8. Genomes comparison

11 Case study I: required resources
- Each assembly requires:
  - RAM: at least 512 GB
  - HDD: at least 500 GB of immediate storage
  - Cores: at least 32
  - Time: 2-6 days
- Advantage: the assemblies can be run in parallel
- Analysis of all samples done in ~10 days, including all steps
- Possible only thanks to MetaCentrum VO resources
- Improved assembly, but the MHC was not captured: the sequencing technology was not optimal

12 Case study II: RNA-Seq
- Classic use of NGS, one of the most common applications
- Analysis of gene expression
- The goal is to determine gene expression levels and find the differences between groups
- Healthy patients vs. diseased patients = differences in gene expression -> identification of genes responsible for the disease
- Speed and the number of samples are the issues here

13 Case study II: RNA-Seq principle
- Two groups of patients: before and after treatment, healthy and ill, ...
- Extract RNA from the samples = expression of all genes (in theory)
- Cut it into small pieces
- Sequence the pieces
- Find the origin of each piece
- Count how many times each gene was there
- Compare the groups
- 12 samples of 50M reads each to be analyzed (quickly)
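The last two steps of the principle (count how often each gene was seen, then compare the groups) can be sketched in a few lines. This is a deliberately naive illustration: the gene names and counts are made up, there is no library-size normalization and no statistical test, and a pseudocount is assumed to avoid division by zero. Dedicated packages should be used for a real analysis.

```python
from collections import Counter
from math import log2


def count_genes(read_assignments):
    """Count how many reads were assigned to each gene in one sample."""
    return Counter(read_assignments)


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """Per-gene log2 ratio of mean counts, group_b relative to group_a.

    Each group is a non-empty list of per-sample Counters.
    """
    genes = set().union(*group_a, *group_b)
    result = {}
    for gene in sorted(genes):
        mean_a = sum(sample[gene] for sample in group_a) / len(group_a)
        mean_b = sum(sample[gene] for sample in group_b) / len(group_b)
        result[gene] = log2((mean_b + pseudocount) / (mean_a + pseudocount))
    return result


# Made-up read-to-gene assignments for two samples per group.
control = [count_genes(["geneA"] * 3 + ["geneB"]),
           count_genes(["geneA"] * 3 + ["geneB"])]
treated = [count_genes(["geneA"] * 7 + ["geneB"]),
           count_genes(["geneA"] * 7 + ["geneB"])]

for gene, lfc in log2_fold_change(control, treated).items():
    print(gene, round(lfc, 2))
```

A positive value means the gene was seen more often in the second group, a value near zero means no change.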

14 Case study II: liver removal response
- Faculty Hospital Ostrava, Faculty of Medicine + Faculty of Science, University of Ostrava (assoc. prof. Petr Pečinka)
- Aim: to compare two groups after liver removal
  - One without any treatment
  - One with stem-cell application
- Identification of genes with changed expression, to find genes that could be used to determine the state of recovery
- 2 groups of 6 samples (paired-end (PE) reads, 150 bp)
- Input amount of data = 250 GB

15 Case study II: data processing
1. Quality check
2. Pre-processing (cutadapt, Trimmomatic)
3. Alignment (STAR)
4. Quality check
5. Post-processing (SAMtools, bamutils, Picard tools)
6. Visualization
7. Counting (featureCounts, htseq-count)
8. Statistical analysis (R: edgeR, DESeq2, baySeq)

16 Case study II: required resources
- Each analysis requires:
  - RAM: at least 40 GB
  - HDD: at least 80 GB of immediate storage
  - Cores: at least 8
  - Time: 1-2 days
- Advantage: run it in parallel for many samples
- Analysis of the samples done in 2 days
- Possible only thanks to MetaCentrum VO resources, mainly the speed and parallel processing
- Results are now being analyzed by wet-lab scientists, with very promising results
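Why parallel processing matters here can be shown with simple arithmetic. The sketch below assumes the upper bound of 2 days per sample and the 12 samples mentioned for this case study, and it assumes that every sample gets its own slot with the required resources at the same time. The function name and the idea of a fixed number of "slots" are illustrative assumptions, not a description of how MetaCentrum schedules jobs.

```python
from math import ceil


def wall_time_days(n_samples: int, days_per_sample: float, parallel_slots: int) -> float:
    """Wall-clock time when samples are processed in batches of `parallel_slots`."""
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    return ceil(n_samples / parallel_slots) * days_per_sample


# 12 samples, up to 2 days each (upper bound from the slide).
print("one sample at a time:", wall_time_days(12, 2, 1), "days")
print("all samples at once: ", wall_time_days(12, 2, 12), "days")
```

With a single machine that fits only one sample at a time the same analysis would take several weeks instead of 2 days.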

17 User(!) comments to MetaCentrum

18 My comments to MetaCentrum
- I am not a hard-core computer scientist
- Most of the things I comment on have their reasons; I just don't know them
- I report them purely from a user's point of view
- I didn't actively try to solve the issues or contact anybody
- I kindly ask MetaCentrum people not to stone me

19 Issues
- Not so easy to install tools just for testing: some tools are difficult to install into a local folder
- Tools using Python/Perl (including modules): different clusters have different Python/Perl versions that might not be compatible
- Requested space on SCRATCH is confirmed but not always available: crashing jobs
- No intermediate logs to look at: they are created only at the end of the job
- Parts of the MetaCentrum Wiki contain the same information on various pages but with slight differences. Is it important? An inexperienced user cannot know

20 Thank you for your attention
I welcome any questions and/or comments
Jan Oppelt
NCBR & CEITEC MU
jan.oppelt@mail.muni.cz


23 [Figure: RNA-Seq workflow. RNA from sample(s), mRNA, fragmentation, sequencing, mapping to genome, statistical analysis, results and interpretation]