Tools and Algorithms in Bioinformatics
  GCBA815/MCGB815/BMI815, Fall 2017
  Week-8: Next-Gen Sequencing
  RNA-seq Data Analysis
  Babu Guda, Ph.D.
  Professor, Genetics, Cell Biology & Anatomy
  Director, Bioinformatics and Systems Biology Core
  University of Nebraska Medical Center
Sanger vs Next-Gen Sequencing
  Source: https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&ved=0ahukewj356gajzwahxezfqkhzrlch0qjrwibw&url=http%3a%2f%2fslideplayer.com%2fslide%2f11674461%2f&psig=aovvaw3bhydsg4jhy9z4y3jc11iy&ust=1507933065294289
Next-Gen Sequencing
  No in vivo cloning
  Source: https://bloggenohub.files.wordpress.com/2015/01/slide1.jpg
Cost of Human Genome Sequencing
  Source: http://blog.dnanexus.com/wp-content/uploads/2017/04/screen-shot-2017-04-24-at-11.40.38-am.png
Next-Gen Sequencing Workflow
  Source: Lu and Shen, 2016, Biochemistry, Genetics and Molecular Biology. DOI: 10.5772/61657
Applications of NGS
  Genome
    - Whole genome sequencing
    - Whole exome sequencing
    - Targeted gene panels (cancer, newborns, autism, etc.)
  Transcriptome
    - Whole RNA sequencing
    - mRNA transcriptome (poly-A selection)
    - Small RNA analysis (siRNA, snoRNA, lincRNA, etc.)
    - Gene expression profiling for selected target genes
  Metagenome
    - Bulk sequencing of many types of bacteria
    - Examples: human gut microbiome, soil samples, food contamination, extremophiles, etc.
  Epigenome
    - Chromatin Immunoprecipitation Sequencing (ChIP-Seq)
    - Methylation Sequencing (Methyl-Seq)
Different Sequencing Libraries
  Source: http://slideplayer.com/7847747/25/images/7/types+of+sequencing+libraries.jpg
Paired-end Sequencing
  Source: https://assets.illumina.com/content/dam/illumina-marketing/images/science/v2/web-graphic/paired-end-vs-singleread-seq-web-graphic.jpg
FASTQ Files from Paired-end Sequencing
  Source: https://bioinf-galaxian.erasmusmc.nl/galaxy/
Demultiplexing Mixed Samples
  Source: https://www.illumina.com/content/dam/illumina-marketing/images/technology/multiplexing-overview-figure.gif
Different File Types in NGS Analysis
  FASTQ file: generated by the sequencer; contains the NGS reads
  SAM file: Sequence Alignment/Map (generated by aligning the NGS reads to the reference genome)
  BAM file: binary version of the SAM file (SAMtools is used to manipulate SAM/BAM files)
  GFF file: General Feature Format, used to hold genome annotation (chromosome, strand, frame, exon, CDS, etc.)
  GTF file: Gene Transfer Format (contains all the information in GFF plus additional gene annotation information)
  VCF file: Variant Call Format (used to store variant data such as SNPs, InDels, and short structural rearrangements)
FASTQ
  @SRR098401.11403008/1
  GAGGCTATAGCATGGTCAAGGCACAAGAAGATCACTGGACTGCCCTCGCTCAGCCCTCAGCTACTG
  +
  >>?>?@>?>@@>?@@=@@@@@??>??@??@?@A?>@@@?>@@???A@:@A@@A@@@A@@AAB@@BB
  Row 1: Information from the sequencer about the location of this read on the plate (the read identifier, starting with @)
  Row 2: The sequence
  Row 3: Separator line starting with +, optionally followed by the read identifier or other metadata provided by the sequencing team
  Row 4: Quality scores, one character per nucleotide in the sequence
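The four-row FASTQ layout described above can be read with a few lines of Python. This is a minimal sketch, assuming an uncompressed file in which every record occupies exactly four lines (no wrapped sequences). The names `FastqRecord` and `read_fastq` and the two toy reads are made up for illustration and are not part of the course material.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str   # row 1 without the leading '@'
    sequence: str  # row 2
    comment: str   # row 3 without the leading '+'
    quality: str   # row 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or a blank line) stops the iteration
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, plus[1:], quality)


# Toy two-record input (hypothetical reads, not real data).
toy = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCA\n+\n!!5I\n")
records = list(read_fastq(toy))
for rec in records:
    print(rec.read_id, rec.sequence, rec.quality)
```

For a real file, the same function can be given an open file handle, for example `open("sample_R1.fastq")` (a hypothetical file name). Production pipelines usually rely on an established parser instead of hand-written code.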
FASTQ Format
  FASTQ is based on the popular FASTA format for sequences.
  FASTA format
    >sequence_id; header in one line
    AGTTGTAGTCCGTGATAGTCGGATCGG
  FASTQ format provides additional information that includes the quality score
    @20FUKAAXX100202:1:64:10634:114560/1
    TTGTATTTTTAGTAGAGACGGAGTTTCGCCATGTTGGTCAGGCTGGCCTCGAATTCCTGACCTCAAGTGATCCGCCCGCCTCGGCCTCCCAACGTTTTGG
    +
    ?=@7=>B==;;BB?<B?=8539<6?6>8>=BB<<B=08:9@5;:A@@?@9:BAAA<?;8;@AC@BBBBBA?<9-@B@;CAA77<:BEB<BB@07?@=<?84
  The quality score (Phred score, ranges from 0-50) is encoded as ASCII characters.
  ASCII codes for the quality score in increasing order: ! is the worst and ~ is the best.
Sequence Alignment / Map (SAM / BAM)
  SRR098401.104031357 83 chr22 17445857 60 76M = 17445512 -421 ACTGTTACCAGATCAAGAACTGATAGGGACAGGGATCATTATTCCCCCTTTACAGATGAGAAGGCCGTCACGCCTC @@>>B@@@BBAAAB9A@@>:@@?=A@?@?@A???>?@??=???@@@@@>@>>@@@><??@>@>@@8?>?=:@>?>> BD:Z:NOJKPQQQQMONOMKKKLNOMNLLLJLMINLJLMLMLKKKKJLJJJMKCKLINJMMLJKKKMOOMNNOLPQSNMKK PG:Z:MarkDuplicates RG:Z:NA12878 BI:Z:OOMLRRPPRPPQQONOLOPOONOOOKLNMONJKMNONMMMMLMKKKMLGMNLNMMNNJMJLNOMLNMPNONONNMM NM:i:0 MQ:i:60 AS:i:76 XS:i:0
  Similar to the FASTQ file in that it contains the raw sequence and its quality scores.
  It also tells you where the sequence aligned to the genome, and how well (this score is also Phred-scaled).
  In this case, the read aligned to chromosome 22, position 17445857, and has a mapping quality score of 60 (a 1 in 1,000,000 chance of being placed incorrectly).
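The SAM record above can be taken apart programmatically. The sketch below assumes the standard SAM layout of 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional TAG:TYPE:VALUE fields. The function names and the shortened toy record are illustrative assumptions; the toy record borrows the chromosome, position, flag, and mapping quality from the slide but uses a made-up 4-base read.

```python
from typing import Dict, List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Dict[str, str]


# Meaning of each FLAG bit, in ascending bit order.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped",
    0x8: "mate_unmapped", 0x10: "reverse_strand",
    0x20: "mate_reverse_strand", 0x40: "first_in_pair",
    0x80: "second_in_pair", 0x100: "secondary",
    0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into its mandatory fields and optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], tags)


def decode_flag(flag: int) -> List[str]:
    """List the names of all bits that are set in the FLAG field."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def mapq_to_error_probability(mapq: int) -> float:
    """Phred-scaled mapping quality -> probability the placement is wrong."""
    return 10 ** (-mapq / 10)


# Toy record (hypothetical, shortened to a 4-base read).
toy_line = ("read1\t83\tchr22\t17445857\t60\t4M\t=\t17445512\t-349\t"
            "ACTG\t@@>>\tNM:i:0\tRG:Z:NA12878")
record = parse_sam_line(toy_line)
print(record.rname, record.pos, decode_flag(record.flag))
print(mapq_to_error_probability(record.mapq))
```

A flag of 83 decodes to a paired read, mapped in a proper pair, on the reverse strand, first in its pair. For BAM files or large datasets, SAMtools or a dedicated library is the more practical choice than parsing text by hand.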
Variant Call Format (VCF)
RNA-Seq Data Analysis
Computational Analysis of RNA-Seq Data
  Source: Conesa et al., Genome Biology, 2016, 17:13
RNA-Seq Data Analysis Workflow
  Illumina, Ion Torrent, PacBio
  FastQC, FQTrim
  STAR, HISAT, TopHat, Sailfish, Salmon
  Cufflinks, edgeR, DESeq
  CuffDiff, DESeq, edgeR, Limma
  GSEA, IPA, DAVID, GO, etc.
Input Files for RNA-seq Analysis
  Download the Test Data file from the Course Page and unzip the folder
Galaxy Server
  https://usegalaxy.org/
  - A large compilation of open-source NGS data analysis tools that are accessible to users on web-based platforms
  - Data can be uploaded from a PC/Mac and computing can be done on the cloud
  - No need to install tools and maintain servers locally
  - In-depth tutorials are available for using Galaxy services
  - A list of Public Galaxy Servers can be found at https://galaxyproject.org/public-galaxy-servers/
  - Today's RNA-seq analysis will be performed from the following link: https://bioinf-galaxian.erasmusmc.nl/galaxy/
Phred Score (Q) Explained
  Phred score (Q) vs error probability (P)
  $Q = -10 \log_{10} P$
Base Sequence Quality Interpretation
  [Figure panels: Bad quality; Excellent quality; Quality drops at the tail end; Bad quality]
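The relationship between Phred score and error probability, Q = -10 log10(P), can be checked numerically. The sketch below converts in both directions and decodes a FASTQ quality string. It assumes the Phred+33 ASCII encoding (offset 33, where ! corresponds to Q = 0); older Illumina files used other offsets. All function names and the toy quality string are illustrative.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P), defined for 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> List[int]:
    """Convert ASCII quality characters to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


# Common reference points: Q10 = 1 in 10, Q20 = 1 in 100, Q30 = 1 in 1,000.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))

# Toy quality string (made up): '!' is the lowest score, '~' the highest.
scores = decode_quality_string("!5I~")
print(scores)  # [0, 20, 40, 93]
print(sum(scores) / len(scores))  # mean quality of this toy read
```

Because the scale is logarithmic, every increase of 10 in Q corresponds to a tenfold drop in the error probability.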
Read Mapping and Assembly
  Source: https://home.cc.umanitoba.ca/~frist/plnt7690/lec12/lec12.3.html
Downstream Analysis of RNA-seq Results
  Hierarchical Clustering
  IPA: Ingenuity Pathway Analysis
  GSEA: Gene Set Enrichment Analysis
  Source: Yoo et al., Nature Genetics, 2014
  Source: Li et al., Scientific Reports, 2015
  Source: Graner et al., Front. Oncology, 2015
  Source: Bee et al., PLoS ONE, 2011