Workflows and Pipelines for NGS analysis: Lessons from proteomics

1 Workflows and Pipelines for NGS analysis: Lessons from proteomics. Conference on Applying NGS in Basic research, Health care and Agriculture, 11th Sep 2014. Debasis Dash

2 Where are the protein coding genes in a genome Genome annotation ATGAAGAAGCTGTGTGCTTTCACTATTGCCTTTTTTTCCCTGAAGTTTTGTCTCATCTTGTGCAGTTTGACTGAACCCAATTGCTTTTGGAAGATAAAGAAGAGAGAAGTTAATGAT GGAGATTTGCAAAATGAGTGTGGTTTTGTCCTTTTTACACTTGAGAGCCCTATTGAAGAAAATTTTTATAATCACATTATTAATTTTAGGATACCAGCAAGAAAATATGAATTTTTTC TGGTAATGTTTTTTGCTACTGATGAGATCAACAAGAATCCTTATCTTTTATCCAACATGTCTTTGATATTTTCCTTCATTTTTGGTATGTGTGAAGATACAATGGGAGTTCTGGATAA AGCATATTTACATCAAAACAACTATTTCGATCTACTTAATTATAACTGTGGAAGAAAGAAACGTTGTGATGTAAAACTTACAGGACCATCATGGAAAACTTCCTTAAAACTTTCAGT TAATTCAAGGGCACCAAAGATTTTCTTTGGACCATTTAATCCTAACCTGAGTGACCATGACCAGTTTCCCTATATCTATCAGATAGCAACCAAGGACACATATTTGCTCCATGGCAT GGTCTCCTTGATGTTTCATTTTGAATGGACTTGGATAGGACTGATCATCACAGATGATGACCAAGGTATTCAGTTTCACTCAGACTTGAGAGAAGAAATGCAAAGGCATGCGATCT GTTTAGCTTTTGTGATTATGATCCCAGAAAGCATTAAGTTATACAACACAAAGTTTAAGATATATGACCAACAACTTATGACATCTTCAGCAAAGGTTACTATCATTTATGGCAAAA TGATCTCCACTCTAGAACTCAACTTTGCAAGATGGACATATTTAGTTGCACGGAGAATCTGGATCACAACCTCAAAATTGGATGTCATCACATATGATAAAGATTTCAGCCTTGATT TCTTCCACGGGACTGTCATTTTTGCCCACCACCACAATGACATCGCTACATTTAGAAATTTTATGCAAATAATAAACACATCCAAGTATCCAGTAGATATTTCTCAGTCTATGGGGCA GTGGAATCATTTTAACTGTTCAATCTCAAAGAACAAGAAGAAAATGGATTTTTTTATGTTGAAAAACCCAATGGAATGGTTAACACAGCACACATTTGACATGGTCCTGAGTGAAG AAGGTTACAATTTGTATAATGCTGTGTATGCTGTGGCCCACACCTATCACGAACTCATTTTTCAACAAGTAGAGTCTCAGGAAATGGCCAAACCCAAAGGACTATTCACTGACTGT CAGCAGGTGGCTTCTTTGCTTAAAACTAGGGTATTTACTAACCCTGTTGGAGAGCTGGTGAACATGAATCATAAGGAAAATCAGTGTGCCAAGTATGACATTTTCATCATTTGGAA TTTTCCAAATGGCCTTGGATTAAAAGTGAAAATAGGAAGCTATTTTCCTTGTTTGCAACAGAGTCAACATCTTCATATATCTGAAGACTGGGAGTGGGTTACAGGAGAAACATTGG TTCCCTCCTCAGTGTGTAGTGAGACATGTACTGCAGGATTCAGAAAAAGTCATCAGAAACAAACAGCCAACTGCTGCTTTGATTGTGTCCAGTGCCAAGAAAATGAGATTGCCAAT

3 Importance of genome annotation. Diagram linking genome annotation to: transcriptome, proteome, interactome, metabolome, reactome, systems biology, structural biology. Armengaud J. Proteogenomics and systems biology: quest for the ultimate missing parts. Expert Rev Proteomics. 2010

4 Solving a puzzle when pieces are missing or broken

5 How are proteins detected from samples? A high-throughput method of protein identification. Experimental side: protein extraction, protease digestion, LC-MS (MS1, MS/MS), experimental MS/MS spectrum. Theoretical side: protein database, theoretical peptide digestion, peptide fragmentation simulation, theoretical MS/MS spectrum. A peptide-spectrum match scorer compares the experimental and theoretical MS/MS spectra.
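The theoretical side of this workflow can be illustrated with a minimal sketch. It assumes trypsin as the protease (cleavage after K or R, not before P), no missed cleavages, unmodified residues, and singly charged b- and y-ions. The function names and the shared-peak count are illustrative only; they are not the scoring used by MassWiz or any other search engine named in this talk.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O
PROTON = 1.007276   # mass of a proton (charge carrier)


def tryptic_digest(protein, min_len=1):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return [p for p in peptides if len(p) >= min_len]


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as singly charged m/z lists."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


def shared_peak_count(theoretical, observed, tol=0.5):
    """Toy scorer: number of theoretical peaks with an observed peak within tol."""
    return sum(1 for t in theoretical if any(abs(t - o) <= tol for o in observed))


if __name__ == "__main__":
    for pep in tryptic_digest("MKAAGRPLKTEST"):
        print(pep, fragment_ions(pep))
```

Real search engines add missed cleavages, modifications, multiple charge states and a statistical score on top of this skeleton.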

6 Proteomics: Challenges. (Chart: identified vs. unidentified spectra.) A large fraction of experimental spectra remain unidentified. Possible reasons: unknown modifications on the peptides; limitations of the search algorithm; noisy spectra; spectra of non-peptidic origin; peptides missing from the search database.

7 Controlling error rates through decoys. (Plot: target and decoy score distributions with a threshold score.) Concatenated target-decoy search*: FDR = 2 x decoy / (target + decoy). Separate target and decoy search**: FDR = decoy / target. * Nature Methods 4 (2007). ** J. Proteome Res., 2008, 7 (01), pp 29-34
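The two FDR estimates on this slide can be written out as a short sketch. It assumes that PSMs are given as (score, is_decoy) pairs, that a higher score is better, and that the counts are taken over all PSMs at or above a threshold. The function names are illustrative and tied scores are not treated specially.

```python
def fdr_concatenated(n_target, n_decoy):
    """Concatenated target-decoy search: FDR = 2 x decoy / (target + decoy)."""
    total = n_target + n_decoy
    return 2.0 * n_decoy / total if total else 0.0


def fdr_separate(n_target, n_decoy):
    """Separate target and decoy searches: FDR = decoy / target."""
    return n_decoy / n_target if n_target else 0.0


def threshold_for_fdr(psms, max_fdr=0.01, fdr_fn=fdr_separate):
    """Lowest score whose cumulative FDR estimate is <= max_fdr, or None.

    psms: iterable of (score, is_decoy); higher scores are assumed better.
    """
    best = None
    n_target = n_decoy = 0
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            n_decoy += 1
        else:
            n_target += 1
        if fdr_fn(n_target, n_decoy) <= max_fdr:
            best = score
    return best


if __name__ == "__main__":
    # Toy counts, not data from the talk.
    print(fdr_concatenated(950, 50))  # 2 * 50 / 1000
    print(fdr_separate(950, 50))      # 50 / 950
```

With 950 target and 50 decoy hits above the threshold, the concatenated estimate is 0.1 and the separate-search estimate is about 0.053, so the two formulas are not interchangeable.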

8 MassWiz: An advanced algorithm for peptide discovery. Scoring features: intensity of matching peaks; continuity of y-ions and b-ions; neutral losses and immonium ions; fragment mass error sensitive scoring. Yadav AK, Kumar D, Dash D. MassWiz: a novel scoring algorithm with target-decoy based analysis pipeline for tandem mass spectrometry. J Proteome Res. 2011

9 Algorithm comparison: PSMs. Data: ISB standard protein mix

10 Proteomics: Challenges. (Chart: identified vs. unidentified spectra.) A large fraction of experimental spectra remain unidentified. Possible reasons: unknown modifications on the peptides; limitations of the search algorithm; noisy spectra; spectra of non-peptidic origin; peptides missing from the search database.

11 Proteogenomics: An alternate proteomic search strategy

12 Proteogenomics: An alliance of Genomics and Proteomics. Genome annotation combined with proteomic identifications yields known peptides and novel peptides. Novel peptides can indicate: a novel gene; a gene on a different frame; a gene on the opposite strand; a gene model change. Armengaud J. A perfect genome annotation is within reach with the proteomics and genomics alliance. Curr Opin Microbiol. 2009
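The peptide categories on this slide can be sketched as a coordinate comparison for a prokaryotic genome (no introns). Everything here is a hypothetical illustration: the `Gene` record, 1-based inclusive coordinates, and the rule that an in-frame peptide extending beyond an annotated gene counts as a gene model change. The sketch reports only the first overlapping gene and is not the logic of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def classify_peptide(pep_start, pep_end, pep_strand, genes):
    """Classify a genome-mapped peptide against annotated genes (sketch)."""
    for g in genes:
        overlaps = pep_start <= g.end and pep_end >= g.start
        if not overlaps:
            continue
        if pep_strand != g.strand:
            return "gene on opposite strand"
        # Same strand: compare codon phase relative to the annotated start.
        if pep_strand == "+":
            in_frame = (pep_start - g.start) % 3 == 0
        else:
            in_frame = (g.end - pep_end) % 3 == 0
        if not in_frame:
            return "gene on different frame"
        inside = g.start <= pep_start and pep_end <= g.end
        return "known peptide" if inside else "gene model change"
    return "novel gene"


if __name__ == "__main__":
    genes = [Gene(100, 400, "+")]  # toy annotation
    print(classify_peptide(130, 159, "+", genes))
    print(classify_peptide(70, 129, "+", genes))
    print(classify_peptide(500, 530, "+", genes))
```

In practice the known or novel status also depends on whether the peptide sequence is present in the annotated protein database, which this coordinate-only sketch does not check.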

13 Genomics and Proteomics: lack of an analysis pipeline or software for integrating proteomics data with genome or genomics data

14 Bridging the Gap. (1) Developing computational strategies to identify novel protein coding loci from MS data. (2) Methods for identifying splice variants from proteomics data and discovering novel translation products in a eukaryotic model organism.

15 Proteogenomic analysis of Mycobacterium tuberculosis

16 Does Mycobacterium tuberculosis need re-annotation? Genome size: 4.4 Mb. Cole et al (1998): ORFs annotated in the first genome draft. Camus et al (2002): genes in re-annotation. 3988 protein coding genes (NCBI RefSeq); 3987 protein coding genes (Sanger Institute); 3918 protein coding genes (TIGR/JCVI); 4,012 protein coding genes (TubercuList R21). 50% of the genes vary in translation initiation site (TIS) between the Sanger and TIGR annotations (de Souza et al 2008).

17 Deep proteome profiling is achieved. 123 LC-MS runs of cell lysate and culture filtrate of Mtb H37Rv. 3176 out of 3988 NCBI RefSeq proteins (80% of the Mtb proteome) identified. Translational evidence for 829 hypothetical proteins; 233 of the 829 hypothetical proteins identified for the first time. (Chart: identified 59%, identified hypothetical 21%, unidentified 20%.) In collaboration with Dr. Akhilesh Pandey & IOB

18 Mtb H37Rv: Novel Translations. Novel protein coding loci; changes in 79 existing gene models; TIS corrected for 33 proteins and confirmed for 868 proteins; conservation of novel proteins. Kelkar DS, Kumar D et al. Proteogenomic analysis of Mycobacterium tuberculosis by high resolution mass spectrometry. Mol Cell Proteomics

19

20 Challenges in proteogenomics. Steps: database creation; spectra processing; peptide assignment; FDR estimation; peptide mapping to genome; gene coordinate comparison; peptide classification; result reporting; visualization. Search algorithms: MassWiz, X!Tandem, OMSSA, InsPecT. A solution with complete automation and high fidelity of results is required.

21

22 Integrating results from multiple algorithms: Set theory. Peptides identified by multiple algorithms have few false positives, but this method does not allow the false discovery rate to be controlled or estimated.
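A minimal sketch of the set-theory approach, assuming each algorithm reports a set of peptide sequences. The peptide strings below are made up for illustration, and `consensus_peptides` is a hypothetical helper, not part of any tool named here.

```python
from collections import Counter


def consensus_peptides(results, min_algorithms=2):
    """Peptides reported by at least min_algorithms search engines.

    results: dict mapping algorithm name -> set of peptide sequences.
    """
    counts = Counter(p for peptides in results.values() for p in set(peptides))
    return {p for p, n in counts.items() if n >= min_algorithms}


if __name__ == "__main__":
    # Toy identifications, not data from the talk.
    results = {
        "MassWiz": {"PEPTIDEK", "AAAK", "CCCR"},
        "OMSSA": {"PEPTIDEK", "AAAK"},
        "X!Tandem": {"PEPTIDEK", "DDDK"},
    }
    print(consensus_peptides(results, min_algorithms=2))
    print(consensus_peptides(results, min_algorithms=len(results)))  # strict intersection
```

The result is a plain set with no score attached, which is why no FDR can be estimated for it.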

23 Integrating results from multiple algorithms: FDRscore. Metrics from multiple algorithms are not comparable (OMSSA E-value, X!Tandem P-value, InsPecT P-value, MassWiz score). FDR values from individual algorithms can be processed to generate a common score (Jones AR et al, Proteomics 2009). FDRscore-based result integration allowed statistical Q-value (FDR) assessment of the final results. (Plots: FDR, Q-value and FDRscore against score or P-value.)
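One ingredient of this approach, turning a raw FDR curve into monotonic q-values, can be sketched as follows. This is only the q-value step, assuming (score, is_decoy) pairs, higher score is better, and the separate-search estimate FDR = decoy / target. The interpolation that produces the FDRscore itself in Jones et al is not reproduced here.

```python
def q_values(psms):
    """Return [(score, is_decoy, fdr, q_value)] sorted by descending score.

    psms: iterable of (score, is_decoy); higher scores are assumed better.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Cumulative FDR estimate at each rank (decoy / target).
    fdrs = []
    n_target = n_decoy = 0
    for _, is_decoy in ranked:
        if is_decoy:
            n_decoy += 1
        else:
            n_target += 1
        fdrs.append(n_decoy / max(n_target, 1))

    # q-value: the minimum FDR at this rank or any lower-scoring rank.
    qs = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qs.append(running_min)
    qs.reverse()

    return [(s, d, f, q) for (s, d), f, q in zip(ranked, fdrs, qs)]


if __name__ == "__main__":
    # Toy PSM list, not data from the talk.
    for row in q_values([(10, False), (9, False), (8, True), (7, False), (6, False)]):
        print(row)
```

Because q-values are on the same scale for every search engine, they give a common axis on which results from OMSSA, X!Tandem, InsPecT and MassWiz can be compared.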

24 Novel Proteome of B. japonicum. 59 novel proteins identified; 51 novel proteins with 2 or more unique peptides. Single-peptide hits are selected only if identified in at least 2 samples and after manual inspection. 49 gene model changes identified, with the translation start site suggested upstream of the current annotation. TIS confirmed for 21 genes; TIS corrected for 1 gene.

25 A novel protein reveals a novel operon. Evidence shown: ORF length; codon bias; promoter region; ribosome binding site; FgeneSB operon.

26 A gene model change

27 Novel peptides are distributed throughout the genome. Is there a common theme of novel identifications? Most novel proteins are short proteins. TTG start codon in gene model changes.

28 A methylotroph: an organism able to grow on reduced carbon compounds such as methanol or methylamine. Ecologically important: supports vegetation by producing phytohormones. Industrial application: production of important chemicals and biomolecules on methanol feedstock. Model organism: used to study methylotrophic metabolism. Member of the Methylobacteriaceae family: a diverse taxonomy with many genes specific to one genome.

29 Results: 31 novel protein coding genes; 70 gene model changes; 104 methylotrophy gene products; 2,678 proteins.

30 Limited conservation and low GC content of novel genes suggest lateral gene transfer as the probable mode of origin

31 (1) Developing computational strategies to identify novel protein coding loci from MS data. (2) Methods for identifying splice variants from proteomics data and discovering novel translation products in a eukaryotic model organism.

32 Exon junction peptides for detecting splice variants: (1) exon boundary peptide; (2) splice variant; (3) new exon; (4) a new 3' splice site; (5) a new 5' splice site

33 Eukaryotic Proteogenomics. Gene peptides and novel peptides. Novel peptides may map on: a UTR; an intron; a non-coding gene; an intergenic region; the opposite strand; an intron as a junction peptide; a different translation frame.

34 Proteogenomics: Prokaryotic vs. Eukaryotic (panels: Prokaryotic Proteogenomics; Eukaryotic Proteogenomics)

35 RNA-seq analysis pipeline to capture transcriptome. (1) Raw RNA-seq reads from the NCBI SRA repository. (2) Read QC and processing using Trimmomatic. (3) Filtered read mapping on the reference genome using the STAR aligner. (4) Transcript assembly by Cufflinks. (5) Assembly QC and comparison using cuffcompare and BLAST. (6) FASTA of all transcripts generated using gffread. (7) Theoretical translated protein database. Example transcript records: >TCONS_ gene=xloc_ loc: ATTTTGGAGTTGTGTAGCCAAT.. ; >TCONS_ gene=xloc_ loc: AAGGTTCAAGGTACAAGGTGGGGTATGCC. Example translated records: >TCONS _420_548_3 TQTHIGQGRDEYLYDSHGSLSRPSSMSTSLPFNRASEHGICC ; >TCONS _769_999_1 SSKVWWLKYTWPMASGSVRRYGLFGVDVAFEEVCHCGGGMGF.
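Step 7, building a theoretical translated protein database from the transcript FASTA, can be sketched as a three-frame translation. The following are assumptions made for illustration: the standard genetic code, forward frames only, stop-to-stop segments, a minimum length filter, and a header layout of name_start_end_frame guessed from the example records above (the end coordinate here excludes the stop codon). None of this is confirmed as the method used in the pipeline.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a nucleotide string; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def orf_peptides(name, seq, min_len=30):
    """Yield (header, protein) for stop-to-stop segments in 3 forward frames.

    Header layout name_start_end_frame is an assumed convention;
    start and end are 1-based nucleotide positions on the transcript.
    """
    for frame in range(3):
        protein = translate(seq[frame:])
        pos = 0  # residue index of the current segment within this frame
        for segment in protein.split("*"):
            if len(segment) >= min_len:
                start = frame + 3 * pos + 1
                end = start + 3 * len(segment) - 1
                yield f"{name}_{start}_{end}_{frame + 1}", segment
            pos += len(segment) + 1  # +1 skips the stop codon


if __name__ == "__main__":
    # Toy transcript, not a sequence from the talk.
    for header, protein in orf_peptides("toy", "ATGGCCTAAGGG", min_len=2):
        print(">" + header)
        print(protein)
```

For unstranded assemblies the reverse complement would also need to be translated, giving six frames in total; that is omitted here for brevity.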

36 EuGenoSuite: Integrates transcriptomics with proteomics. (7) Translated records from the transcriptome, e.g. >TCONS _420_548_3 TQTHIGQGRDEYLYDSHGSLSRPSSMSTSLPFNRASEHGICC and >TCONS _769_999_1 SSKVWWLKYTWPMASGSVRRYGLFGVDVAFEEVCHCGGGMGF. derived from transcripts such as >TCONS_ gene=xloc_ loc: ATTTTGGAGTTGTGTAGCCAAT.. and >TCONS_ gene=xloc_ loc: AAGGTTCAAGGTACAAGGTGGGGTATGCC. Search tools shown: OMSSA, GenoSuite, X!TANDEM. (8) Tandem mass spectra. (9) Peptide identification. (10) Protein grouping / protein assembler. (14) Novel / known categorization.

37 Genome size and annotation comparison. Columns: Organism; Genome Size$ (Mb); Annotated Proteins*. Human: 3, ,763. Mouse: 2, ,165. Rat: 2, ,725. *Ensembl release 74. $NCBI Genome

38 Case study dataset: Rattus norvegicus. 9 tissues (brain, heart, liver, lung, muscle, spleen, testes, colon, kidney) with 3 replicates each. Sequencing instruments: HiSeq 2000, Illumina GAII. Samples 1 to 3, each with two technical replicates (T1: technical replicate 1; T2: technical replicate 2).

39 Inputs: 400 million paired-end reads (transcriptomic analysis pipeline) and 2 million MS/MS spectra (EuGenoSuite). Results: 11,725 peptides (1% FDR, identified in both T1 and T2); 11,413 mapped to known proteins; 312 novel peptides (275 with unique mapping): 25 UTR, 14 intronic, 145 intergenic, 28 non-coding loci, 45 spliced peptides, 18 different frame.

40 Discovery of a splice variant for threonyl-tRNA synthetase

41 Translation of a pseudogene (figure labels: pseudogene; paralog PCBP2)

42 Rat Analysis Summary. 105,380 unique transcripts assembled; 2,900 annotated proteins identified; transcripts and peptides for eight pseudogenes; translation of exons annotated as non-coding (15 genes); 45 splice variants detected.

43 Translation of a novel gene locus

44 Summary. Part 1: Proteomics data, when searched against a genomic background, aid novel protein discovery. GenoSuite: a fully automated multi-algorithmic proteomics and proteogenomics analysis tool. Comprehensive proteogenomic analysis of B. japonicum improves protein annotation of rhizobia. N-terminal acetylation of bacterial proteins. Part 2: Integrated analysis of RNA-seq and mass spectrometry proteomics data tracks down novel protein isoforms. EuGenoSuite: an in-house pipeline for eukaryotic proteogenomics. Translation of pseudogenes in rat microglia.

45 Conclusion. Proteomics, transcriptomics and genomics; data integration with GenoSuite and EuGenoSuite; novel discovery; genome annotation.

46 Acknowledgements: IGIB IT team and IOB team; IGIB friends and family

47 Thank you