Bioinformatics- Data Analysis

Dennis Stone
5 years ago

1 Bioinformatics- Data Analysis
Erin H. Graf, PhD, D(ABMM)
Infectious Disease Diagnostics Laboratory, Children's Hospital of Philadelphia
Department of Pathology and Laboratory Medicine, University of Pennsylvania

2 Outline
- Goal: Raw data to virus detection and/or typing/epidemiology
- Making sense of large amounts of data can be intimidating
- Focus on tools you can go home and use immediately
- Pre-processing
  - Quality analysis
  - Filtering steps and tools
- Processing
  - Bioinformatic tools when virus is known
  - Bioinformatic tools when virus is unknown (agnostic)
- Interpretation
  - Important variables
  - Sources of error
- Standardization and Validation

3 Pre-processing: Raw data

4 Pre-processing steps
- Goal: Refine sequence data to contain only the best-quality reads, to reduce downstream errors
  - Select examples in Interpretation section
- Quality summary provided by instrument
- Barcode/adapter trim, instrument-specific filtering
- Secondary custom filtering

5 Pre-processing: Instrument report
[Figures: example Illumina report and example Ion report; panel labels: Shotgun DNAseq, Targeted FFPE sequencing]

6 Pre-processing: Filter and Trim
- Reads are filtered and excluded based on instrument quality cutoffs
- FASTQ files can be downloaded or analyzed directly through Apps (Illumina) or Plugins (Ion)
- Secondary analysis through FastQC
- Bioinformatic suites capable of custom trimming
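The custom trimming step can be sketched in a few lines of Python. This is a minimal illustration, not what any instrument app or bioinformatic suite actually does: it assumes Phred+33 encoded quality strings, and the thresholds (`min_q=20`, `min_len=30`) and function names are hypothetical choices made for the example.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim bases below min_q from the 3' end of one read.

    Assumes Phred+33 quality encoding (an assumption for this sketch).
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(records, min_q=20, min_len=30):
    """Yield trimmed (name, seq, qual) records that are still long enough.

    Reads that fall below min_len after trimming are dropped.
    """
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            yield name, seq, qual


# Toy input: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
reads = [
    ("read1", "ACGTACGT", "IIIIII##"),  # last two bases are low quality
    ("read2", "ACGTACGT", "########"),  # whole read is low quality
]
kept = list(filter_reads(reads, min_q=20, min_len=4))
```

Note the trade-off raised later under sources of error: an aggressive `min_q` shortens reads (over-trimming), which can make mapping less specific.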

7 Q score
[Figure: Q scores, targeted DNAseq from FFPE]
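For readers new to Q scores: a Phred quality score Q corresponds to an estimated base-call error probability P through Q = -10 * log10(P), so Q20 means a 1 in 100 chance of error and Q30 means 1 in 1,000. The sketch below shows the conversion; the function names are made up for illustration, and the Phred+33 offset is an assumption about the FASTQ encoding.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def mean_error_prob(qual, offset=33):
    """Average per-base error probability of one FASTQ quality string.

    Assumes Phred+33 encoding. Averaging probabilities (rather than
    Q scores) avoids overstating the quality of a read with a few bad bases.
    """
    probs = [phred_to_error_prob(ord(c) - offset) for c in qual]
    return sum(probs) / len(probs)
```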

8 Processing
- Trimmed, quality-filtered reads are now ready for alignment
- Bioinformatic tools for known virus
- Bioinformatic tools for unknown virus (agnostic)

9 Bioinformatics tools: known virus
- Goal: Generate full-length virus sequence for downstream analysis
  - Typing, epidemiology, resistance marker analysis
- Align to single (or list of) reference genome(s)
  - Various alignment algorithms
  - Custom trimming
- Inputs: FASTQ files and a virus reference genome
- Output: creates a Sequence Alignment/Map (SAM) file
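Once an aligner has written a SAM file, a quick sanity check is to count how many reads actually mapped to each reference. The sketch below reads SAM text directly and relies only on the standard column layout (FLAG in column 2, RNAME in column 3) and the standard flag bits for unmapped (0x4), secondary (0x100), and supplementary (0x800) alignments. The function name and the reference name `virus_ref` are hypothetical.

```python
from collections import Counter


def count_mapped_reads(sam_lines):
    """Count primary mapped alignments per reference in SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        rname = fields[2]
        if flag & 0x4:
            continue  # read is unmapped
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800) alignment
        counts[rname] += 1
    return counts


# Toy SAM content; "virus_ref" is a made-up reference name.
example = [
    "@SQ\tSN:virus_ref\tLN:1000",
    "r1\t0\tvirus_ref\t10\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",
]
print(count_mapped_reads(example))
```

For a real file, passing an open file handle (`with open(path) as fh: count_mapped_reads(fh)`) works the same way, since the function only iterates over lines.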

10 Bioinformatics tools: known virus
- Bioinformatic suites
  - Geneious
  - CLC Workbench
  - BioNumerics
  - Others
- Graphical User Interface (GUI)
  - Very intuitive and user friendly
- Alignment plugins
- Pull reference genomes from NCBI

11 Bioinformatics tools: known virus
- Downstream analysis tools
  - Annotation
  - Phylogenetic analysis
Giberson et al. (2011) NAR
Stephanie Mitchell, PhD

12 Bioinformatic tools: unknown virus (agnostic)
- Goal: Detect any virus sequence present in a clinical sample
- Bioinformatic suites previously mentioned
  - Manually curate list of reference sequences to align reads against
- Web-based metagenomic pipelines
  - OneCodex
  - Taxonomer
  - CosmosID

13 k-mer classification
Wood & Salzberg (2014), Genome Biology 15(3)
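The idea behind k-mer classification can be shown with a toy example. This is a deliberately simplified sketch and not the algorithm from the cited paper, which maps each k-mer to the lowest common ancestor of the genomes containing it and scores paths in a taxonomy tree. Here, each k-mer that is unique to one reference casts a vote and the read is assigned by simple majority. The reference sequences, labels, and `k=4` are made up; real tools use much longer k-mers.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map each k-mer to the set of reference labels that contain it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify_read(read, index, k):
    """Assign a read to the label with the most unique k-mer hits.

    Returns None when no k-mer of the read is unique to a single label.
    """
    votes = Counter()
    for kmer in kmers(read, k):
        labels = index.get(kmer)
        if labels and len(labels) == 1:  # ignore k-mers shared between labels
            votes[next(iter(labels))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy references (hypothetical sequences, not real viral genomes).
references = {
    "virus_A": "ACGTACGTTGCA",
    "virus_B": "TTTTGGGGCCCCAAAA",
}
index = build_index(references, k=4)
```

Because classification only needs dictionary lookups rather than alignments, this style of approach scales to millions of reads, but a read can only be assigned to something already present in the reference index.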

14 OneCodex

15 Taxonomer

16 CosmosID

17
| Pipeline feature | OneCodex | Taxonomer | CosmosID |
|---|---|---|---|
| Analysis time* | ~8 minutes | ~5 minutes | ~5 minutes |
| Number of virus genomes | 5,137 | >90,000 | 5,025 |
| Comparison between samples | Yes | No | Yes |
| Upload many samples at once | Yes | No | Yes |
| Searchable reference genome list | Yes | No | No |
| Independent view of virus hits | No | Yes | Yes |
| Visual manipulation | No | Yes | Yes |
| Connection to sequencer's cloud | No | No | Yes |
*For 1 sample with 2 million reads

18 Pipeline comparison: Adenovirus from a conjunctival swab

19 Pipeline comparison: Adenovirus from a conjunctival swab
| Bioinformatic method | Reads of Adenovirus | Type of Adenovirus |
|---|---|---|
| OneCodex | 39,175 | B |
| Taxonomer | 5,833 | B |
| CosmosID | 39,930 | B |
| Manual alignment and BLAST analysis | 38,573 | B, serotype 3 |

20 Pipeline comparison: Enterovirus from a nasopharyngeal aspirate

21 Pipeline comparison: Enterovirus from a nasopharyngeal aspirate
| Bioinformatic method | Reads of Enterovirus | Type of Enterovirus |
|---|---|---|
| OneCodex | 2 | Not typed |
| Taxonomer | 609 | EV-D68 |
| CosmosID | 1,498 | EV-D68 |
| Manual alignment and BLAST analysis | 1,174 | EV-D68 |

22 Interpretation
- Goal: Make accurate predictions with sequence data
  - Is the patient infected with virus X?
  - Is this SNP real?
- Important variables
  - Number of reads
  - Location of reads
  - Depth of coverage
- Sources of error
  - PCR errors during library prep or cluster generation
  - Read length (over-trimming)
  - Sequencing errors
  - Mapping errors
  - Contamination

23 Interpretation
- You have results from a web-based pipeline, now what?
- How do you decide what is real/meaningful?
- Manual confirmation is recommended
  - Can differentiate real hits from false positives
  - Genotyping/resistance mutation detection
    - SNPs may be the result of quality issues

24 Interpretation: Important variables
- Number of reads
- Location of reads
  - All in one region?
- Depth of coverage
  - Average coverage across the viral genome
  - Coverage at each individual base
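These variables are easy to compute once reads have been aligned. The sketch below assumes each aligned read has already been reduced to a (start, end) interval on the reference (0-based, end-exclusive); that representation and the function names are choices made for this example. Breadth of coverage (the fraction of the genome covered at least once) is included because it answers the "all in one region?" question directly.

```python
def per_base_depth(alignments, genome_length):
    """Return the number of reads covering each reference position.

    alignments: iterable of (start, end) intervals, 0-based, end-exclusive.
    """
    depth = [0] * genome_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    return depth


def coverage_summary(depth, min_depth=1):
    """Summarize a per-base depth list."""
    covered = sum(1 for d in depth if d >= min_depth)
    return {
        "mean_depth": sum(depth) / len(depth),  # average coverage
        "breadth": covered / len(depth),        # fraction of bases covered
        "min_depth": min(depth),
        "max_depth": max(depth),
    }


# Toy example: two overlapping reads on a 10-base reference.
depth = per_base_depth([(0, 5), (3, 8)], genome_length=10)
summary = coverage_summary(depth)
```

A high mean depth with low breadth suggests reads piling up in one region, which is a common signature of a false-positive hit.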

25 Depth of coverage example

26 Low read count example
Manual alignment to Torque teno midi virus 2 reference genome = 0 reads

27 Interpretation: Sources of Error
- PCR errors during library prep or cluster generation
- Read length (over-trimming)
- Sequencing errors
- Mapping errors
- Contamination
Note: Positive and negative controls can help with some of these issues

28 Mapping error example
[Figure: two panels, each labeled "Measles virus reference genome". A = in silico Dolphin morbillivirus; B = in silico low-level measles virus]
Schlaberg et al. (2017) Arch Path & Lab Med

29 Contamination example
- A sample with 652,676 reads of Rhinovirus contaminates a neighbor with 100 reads of Rhinovirus
  - The neighbor was sequenced at a depth of 1 million reads
[Figure axes: percent false-positive Rhinovirus reads in sample with contamination; Rhinovirus reads in contaminating sample]
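The numbers on this slide can be turned into two useful ratios. The sketch below is one reading of the example, not a method stated in the talk: it treats the 100 reads in the neighbor as carryover from the high-titer sample and expresses them both as a fraction of the source reads and as a percentage of the neighbor's total reads. The function names are hypothetical.

```python
def carryover_rate(reads_in_neighbor, reads_in_source):
    """Fraction of the source sample's virus reads that appear in the neighbor."""
    return reads_in_neighbor / reads_in_source


def percent_of_sample(virus_reads, total_reads):
    """Virus reads as a percentage of all reads in a sample."""
    return 100 * virus_reads / total_reads


# Values taken from the slide.
source_reads = 652_676      # Rhinovirus reads in the contaminating sample
neighbor_reads = 100        # Rhinovirus reads seen in the neighbor
neighbor_depth = 1_000_000  # total reads sequenced for the neighbor

rate = carryover_rate(neighbor_reads, source_reads)
neighbor_pct = percent_of_sample(neighbor_reads, neighbor_depth)

print(f"carryover: {rate:.4%} of source reads")
print(f"false-positive reads: {neighbor_pct:.2f}% of the neighbor's reads")
```

Under this reading, roughly 0.015% of the source sample's Rhinovirus reads showing up next door is enough to produce 100 false-positive reads, which is why run-level controls and read-count cutoffs matter.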

30 Standardization and Validation
- NY State validation guidelines
  - Minimum of Q20 per base and Q20 per mapped read
- FDA Draft Guidance
  - Need for regulatory-grade sequence database
  - Cutoff values for positivity
- ARUP, UCSF, ASM PPC & CAP MRC clinical validation guidance
  - 5 million total reads per sample: cutoff in CSF
  - Minimum of 3 non-overlapping virus gene reads: cutoff for positive result
- ASM-AAM NGS Report
  - Interpretive guidelines
  - Quality standards
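Cutoffs like these are straightforward to encode as an explicit rule, which also makes them auditable. The sketch below is one possible interpretation, not the wording of the cited guidance: it treats 5 million total reads as a minimum depth for the sample and counts the largest set of mutually non-overlapping aligned reads using a standard greedy interval-scheduling pass. Function names, return labels, and the (start, end) end-exclusive interval representation are assumptions.

```python
def max_non_overlapping(intervals):
    """Largest number of pairwise non-overlapping (start, end) intervals.

    Greedy interval scheduling: sort by end position and keep every
    interval that starts at or after the end of the last one kept.
    """
    count = 0
    last_end = None
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if last_end is None or start >= last_end:
            count += 1
            last_end = end
    return count


def call_result(read_intervals, total_reads,
                min_total_reads=5_000_000, min_reads=3):
    """Apply the depth and non-overlapping-read cutoffs to one virus hit."""
    if total_reads < min_total_reads:
        return "insufficient depth"
    if max_non_overlapping(read_intervals) >= min_reads:
        return "positive"
    return "negative"


# Toy example: four reads, two of which overlap each other.
hits = [(0, 100), (50, 150), (200, 300), (400, 500)]
result = call_result(hits, total_reads=6_000_000)
```

Requiring non-overlapping reads guards against a single amplified or repetitive region producing a positive call on its own.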

31 Conclusions
- Lack of standardized NGS analysis protocols
  - Laboratories should look to published guidance
- All analysis pipelines have limitations
  - Number and diversity of curated sequences
  - False-positive hits
- Data should be scrutinized
  - Manual confirmation of pipeline hits
  - Quality filtering

32 Resources
References:
- Wadsworth validation guidelines: files/webdoc/id WGS NGS Molecular Guidelines for Isolates_0.pdf
- ARUP, UCSF, ASM PPC, CAP MRC validation guidance: Schlaberg RS, Chiu CY, et al. (2017) Archives of Path and Lab Medicine. doi:
- FDA Draft guidance: cedocuments/ucm pdf
- Broad Institute Best Practices:
- ASM-AAM Report:
Talk to your genomics colleagues; they have probably encountered many of the same issues already