The Final Frontier. Data Analysis. Jean Jasinski, Ph.D. Field Application Scientist Sept. 27, 2017


1 The Final Frontier: Data Analysis. Jean Jasinski, Ph.D., Field Application Scientist. Sept. 27, 2017. For Research Use Only. Not for use in diagnostic procedures.

2 Final Frontier: Data Analysis
Agenda
- Introduction
- SureDesign (eArray)
- Microarray Data Analysis
- NGS Data Analysis
- Cartagenia (Alissa)

3 Standard Disclaimer: Except for GenetiSureDX and Cartagenia Bench, all other products are Research Use Only (RUO)


5 Precision Medicine Needs Precision Genomics
High resolution, accuracy and sensitivity
Key Technologies:
- Next Generation Sequencing
- Microarrays
- Digital PCR
- qPCR
- Oligonucleotide FISH

6 Puzzled by Options?


8 SureDesign and eArray: Create and View Custom and Catalog Designs
Tools: SureDesign, eArray
Design types:
- Gene Expression microarrays
- miRNA microarrays
- RNA-Seq targeted capture
- Mutagenesis (QuikChange HT)

9 SureDesign and eArray
- Web-based tools
- Same login for both tools
- Must use institutional email for account
- Create custom designs
- Customize catalog designs
- Download designs
- Order designs (trigger quote)
- Free to use


11 Microarray Data Analysis Tools
- Agilent CGH and CGH+SNP arrays (human and nonhuman): true two-color analysis; copy number; LOH and UPD (CGH+SNP); suppress, classify, edit, annotate aberrations; report generation; free
- CytogenomicsDX: GenetisureDX array analysis; FDA-cleared; free
- Gene Expression arrays, miRNA arrays, Exon and Exon Splicing Arrays, Copy Number: clustering, GEO, GO, GSA; pathway analysis; multiple vendor arrays; license fee
- MPP (Mass Profiler Pro): metabolomics and proteomics from Mass Spec data; license fee


13 OneSight: Seeing is Knowing
The OneSight cfDNA solution allows labs to study the (aneu)ploidy status of the DNA found in the cell-free fraction of a biopsy sample from low-pass whole genome sequencing data.
Key features:
- Visualization tools: detailed views of the (aneu)ploidy status of each chromosome and of all chromosomes
- Automation tools: define classification rules for marking loci for review
- Reference sets: define normal samples in the study population
- Excluded regions: remove recurrent technical noise and biologically irrelevant loci in the data

14 OneSight: Turnkey solution
- Compatible with any common NGS library prep kit
- Compatible with the most common NGS sequencing platforms
- Upload raw NGS data
- Select analysis pipeline and reference set
- Visually inspect chromosome plots

15 OneSight: Visual plots
Plot examples: Normal, Segmental aberration, Trisomy, Complex aberrations
- Developed by Cartagenia, a part of Agilent Technologies, leveraging the company's expertise with software solutions for genetics labs
- Proven SaaS approach and technology platform
- Workflow efficiency, traceability & versioning, and automation
- Setup fee and per sample analysis

16 SureCall: Alignment to Mutation
- NGS data analysis tool for biologists
- Accepts FASTQ or BAM files
- Generates VCF (4.2) and PDF or text mutation reports
- Human (hg19) DNA analysis only
- Free to Agilent Target Enrichment customers (HaloPlex, SureSelect, OneSeq)
- Runs on local computer

17 SureCall 4.0 New Features
Support SureSelect XT HS Data Analysis
- Add Molecular Barcode (MBC) analysis for SureSelect XT HS
- Improves MBC analysis flexibility
- Index hopping control, including optical duplicate removal and an estimated index hopping frequency parameter
- Additional QC metrics and plots for HS analysis
Introducing Translocation Detection
- New algorithm module
- New visualization
Overall Software Improvement
- Checks for an internet connection when a job is submitted. If no connection is available, a pop-up message warns the user that annotation results will be affected.
- Allows re-annotation to update an analyzed sample or to finish a job that failed due to network issues
- Provides a link out to ExAC (Exome Aggregation Consortium, hosted by the Broad Institute) while in Triage View
- Improved login dialog
- Better installer that checks system/hardware compatibility first
- Supports VCF v4.2 format, which includes all variant types (SNPs, indels, CNVs, translocations, etc.) from a sample
- QC report improvements (e.g. SureCall version, Design ID, and Genome Build are included in the report)

18 Choose one of the four analysis types available in SureCall
Analysis type | Description | Result
Single Sample Analysis | For individual samples | SNPs, indels, translocations
Pair Analysis | To determine copy number changes (use a normal reference); to determine somatic mutations in tumor-normal samples | SNPs and indels; CNVs; somatic mutations
Trio Analysis | For trios, typically mother, father and child | SNPs and indels; de novo mutations
OneSeq Analysis | For simultaneous detection of genome-wide copy number changes, cnLOH, SNP and indel mutations | CNVs, cnLOH, SNPs and indels

19 SureCall Support of HaloPlex HS and XT HS molecular barcodes

20 What are Molecular Barcodes (MBC)?
- Also known as Unique Molecular Identifiers (UMI) or Random Molecular Tags (RMT)
- The goal is for each original DNA fragment, within the same sample, to be attached to a unique sequence barcode
- Although similarly named, these are not the same as a sample barcode/index, which allows multiple samples to be run on a single sequencing run
- Molecular barcodes are a string of totally random nucleotides (such as NNNNNNN), partially degenerate nucleotides (such as NNNRNYN), or defined nucleotides (when template molecules are limited)
- Agilent uses 10-base MBC
Diagram labels: DNA, Adaptor, Sample Index, Molecular Barcode
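To make the barcode notation concrete, here is a minimal Python sketch that checks a barcode against a design such as NNNNNNN or NNNRNYN and counts how many distinct barcodes a design allows. It relies only on the standard IUPAC codes (N = any base, R = A or G, Y = C or T); the function names are hypothetical and are not part of any Agilent tool.

```python
import re

# Standard IUPAC nucleotide codes needed for the example designs.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
    "N": "[ACGT]",  # any base
}
# Number of bases each code can stand for.
CHOICES = {"A": 1, "C": 1, "G": 1, "T": 1, "R": 2, "Y": 2, "N": 4}


def matches_design(barcode, design):
    """Return True if the barcode is a valid instance of the design string."""
    pattern = "".join(IUPAC[code] for code in design)
    return re.fullmatch(pattern, barcode) is not None


def design_space(design):
    """Return how many distinct barcodes the design can produce."""
    size = 1
    for code in design:
        size *= CHOICES[code]
    return size


if __name__ == "__main__":
    print(matches_design("ACGATCG", "NNNRNYN"))
    print(design_space("N" * 10))  # size of a fully random 10-base barcode space
```

A fully random 10-base design gives 4^10 possible barcodes, which is why collisions between different original fragments at the same position are rare.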

21 Why are Molecular Barcodes Useful?
In capture-based technology (SureSelect HS):
- Able to identify original DNA fragments with bias from fragmentation methods
- With deep sequencing, able to use duplicate reads for error correction
In amplicon-based technology (HaloPlex HS):
- De-duplication: ability to distinguish original DNA fragments from PCR duplicates
In both:
- Accurate low allele frequency variant calling
- Calling of copy number changes
- Correction of errors introduced by PCR and sequencing

22 De-duplication: Capture without MBC
Diagram labels: Reference Genome, Exon of interest
When you de-duplicate reads that have the same start and stop point, all but one read are removed (discarded).

23 De-duplication: Capture with Molecular Barcodes
Diagram labels: Reference Genome, Exon of interest
When you de-duplicate using molecular barcodes, the reads that have the same start and stop point are not removed but are merged together to create consensus reads. This way, errors introduced by PCR or sequencing are removed.

24 De-duplication: Amplicon without MBC
Diagram labels: Reference Genome, Exon of interest
With amplicon technology, de-duplication isn't really possible: because of the nature of the amplicons, the majority of the sequencing data would be lost.

25 De-duplication: Amplicon with MBC
Diagram labels: Reference Genome, Exon of interest
When using amplicon technology with molecular barcodes, it becomes possible to de-duplicate and identify the unique molecules of DNA. The reads that have the same molecular barcode can then be used to create consensus reads and remove errors created by the library prep or sequencing processes.
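The consensus step described on the de-duplication slides can be sketched in Python as follows. This is only an illustration under simplifying assumptions: reads are plain tuples, all reads in a family have the same length, and the consensus is a simple per-position majority vote. The slides do not state how SureCall or LocatIt build consensus reads, so the function name and data layout here are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_reads(reads):
    """Merge reads that share position and molecular barcode into one consensus.

    reads: iterable of (chrom, start, stop, mbc, sequence) tuples.
    Returns a dict mapping (chrom, start, stop, mbc) to (consensus, family_size).
    Assumption: sequences within one family have equal length.
    """
    families = defaultdict(list)
    for chrom, start, stop, mbc, seq in reads:
        families[(chrom, start, stop, mbc)].append(seq)

    merged = {}
    for key, seqs in families.items():
        # Majority vote at each position removes isolated PCR/sequencing errors.
        consensus = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
        merged[key] = (consensus, len(seqs))
    return merged


if __name__ == "__main__":
    example = [
        ("chr1", 100, 104, "AACCGGTTAC", "ACGT"),
        ("chr1", 100, 104, "AACCGGTTAC", "ACGT"),
        ("chr1", 100, 104, "AACCGGTTAC", "ACTT"),  # one read carries an error
        ("chr1", 100, 104, "GGTTAACCGT", "ACGT"),  # different original molecule
    ]
    for key, value in consensus_reads(example).items():
        print(key, value)
```

Reads with the same coordinates but a different barcode stay separate, which is the difference from position-only de-duplication.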

26 Low Allele Frequency Variants (<3%)
- Low allele frequency variants are difficult to detect by conventional NGS methods
- Relatively high error rate of sequencers
Sequencer | Error rate | Error type
Illumina MiniSeq & NextSeq | <1% | Substitutions
Illumina MiSeq & HiSeq | 0.1% | Substitutions
Ion Torrent PGM, Proton & S5 | 1% | Indels & homopolymers
PacBio | 13% single pass; 1% circular consensus read | Indels
Oxford Nanopore MinION | 12% | Indels
Adapted from Goodwin et al (2016) Nature Reviews Genetics 17:

27 Detecting low allele frequency variants and DNA Inputs
Perfect world (0.1% allele frequency):
- 4 reads to create a consensus, therefore 4000x coverage would be sufficient
- = 4000 original copies of the genome (2000 cells)
- 12 ng of DNA input required
In reality, library prep is inherently inefficient. Copies remaining after each step:
- Input: 4000
- End repair & A tail: 3900
- Ligation: 2500
- Hybridisation: 1750
- Capture: 1250
- Clean up: 1000
- Library: 900
Conclusion: To detect low allele frequency variants, higher DNA inputs are required
Adapted from:
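The arithmetic on this slide can be worked through in a few lines of Python. The step counts, the 4 reads per consensus, and the 12 ng per 4000 genome copies come from the slide. The final extrapolation (how much input would be needed to still have 4000 copies in the finished library) is an assumption that losses scale linearly with input, and it is not a figure stated in the presentation.

```python
# Copies remaining after each library prep step, as listed on the slide.
STEPS = [
    ("Input", 4000),
    ("End repair & A tail", 3900),
    ("Ligation", 2500),
    ("Hybridisation", 1750),
    ("Capture", 1250),
    ("Clean up", 1000),
    ("Library", 900),
]

NG_PER_COPY = 12 / 4000  # slide: 12 ng of DNA for 4000 genome copies


def required_coverage(allele_frequency, reads_per_consensus=4):
    """Coverage needed so a variant at this frequency yields one consensus family."""
    return reads_per_consensus / allele_frequency


def overall_yield(steps):
    """Fraction of input copies that survive to the final library."""
    return steps[-1][1] / steps[0][1]


def required_input_ng(target_copies, steps):
    """Input mass needed to end with target_copies, assuming linear scaling."""
    return target_copies / overall_yield(steps) * NG_PER_COPY


if __name__ == "__main__":
    print(required_coverage(0.001))
    print(overall_yield(STEPS))
    print(required_input_ng(4000, STEPS))
```

With only 22.5% of the copies surviving library prep, the ideal 12 ng input would need to grow several-fold under this linear assumption, which is the slide's conclusion in numeric form.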

28 Analysis Pipelines other than SureCall
For customers with established bioinformatics pipelines, Agilent provides two separate Java programs in AGeNT (Agilent Genomics NextGen Toolkit) that can be integrated into your pipelines: SureCallTrimmer and LocatIt
- SureCallTrimmer is called before alignment and handles adapter trimming (on both ends), trims low quality bases, and masks enzyme footprints
- SureCallTrimmer is important for HaloPlex and HaloPlex HS data not processed in SureCall
- MBC reads are found in a third FASTQ file
- Generation of consensus reads occurs after alignment by examining all reads that align to the same location (chr, start, stop) and share the same molecular barcode
- LocatIt handles MBC after alignment: consensus reads, filtering based on MBC, optical deduplication, etc.
- Must have BAM file and MBC FASTQ
- Tools for bioinformaticians capable of developing and debugging pipelines

29 Other Types of NGS Analyses: Non-human or Other Type of Sequencing (RNA-, small RNA-, Methyl-, MeDIP-, or ChIP-Seq)
- SureCall performs DNA analysis for human (hg19) data only
- StrandNGS can align DNA, RNA, and small RNA using its own aligner or accept BAM or SAM inputs
- Workflows for DNA-Seq, RNA-Seq, small RNA-Seq, Methyl-Seq, MeDIP-Seq, and ChIP-Seq using algorithms specific to experiment type
- Powerful QC tools
- Extensive filtering options
- Pathway, GO analysis, clustering
- StrandNGS pipelines now available
- License fee


31 Enabling clinical analysis of genomic data
- Enables the interpretation, reporting, and sharing of genomic variants
- Manage increasing volumes of data and reduce turnaround time
- Draft clinical grade lab reports (FDA Class 1 Medical Device)
- Analyzed CGH and NGS data accepted as input
- Rebranded as Alissa Interpret

32 How Cartagenia Works
Software as a Service
- Scalable
- Secure
- Cost effective
Content is key! Knowledge Integration:
- Over 100 public and private data sources
- Institution specific repositories
- Sharing across private and public consortia
- Partnerships (Alamut, HGMD, OncoMD, CollabRx, N-of-1)
Setting and Adopting Standards
- Adapting to diagnostic standards
- ISO9001 and ISO13485 certified
- Registered as Medical Device in US, Canada and Europe
Support
- A fully-serviced solution
- Adapted to your needs, specialization and deadlines

33 Benefits of Cartagenia Bench
Efficient
- Productivity through Automation
- Standardization
- Knowledge Integration
Robust
- Validation
- Versioning
- Security & control
- High quality support
Easy to use
- Co-designed with you and your peers
- Integrated with lab and hospital IT
Clinical grade
- ISO Certification
- Class I medical device

34 Agilent Alissa Vision: from raw data to report
Make your work flow with Agilent Alissa Clinical Informatics for NGS
- One single platform from raw reads to lab reports
- Comprehensive QC metrics at your fingertips
- Alissa Interpret is a Class I medical device (CGH and NGS)
- Alissa Align & Call (RUO, future release)

35 Bonus Content: Index Hopping

36 Index Hopping (Illumina Sequencers)
- Incorrect assignment of reads to a different sample
- Occurs in multiplexed samples
- Frequency is higher on patterned flow cells (ExAmp chemistry) but still occurs in bridge amplification chemistry
- Multiple causes (index contamination, sample contamination, post-capture PCR mispriming, excess adapters, overclustering)
- Detection is best done during demultiplexing, when data from all samples is available
- Illumina's recommendations

37 Observed index hopping rate using XT HS: HiSeq4000 vs. HiSeq2500
- We see an average hop rate** of 2.9% with HiSeq4000 (newer patterned flowcell).
- On HiSeq2500 (older non-patterned flowcell), we see an average hopping rate of %.
Figure: index hopping rate (axis 0.0% to 3.5%) for HiSeq4000 vs. HiSeq2500; diagram labels: P5, MBC, Insert1, Index1, Index2, P7
**: Hop rate = hopped reads / total reads
Libraries are prepared and enriched individually, so the hopping observed has occurred at sequencing level
October 23, 2017

38 What does this mean for your application*?
For pooling of samples from the same germline application using XT low input or SureSelect XT, XT2:
- Assuming <5% alleles are not called
- Customers should not be concerned about index hopping
For pooling of samples from the same somatic application using XT HS:
- Variant calls with >5% alleles are likely not due to index hopping
- Variant calls with <5% alleles might be impacted by index hopping
For heterogeneous pooling across applications, or of samples across species, single cell, microbiome, viral, RNA expression, etc.:
- Variant calls are possibly impacted by index hopping
Consider index hopping risks when determining what samples to pool for sequencing
*: Index misassignment discussed here is limited to hopping at sequencing level; HiSeq 2500 data suggest other sources of misassignment, such as index purity, are insignificant by comparison

39 Index Hopping: Physical Corrections
- Use one sample (exome) per lane
- Do not use pre-capture pooling, as PCR of multiplexed samples may misprime and cause index hopping
- Pool libraries right before sequencing and sequence the pooled library as soon as possible
- Freeze pooled libraries at -20 °C
- Remove as many free adapters and PCR primers as possible; do a second bead cleanup if you see a small MW blip
- If the sample barcode is comprised of dual indexes, do not use all possible combinations of indices, so that illegal combinations can be detected and removed
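The last recommendation (leave some dual-index combinations unused so that illegal pairs can be detected) can be illustrated with a small Python sketch. The index sequences, sample names, and function name below are made up for the example; in practice this check is done by the demultiplexing software.

```python
def split_by_index_pair(reads, expected_pairs):
    """Assign reads to samples and set aside reads with an illegal index pair.

    reads: iterable of (read_id, i7_index, i5_index) tuples.
    expected_pairs: dict mapping (i7_index, i5_index) to a sample name.
    Returns (assigned, illegal), where assigned maps sample -> list of read ids
    and illegal is a list of read ids whose index pair was never pooled.
    """
    assigned = {}
    illegal = []
    for read_id, i7, i5 in reads:
        sample = expected_pairs.get((i7, i5))
        if sample is None:
            illegal.append(read_id)  # likely hopped or contaminated
        else:
            assigned.setdefault(sample, []).append(read_id)
    return assigned, illegal


if __name__ == "__main__":
    # Hypothetical sample sheet: only two of the four possible pairs are used.
    sheet = {("ATCACG", "TTAGGC"): "sample_A", ("CGATGT", "TGACCA"): "sample_B"}
    reads = [
        ("r1", "ATCACG", "TTAGGC"),
        ("r2", "CGATGT", "TGACCA"),
        ("r3", "ATCACG", "TGACCA"),  # i7 of A with i5 of B: illegal combination
    ]
    print(split_by_index_pair(reads, sheet))
```

If every combination were in use, the read `r3` would be silently assigned to some sample, which is why the slide advises against using all possible combinations.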

40 XT HS molecular barcode thresholding: bioinformatically remove hopped reads
Diagram labels: Fragments with multiple reads (same MBC); Fragments with single read; (All colors): molecular barcode; Good reads; Hopped reads
1. The vast majority of hopped reads have just 1 read, regardless of sequencing depth
2. One way to minimize the impact of hopped reads is to remove all single reads (MBC thresholding). There is no error correction utilizing MBC with these reads anyway.
3. This will work well for low allele frequency applications where error correction with MBC is needed.
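MBC thresholding as described above amounts to dropping barcode families that contain a single read. A minimal Python sketch, assuming reads are tuples keyed by position and barcode (the data layout and function name are hypothetical, not SureCall's implementation):

```python
from collections import Counter


def mbc_threshold(reads, min_reads=2):
    """Keep only reads whose (chrom, start, stop, mbc) family has enough members.

    reads: list of (chrom, start, stop, mbc, sequence) tuples.
    min_reads=2 corresponds to the MBC2+ setting: singletons are removed.
    """
    family_size = Counter(read[:4] for read in reads)
    return [read for read in reads if family_size[read[:4]] >= min_reads]


if __name__ == "__main__":
    reads = [
        ("chr1", 100, 250, "AACCGGTTAC", "ACGT"),
        ("chr1", 100, 250, "AACCGGTTAC", "ACGT"),
        ("chr1", 100, 250, "TTGGCCAATG", "ACGT"),  # singleton family
    ]
    print(mbc_threshold(reads))
```

Setting `min_reads=1` keeps everything, which corresponds to the MBC1+ setting.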

41 How effective is MBC thresholding on HiSeq4000?
Figure: % hop rate (hopped reads/total reads) with vs. without MBC threshold, MBC (1+) vs. MBC (2+)
- MBC2+ means all single MBC reads are filtered out, i.e. MBC thresholding
- MBC thresholding results in a 10x reduction in hop rate, from an average of 2.9% to 0.3%, close to the observed hopping level on HiSeq 2500
- Each data point is the average hop rate of 2-3 HiSeq4000 runs per given sample. Data include 3-plex, 4-plex and 8-plex lane runs

42 Impact of MBC Thresholding on XT HS Sensitivity
MBC thresholding on HiSeq 4000, while removing a significant amount of sequencing data, shows little to no negative impact on assay sensitivity.
Metric | HiSeq2500 MBC1+ | HiSeq4000 MBC1+ | HiSeq2500 MBC2+ | HiSeq4000 MBC2+
Sensitivity | 88.75% | 88.75% | 87.50% | 88.75%
Specificity | 99.93% | 99.86% | 99.97% | 99.97%
Precision (PPV) | 55.47% | 40.34% | 74.47% | 74.74%
Other table rows: Expected; >2% known Variants; <=2% known Variants*; false positive (or unknown Variants)**; Total
- Without thresholding, false positive rates are significantly higher with HiSeq 4000 (low specificity)
- 77kb panel, 10ng input, 10,000X sequencing depth
- HiSeq 4000 with MBC thresholding: comparable sensitivity and specificity to HiSeq 2500***
*: 2 of 21 have an expected frequency of 1-2%; both are detected. The remaining 19 are <=1%
**: True variant calls are based on Genome in a Bottle. The false positive count could include unknown true variants.
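For reference, the three metrics in the table follow the standard definitions, sketched below in Python. The counts used in the example are hypothetical, chosen only for illustration; the counts behind the slide's percentages are not given in the transcription.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard definitions of sensitivity, specificity and precision (PPV).

    tp: true variants that were called      fn: true variants that were missed
    fp: calls that are not known variants   tn: positions correctly not called
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


if __name__ == "__main__":
    # Hypothetical counts, not taken from the presentation.
    metrics = classification_metrics(tp=71, fp=57, tn=77000, fn=9)
    for name, value in metrics.items():
        print(f"{name}: {value:.2%}")
```

Because true negatives vastly outnumber calls on a panel, a small drop in specificity corresponds to a large drop in PPV, which is the pattern the table shows between MBC1+ and MBC2+.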

43 MBC Thresholding for Hopped Reads in SureCall: MBC2+ is set by default

44 SureCall Estimated Hopping Frequency
- New parameter reduces noise generated by sample index cross-contamination
- Default setting is 0.005 (0.5%)
- Range is 0 to 0.1 (0 to 10%)
How SureCall uses this parameter:
1) Calculate the number of variant reads that could be caused by index hopping = average coverage of each region x estimated index hopping frequency
2) Based on the read number from the first step, SureCall calculates the probability that a given variant call is real or is noise caused by index hopping, and filters out such mutations
Estimate the value by comparing SureCall allele frequencies with known allele frequencies, from data for the particular sequencer used (higher in patterned flow cells), or from past experience
Figure label: Number of variants that might be due to index hopping
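The two steps can be sketched in Python. Step 1 is the product stated on the slide. For step 2 the slide does not say which probability model SureCall uses, so the binomial tail below is purely an assumption made for illustration, and the function names are hypothetical.

```python
import math


def expected_hopped_reads(avg_coverage, hop_frequency=0.005):
    """Step 1: reads per region that could be explained by index hopping."""
    return avg_coverage * hop_frequency


def _log_binom_pmf(k, n, p):
    # Log-space binomial probability to avoid overflow at high coverage.
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )


def prob_hopping_explains(alt_reads, coverage, hop_frequency=0.005):
    """Step 2 (assumed model): P(X >= alt_reads), X ~ Binomial(coverage, hop_frequency).

    A small value means hopping alone is unlikely to produce this many variant
    reads. Intended for hop_frequency in the slide's range (0 to 0.1).
    """
    if alt_reads <= 0:
        return 1.0
    if hop_frequency <= 0:
        return 0.0
    below = sum(
        math.exp(_log_binom_pmf(k, coverage, hop_frequency))
        for k in range(alt_reads)
    )
    return max(0.0, 1.0 - below)


if __name__ == "__main__":
    print(expected_hopped_reads(10000))
    print(prob_hopping_explains(30, 10000))
    print(prob_hopping_explains(500, 10000))
```

At 10,000x coverage and the default frequency, about 50 reads per region could come from hopping, so a variant supported by only 30 reads is hard to separate from hopping noise, while one supported by 500 reads is not plausibly explained by it under this model.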

45 Optical Duplicates
- These are only a problem for HiSeq 2500/MiSeq/NextSeq data. They come from large clusters being called incorrectly as two separate clusters by Illumina's RTA software.
- On a 2500: some clusters are either too big or their shape does not conform to the model, and they get counted as 2+ clusters.
- On a 4000: during amplification on the flow cell, one of the local duplicates that are part of a growing cluster breaks free, goes on to seed a new nanowell, and starts a cluster of its own nearby. After analysis, these two nanowells show the very same data: sequence and MBC. The geographical coordinates are close to each other.
- SureCall uses geographical location (tile) for optical deduplication before MBC deduplication
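A minimal Python sketch of location-based optical de-duplication follows. It assumes the standard Illumina read name layout (instrument:run:flowcell:lane:tile:x:y) and flags reads that share lane, tile, sequence, and MBC and lie within a pixel distance of an already kept read. The distance threshold and function names are hypothetical; the slide does not describe SureCall's actual criteria beyond its use of tile location.

```python
from collections import defaultdict


def parse_position(read_name):
    """Extract (lane, tile, x, y) from an Illumina-style read name."""
    fields = read_name.lstrip("@").split()[0].split(":")
    return fields[3], fields[4], int(fields[5]), int(fields[6])


def optical_duplicates(reads, max_dist=100):
    """Return the names of reads flagged as optical duplicates.

    reads: iterable of (read_name, sequence, mbc) tuples.
    max_dist: hypothetical pixel threshold for "close to each other".
    """
    groups = defaultdict(list)
    for name, seq, mbc in reads:
        lane, tile, x, y = parse_position(name)
        groups[(lane, tile, seq, mbc)].append((name, x, y))

    flagged = set()
    for members in groups.values():
        kept = []
        for name, x, y in members:
            near = any(
                abs(x - kx) <= max_dist and abs(y - ky) <= max_dist
                for _, kx, ky in kept
            )
            if near:
                flagged.add(name)  # same data, same tile, neighbouring position
            else:
                kept.append((name, x, y))
    return flagged


if __name__ == "__main__":
    reads = [
        ("@M1:7:FC1:1:1101:1000:2000", "ACGT", "AACCGGTTAC"),
        ("@M1:7:FC1:1:1101:1040:2030", "ACGT", "AACCGGTTAC"),  # nearby copy
        ("@M1:7:FC1:1:1101:9000:9000", "ACGT", "AACCGGTTAC"),  # same tile, far away
        ("@M1:7:FC1:1:2205:1000:2000", "ACGT", "AACCGGTTAC"),  # different tile
    ]
    print(optical_duplicates(reads))
```

Reads that match in sequence and MBC but sit far apart or on another tile are left for the later MBC de-duplication step, in line with the order stated on the slide.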
