Jenny Gu, PhD Strategic Business Development Manager, PacBio

IDT and PacBio joint presentation
Characterizing Alzheimer's Disease candidate genes and transcripts with targeted, long-read, single-molecule sequencing

September 27, 2017 / Jenny Gu, Ph.D.
For Research Use Only. Not for use in diagnostics procedures. Copyright 2017 by Pacific Biosciences of California, Inc. All rights reserved.

AGENDA
-SMRT Sequencing technology overview
-Recommended IDT capture workflow for SMRT Sequencing
-Case Study: Alzheimer's Disease panel

ALZHEIMER'S DISEASE (AD)
Alzheimer's disease is the most common form of neurodegenerative dementia.
-People living with dementia worldwide: 46.8M in 2015, projected to reach 131.5M by 2050
-Clinical characterization: progressive loss of memory and deficits in thinking, problem solving, and language
-Neuropathological characterization: progressive cortical atrophy due to neuronal loss and characteristic intracellular and extracellular deposits of insoluble tau and amyloid β proteins
Sources: https://www.alz.co.uk/research/worldalzheimerreport2015.pdf
http://www.reverseagingcentre.com/media/links/signs-ofalzheimers/

ALZHEIMER'S DISEASE (AD)
The complex genetic makeup of AD
-Genetically divided into two different groups: early-onset and late-onset
-Relative risk for first-degree relatives is 3.5-7.5
-30-48% of AD patients have an affected first-degree relative
Early-onset AD:
-For 2-10% of patients, first symptoms occur in their 20s or 30s
-Four genes account for 5-10% of early-onset AD: APP, PSEN1, PSEN2, APOE
Late-onset AD:
-Manifests after 65 years
-Multifactorial with strong genetic predisposition
-GWAS have identified 20+ genetic risk loci with small odds ratios (1.1-2.0 per risk allele), including common functional variants as well as rare and structural variants

CANDIDATE DISEASE GENES IN ALZHEIMER'S DISEASE (AD)
-The search for risk genes in Alzheimer's disease has lasted several decades
-Many associated genetic loci contain several genes
-Which candidates are involved in disease risk remains unclear (20+ genes)
Strategies for assessing GWAS candidate genes:
-DNA sequencing
-Transcriptome sequencing
-Proteome studies
-Methylome studies
Cuyvers E. et al. (2016) Genetic variations underlying Alzheimer's disease: evidence from genome-wide association studies and beyond. Lancet Neurol. 15(8), 857-68.

SEQUEL SYSTEM
Typical Performance
-Average read length: 10-18 kb
-Consensus accuracy: achieves QV50
-Throughput per cell: 5-8 Gb
-SMRT Cells per run: 1-16
-Movie lengths: 30 minutes to 10 hours

TYPICAL DATA
Read lengths >20 kb; data per SMRT Cell: 5-8 Gb
-Half of data in reads >20 kb
-Top 5% of reads >35 kb
-Maximum read lengths >60 kb
[Figure: read length distribution; x-axis: read length (bp), y-axis: reads (#)]
Read length data shown from a 30 kb size-selected human library on the Sequel System (10-hour movie, 2.0 chemistry) with a total output of 7.6 Gb. Each Sequel System SMRT Cell 1M generates ~365,000 reads.
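The statement "half of data in reads >20 kb" describes a read-length N50 above 20 kb. As a minimal sketch of how that statistic is computed, the Python below uses a hypothetical helper name (`read_length_n50`) and toy read lengths that are not the data behind the slide.

```python
from typing import Iterable


def read_length_n50(lengths: Iterable[int]) -> int:
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no read lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths; kept as a safeguard


# Toy read lengths in bp (hypothetical, for illustration only).
toy_lengths = [35_000, 22_000, 21_000, 12_000, 8_000, 5_000, 2_000]
print(read_length_n50(toy_lengths))
```

An N50 weights each read by its length, which is why it can sit well above the average read length quoted for the instrument.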

BENEFITS OF LONG-READ SEQUENCING FOR CHARACTERIZING GENOMIC STRUCTURAL VARIATION
-Structural variation (SV) is an important contributor to human diversity and disease
-SV is also difficult to characterize
[Figure: Example SV Types and Mechanisms]
Targeted SMRT Sequencing allows scientists to directly characterize, at high coverage for specific genes or regions of interest across multiple samples:
-Complete genes (introns & exons)
-Phased variants (allelic haplotypes)
-Repetitive regions
-Regulatory regions (upstream/downstream)
-Insertions & deletions
-Copy number variations
Mechanisms underlying structural variant formation in genomic disorders. Carvalho CM et al. Nat Rev Genet. (2016)

GENETIC VARIATION SEQUENCING WITH SMRT SEQUENCING
[Figure: variant types arranged by size of variant; axis: 1, 10, 100, 1 kb, 10 kb, 100 kb, 1 Mb, 10 Mb, 100 Mb]
-Variant type: SNPs; small indels; phasing; STRs & VNTRs; mobile elements; large insertions, deletions
-One PacBio read spans most variants: indels; phased alleles; repeat expansions; L1, Alu, SVA; copy number variation; structural variants
-Assembled PacBio reads span euchromatic genome variation: phasing (SNVs and SVs); haplotype reconstruction; complex variants; medium to large SVs; inversions / translocations; haplotypes; large structural rearrangement

ADDITIONALLY CHARACTERIZE TRANSCRIPTOME SPLICE VARIATION WITH LONG-READ SEQUENCING
-Proteins and their functions are not only impacted by variants in exonic regions
-Variants in regulatory regions (enhancers/promoters, including methylation) and intronic regions can also play an important role
-High transcript isoform diversity from alternative splicing
-Obtain full-length transcript sequences with Iso-Seq analysis
National Human Genome Research Institute. Bioinformatics: Finding genes. (2013) http://www.genome.gov/25020001

TRACE VARIANTS TO SPECIFIC ALLELES WITH PHASED HETEROZYGOUS SNPS

CASE STUDY: VARIANT SCREENING IN ALZHEIMER'S DISEASE WITH LONG-READ SEQUENCING
-Genomic and transcriptomic (cDNA) capture experiment
-Combined data provide better insight on variant-affected gene expression
-Gene panel applied to two AD patients (35 candidate genes):
-Average gDNA fragment size: ~6 kb
-Full-length transcripts ranging from <1 kb to ~10 kb

PACBIO TARGETED PROBE-BASED CAPTURE WORKFLOW (GENOMIC DNA CAPTURE)
EXPERIMENTAL PIPELINE (steps 1-8)
-Genomic DNA
-Shear to 7 kb (6 kb for multiplex)
-Ligate barcoded adapters
-Amplification
-Size selection (5-9 kb)
-Probe hybridization, bead capture, wash
-Amplification and SMRTbell prep. + size selection (5-9 kb)
-Sequencing
-Analysis
INFORMATICS PIPELINE (steps 9-13)
-Map reads of insert to reference
-Phasing with SAMtools
-Bin reads by haplotype
-Phased allelic consensus sequence
-Tertiary analysis

BEST PRACTICE SUMMARY: GENOMIC CAPTURE
-Save on project costs by multiplexing and spacing probes up to 1 kb apart.
-Multiplex up to 12 samples.
-Use PacBio linear barcoded adapters.
-High molecular weight DNA is required.
-Size selection is highly recommended to maximize long-read recovery.
-Aim for 100-fold coverage of the targeted panel size (full-length gene coverage).
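The coverage and multiplexing guidance above implies a simple yield estimate. The sketch below is a back-of-the-envelope calculation only: the function name, the 2 Mb panel size, and the 50% on-target fraction are all assumptions made for illustration and do not come from the slides.

```python
def required_yield_gb(panel_size_bp: int, fold_coverage: float,
                      n_samples: int, on_target_fraction: float) -> float:
    """Estimate total sequencing yield (Gb) needed to reach a target coverage.

    on_target_fraction is the share of sequenced bases that map to the
    panel; it is an assumed input, not a value reported in the deck.
    """
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    on_target_bases = panel_size_bp * fold_coverage * n_samples
    return on_target_bases / on_target_fraction / 1e9


# Hypothetical example: 2 Mb panel, 100-fold coverage, 12 multiplexed samples,
# and an assumed 50% of bases on target.
estimate = required_yield_gb(2_000_000, 100, 12, 0.5)
print(f"{estimate:.1f} Gb")
```

Comparing such an estimate with the 5-8 Gb typical throughput per SMRT Cell gives a rough idea of how many cells a multiplexed pool would need.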

AD SAMPLES: SHEARED GDNA QC
-Recommend starting with HMW gDNA (2 µg)
-10 kb shear

SMRTBELL LIBRARY QC (SIZE-SELECTED)
Final library size selected

GRCh38 SUBREAD MAPPING RESULTS
-Skeletal muscle: 7.4 GB, 2.2 M reads
-Brain: 8.4 GB, 2.5 M reads

PACBIO TARGETED PROBE-BASED CAPTURE WORKFLOW (TRANSCRIPTOME WITH SIZE SELECTION)
EXPERIMENTAL PIPELINE (steps 1-8)
-mRNA
-cDNA library
-Amplification + barcodes (optional)
-Size selection (5-9 kb)
-Probe hybridization, bead capture, wash
-Amplification and SMRTbell prep.
-Sequencing
-Analysis
INFORMATICS PIPELINE (steps 9-10)
-Iso-Seq analysis
-Tertiary analysis

BEST PRACTICE SUMMARY: CDNA CAPTURE
-Recover high-quality RNA transcripts.
-Size selection is optional, but helpful for specific fractions.
-Targeted capture with Iso-Seq analysis is recommended to characterize splice isoforms.
-Not recommended for characterizing gene expression levels.
-Aim for a minimum of 30-fold coverage per anticipated splice isoform in samples.
-Probes can be designed to exons only and/or to include introns.

AD SAMPLES: MRNA QC
-Temporal lobe 1 RNA: RIN = 8.0
-Temporal lobe 2 RNA: RIN = 8.1
-Recommend RIN > 6 (RNA Integrity Number)

EXAMPLE WHOLE TRANSCRIPTOME SMRTBELL LIBRARY (CDNA)

DESIGNING CUSTOM IDT XGEN LOCKDOWN CAPTURE PANEL
-Key benefit of xGen Lockdown Probes is flexibility in design
-Do not need to redesign existing probe panels
-However, recommend full-gene design by including introns and exons, plus extra upstream and downstream sequences
-Probes can be spaced up to 1000 bp apart
-Use the same probes for genomic and cDNA capture
[Figure: FULL-GENE DESIGN; Gene A, Gene B]
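To make the full-gene design guidance concrete, here is a minimal sketch of tiling probe positions across a gene plus flanking sequence. It is not IDT's design tool. The function name, the 120 nt probe length, the 1000 bp flanks, the reading of "spaced up to 1000 bp apart" as the gap between adjacent probes, and the example coordinates are all assumptions for illustration.

```python
from typing import List, Tuple


def tile_probes(gene_start: int, gene_end: int, flank: int = 1000,
                probe_len: int = 120, max_gap: int = 1000) -> List[Tuple[int, int]]:
    """Place probes over a gene plus flanks with gaps no larger than max_gap.

    Coordinates are 0-based, half-open (BED-style). probe_len and flank are
    assumed defaults, not values taken from the slides.
    """
    region_start = max(0, gene_start - flank)
    region_end = gene_end + flank
    if region_end - region_start < probe_len:
        raise ValueError("region is shorter than one probe")
    step = probe_len + max_gap
    probes = []
    pos = region_start
    while pos + probe_len <= region_end:
        probes.append((pos, pos + probe_len))
        pos += step
    # Add a final probe flush with the region end if the tiling stopped short.
    if probes[-1][1] < region_end:
        probes.append((region_end - probe_len, region_end))
    return probes


# Hypothetical gene at 10,000-15,000 on some contig.
example = tile_probes(10_000, 15_000)
print(len(example), example[0], example[-1])
```

Spacing probes this way covers introns and flanking regulatory sequence with far fewer probes than end-to-end tiling, which is the cost saving the slide refers to.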

SNPs AND LARGER SVs DISCOVERED IN AD SAMPLES
STUDY RESULTS:
Detected broad range of genomic variants (SNPs and SVs):
-31 unique SVs ranging from 65 bp to several kb in size
500+ isoforms found in each patient:
-Patient 1: 515 isoforms
-Patient 2: 507 isoforms
88% novel splice isoforms identified:
-Only 39 isoforms shared among both patients and those reported in Gencode v25
[Figure: Venn diagram of isoforms in Patient 1, Patient 2, and Gencode v25; region counts as extracted: 67, 3, 2, 39, 154, 312, 319]
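The Venn diagram counts on this slide can be checked against the per-patient totals with a few lines of arithmetic. The assignment of each count to a Venn region below is an inference, chosen because it reproduces the stated totals of 515 and 507 isoforms; the slide text itself does not label the regions, so treat the mapping as an assumption.

```python
# Assumed mapping of extracted counts to Venn regions (inferred, not labeled in the source).
regions = {
    ("P1",): 319,                   # Patient 1 only
    ("P2",): 312,                   # Patient 2 only
    ("P1", "P2"): 154,              # both patients, not in Gencode
    ("P1", "GENCODE"): 3,           # Patient 1 and Gencode only
    ("P2", "GENCODE"): 2,           # Patient 2 and Gencode only
    ("P1", "P2", "GENCODE"): 39,    # shared by all three
    ("GENCODE",): 67,               # Gencode only
}


def total_in(set_name: str) -> int:
    """Sum every region that belongs to the named set."""
    return sum(n for members, n in regions.items() if set_name in members)


patient1_total = total_in("P1")
patient2_total = total_in("P2")
novel = sum(n for members, n in regions.items() if "GENCODE" not in members)
novel_fraction = novel / sum(regions.values())

print(patient1_total, patient2_total, round(novel_fraction * 100))
```

Under this mapping the novel fraction is computed over all isoforms in the diagram, including those seen only in Gencode, which is the denominator that matches the 88% figure.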

RIN3 GENE: ~50 bp INSERTION DETECTED

ZCWPW1 GENE: ~750 bp DELETION DETECTED IN BOTH PATIENTS
[Figure panels: Patient 1, Patient 2]

BACE1 GENE: PHASED ALLELES (34 KB)
Heterozygous SNPs can be used to phase alleles across multi-kilobase regions
[Figure tracks: Phase 0, Phase 1, Gene, Probes, Target, Phased SNPs]
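The idea of using heterozygous SNPs to sort long reads into two phases can be illustrated with a toy example. This is not the SAMtools-based pipeline used in the study. It is a simplified majority-vote sketch in which the function name, SNP positions, and alleles are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def assign_phase(read_alleles: Dict[int, str],
                 het_snps: Dict[int, Tuple[str, str]]) -> Optional[int]:
    """Assign a read to phase 0 or 1 by majority vote over heterozygous SNPs.

    read_alleles maps reference position -> base observed in the read.
    het_snps maps position -> (phase 0 allele, phase 1 allele).
    Returns None when the read is uninformative or the vote is tied.
    """
    votes: Counter = Counter()
    for pos, (allele0, allele1) in het_snps.items():
        base = read_alleles.get(pos)
        if base == allele0:
            votes[0] += 1
        elif base == allele1:
            votes[1] += 1
    if votes[0] == votes[1]:
        return None
    return 0 if votes[0] > votes[1] else 1


# Hypothetical heterozygous SNPs and reads.
het_snps = {100: ("A", "G"), 5200: ("C", "T"), 9800: ("G", "A")}
reads = {
    "read1": {100: "A", 5200: "C"},
    "read2": {5200: "T", 9800: "A"},
    "read3": {100: "A", 5200: "T"},
}
phases = {name: assign_phase(alleles, het_snps) for name, alleles in reads.items()}
print(phases)
```

Because a single multi-kilobase read often covers several heterozygous SNPs, overlapping reads can chain those SNPs into one phase block spanning the whole gene.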

BIN1 GENE: PHASED ALLELES (63 KB)
Heterozygous SNPs can be used to phase alleles across multi-kilobase regions
[Figure tracks: Phase 0, Phase 1, Gene, Probes, Target, Phased SNPs]

MAPT GENE RESULTS FOR PATIENT 1
Heterozygous genomic variants can be linked to corresponding expressed transcripts
MAPT gene results:
-Detected a heterozygous deletion
-One allele is transcribed into 21 isoforms and the other only into 5
-Detected a novel exon and transcript
[Figure labels: 21 isoforms, 5 isoforms]

ZCWPW1 GENE: RETAINED INTRONS AND NEW EXONS
[Figure labels: Novel exon, Retained intron; panels: Patient 1, Patient 2]

CONCLUSION
-AD has a large economic impact on global society (2010: $604B)
-To date, 20+ putative genetic risk variants have been mapped
-Associated SNPs are usually not the true causative variant
-Combining gDNA and cDNA data is more informative
-Custom IDT xGen Lockdown Panels allow flexibility to scale projects
-SMRT Sequencing provides multi-kilobase phased alleles and full-length transcripts
http://www.mvcenters.com/2015/02/11/dementiatakes-toll-claims-another-american-great-dean-smith/
Roses A.D. et al. (2016) Structural variants can be more informative for disease diagnostics, prognostics and translation than current SNP mapping and exon sequencing. Expert Opin Drug Metab Toxicol. 12(2), 135-47.

ACKNOWLEDGEMENT
Kevin Eng, Ting Hon, Elizabeth Tseng, Aaron Wenger, William Rowell, Jenny Ekholm, Steve Kujawa, Kristina Giorda, Jiashi Wang, Mirna Jarosz
Visit the PacBio Blog for new announcements and updates on Targeted Sequencing!
http://www.pacb.com/blog
http://www.pacb.com/applications/targeted-sequencing/
Feel free to contact Jenny Gu (jgu@pacb.com)

www.pacb.com
Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Advanced Analytical Technologies. xGen and Lockdown are trademarks of Integrated DNA Technologies, Inc. All other trademarks are the sole property of their respective owners.

gDNA Capture Supplemental Information

PACBIO POLYMERASE READS
[Figure panels: Skeletal muscle, Brain]

SMRT LINK PROVIDES BASIC PROCESSING OF RAW DATA FOR TARGETED CAPTURE ENRICHMENT STUDIES
SMRT Analysis produces:
-Filtered subreads
-Circular consensus sequences
-Alignment to reference (BAM files)
-Iso-Seq full-length transcripts

BIOINFORMATICS WORKFLOW FOR PHASING ALLELES
1. Raw data
2. SMRT Link
3a. CCS reads
3b. Subreads
4. SMRT Link
5. Aligned BAM file (visualize in IGV 3.0)
6. Probe *.bed
7. capture2target.py
8. Defined phase blocks
9. samtools: subset and phase
10. Phased alleles/region
11. cmdline: PacBio arrow (polish data)
12. Phased consensus sequences (*.fasta), >99.9% accuracy (dependent on coverage)
Legend: SMRT Link; command line tools; third party software
Github: Targeted phasing consensus (genomic capture)
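The workflow goes from a probe BED file to defined target regions before phasing. The slide does not describe what capture2target.py does internally, so the sketch below is not that script. It shows one plausible approach, assumed for illustration: merging nearby probe intervals into contiguous target regions. The function name, the 1000 bp merge distance, and the coordinates are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]


def merge_probe_intervals(bed_lines: Iterable[str], max_gap: int = 1000) -> List[Interval]:
    """Merge probe intervals on the same chromosome separated by at most max_gap bp."""
    probes: List[Interval] = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        probes.append((chrom, int(start), int(end)))
    probes.sort()
    merged: List[Interval] = []
    for chrom, start, end in probes:
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical probe coordinates in BED format (0-based, half-open).
example_bed = [
    "chr1\t1000\t1120",
    "chr1\t2000\t2120",
    "chr1\t9000\t9120",
    "chr2\t500\t620",
]
targets = merge_probe_intervals(example_bed)
print(targets)
```

Each merged region could then be passed as a region string to the subsetting and phasing steps, so that reads are phased per target rather than genome-wide.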